I'm doing some research on databases and I'm looking at some of the limitations of relational databases.

I'm told that joins on large tables are very expensive, but I'm not entirely sure why. What does the DBMS have to do to perform a join, and where is the bottleneck? How does denormalization help overcome this cost? How do other optimization techniques (indexing, for example) help?

Personal experiences are welcome! If you're going to post links to resources, please avoid Wikipedia. I already know where to find that.

Related to this, I'm wondering about the denormalized approaches used by cloud-service databases like BigTable and SimpleDB. See this question.


Current answer

What most commenters fail to note is the wide range of join methodologies available in a sophisticated RDBMS, and the denormalizers invariably gloss over the higher cost of maintaining denormalized data. Not every join is based on indexes, and databases have a lot of optimized algorithms and methodologies for joining that are intended to reduce join costs.

In any case, the cost of a join depends on its type and a few other factors. It doesn't have to be expensive at all. Some examples:

A hash join, in which bulk data is equijoined, is very cheap indeed, and the cost only becomes significant if the hash table cannot be cached in memory. No index is required. Equi-partitioning between the joined data sets can be a great help.

The cost of a sort-merge join is driven by the cost of the sort rather than the merge; an index-based access method can virtually eliminate the cost of the sort.

The cost of a nested loop join on an index is driven by the height of the b-tree index and the access of the table block itself. It's fast, but not suitable for bulk joins.

A nested loop join based on a cluster is much cheaper, with fewer logical IOs required per join row. If the joined tables are both in the same cluster, then the join becomes very cheap through the colocation of joined rows.
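To make the first of those concrete, here is a toy sketch of a hash join in Python (purely illustrative; the table data and key names are invented, and a real engine does this over pages and partitions, not dicts): build a hash table on the smaller input, then probe it with the larger one, with no index involved.

```python
# Toy hash join: build a hash table on the smaller input, probe with the larger.
# Tables and key names below are hypothetical, for illustration only.
from collections import defaultdict

def hash_join(build_rows, probe_rows, build_key, probe_key):
    # Build phase: index the smaller input by its join key.
    # This is the part that must fit in memory for the join to stay cheap.
    table = defaultdict(list)
    for row in build_rows:
        table[row[build_key]].append(row)
    # Probe phase: one hash lookup per probe row -- no index required.
    for row in probe_rows:
        for match in table.get(row[probe_key], []):
            yield {**match, **row}

customers = [{"cust_id": 1, "name": "Ann"}, {"cust_id": 2, "name": "Bob"}]
orders = [{"order_id": 10, "cust_id": 2}, {"order_id": 11, "cust_id": 1}]
joined = list(hash_join(customers, orders, "cust_id", "cust_id"))
```

Note that the build side is chosen to be the smaller input precisely so the hash table stays cacheable, which is the condition the paragraph above names.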

Databases are designed for joining; they are very flexible in how they do it, and generally perform very well unless the join mechanism gets it wrong.

Other answers

The order in which you join your tables matters a lot. If you have two sets of data, try to build the query so the smallest set is used first, to reduce the amount of data the query has to work on.

For some databases this doesn't matter; MS SQL, for example, knows the proper join order most of the time. For some (like IBM Informix), the order makes all the difference.
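One way to see why order matters is a textbook cost model for a block nested-loop join (a generic model, not any specific DBMS, and the page counts below are made up): the outer input is read once, but the inner input is rescanned once per buffer-load of the outer, so putting the smaller input on the outside cuts the total page I/O.

```python
import math

def block_nested_loop_ios(outer_pages, inner_pages, buffer_pages):
    # Outer is read once; inner is rescanned once per buffer-load of the outer.
    return outer_pages + math.ceil(outer_pages / buffer_pages) * inner_pages

# Hypothetical sizes: small table = 10 pages, large table = 1000 pages,
# 5 buffer pages available for the outer input.
small_first = block_nested_loop_ios(10, 1000, 5)    # 10 + 2 * 1000  = 2010
large_first = block_nested_loop_ios(1000, 10, 5)    # 1000 + 200 * 10 = 3000
```

Under this model the small-outer plan does 2010 page reads against 3000 for the reverse order, which is the kind of difference an optimizer (or, on Informix, you) is choosing between.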

Deciding whether to denormalize or normalize is a fairly straightforward process when you consider the complexity class of the join. For instance, I tend to design my databases with normalization when the queries are O(k log n), where k is relative to the desired output size.

An easy way to denormalize and optimize for performance is to think about how changes to your normalized structure affect your denormalized structure. It can be problematic, however, as it may require transactional logic to work on the denormalized structure.

The debate over normalization versus denormalization isn't going to end, because the problems are vast. There are many problems whose natural solution requires both approaches.

As a general rule, I've always stored a normalized structure plus denormalized caches that can be reconstructed. Eventually these caches save me from future normalization problems.
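A minimal sketch of that pattern (the schema here is hypothetical): the normalized rows stay the source of truth, and the denormalized form is just a cache that can always be thrown away and rebuilt from them.

```python
# Normalized source of truth: customers and orders kept separately.
customers = {1: "Ann", 2: "Bob"}
orders = [(10, 1, 25.0), (11, 2, 40.0), (12, 1, 5.0)]  # (order_id, cust_id, total)

def rebuild_cache(customers, orders):
    # Denormalized cache: per-customer name plus a precomputed order total.
    # Everything in it is derived, so it can be regenerated at any time.
    cache = {cid: {"name": name, "order_total": 0.0}
             for cid, name in customers.items()}
    for _order_id, cust_id, total in orders:
        cache[cust_id]["order_total"] += total
    return cache

cache = rebuild_cache(customers, orders)
```

Because the cache holds only derived data, a bug or a schema change in the denormalized side never costs you anything but a rebuild.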


I think the whole question is based on a false premise. Joins on large tables are not necessarily expensive. In fact, doing joins efficiently is one of the main reasons relational databases exist at all. Joins on large sets often are expensive, but very rarely do you want to join the entire contents of large table A with the entire contents of large table B. Instead, you write the query such that only the important rows of each table are used and the actual set kept by the join remains smaller.
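For instance (using sqlite3 purely as an illustration; the tables and columns are invented), the predicates shrink both inputs before or while they are joined, so the engine never touches most of either table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, region TEXT);
    CREATE TABLE orders (id INTEGER, cust_id INTEGER, placed_on TEXT);
    INSERT INTO customers VALUES (1, 'EU'), (2, 'US');
    INSERT INTO orders VALUES (10, 1, '2020-01-05'), (11, 2, '2020-01-06'),
                              (12, 1, '2019-12-31');
""")
# The WHERE clauses restrict each side to its important rows; the join only
# ever has to match the survivors, not the full contents of both tables.
rows = conn.execute("""
    SELECT o.id
    FROM orders o
    JOIN customers c ON c.id = o.cust_id
    WHERE c.region = 'EU' AND o.placed_on >= '2020-01-01'
""").fetchall()
```

Here the join effectively operates on one customer and two orders, not on everything in either table, which is the point: the "set kept by the join" is what you pay for.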

In addition, you have the efficiencies mentioned by Peter Wone, such that only the important parts of each record need to be kept in memory until the final result set is materialized. Also, in large queries with many joins, you typically want to start with the smaller table sets and work your way up to the large ones, so the set kept in memory stays as small as possible for as long as possible.

When done properly, joins are generally the best way to compare, combine, or filter large amounts of data.

To elaborate on what others have said:

Joins are just cartesian products with some lip gloss. {1,2,3,4} x {1,2,3} would give us 12 combinations (m x n = 4 x 3 = 12). This computed set acts as a reference on which conditions are applied. The DBMS applies the conditions (like where both left and right are 2 or 3) to give us the matching combination(s). In practice it is more optimized, but the problem is the same: growing either set grows the result multiplicatively, and the amount of memory and CPU cycles consumed grows accordingly.
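That description, taken literally in Python (purely as a model of the concept; no real DBMS enumerates the full product):

```python
from itertools import product

left, right = [1, 2, 3, 4], [1, 2, 3]
# The raw cartesian product: m x n = 4 x 3 = 12 pairs.
pairs = list(product(left, right))
# The join condition (here, plain equality) filters the reference set down
# to the matching combinations.
matches = [(l, r) for l, r in pairs if l == r]
```

The filter throws away 9 of the 12 pairs here; the job of a real join algorithm is to reach `matches` without ever building `pairs`.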

When we denormalize, we avoid this computation entirely; it's like putting a colored sticky note on every page of your book. You can infer the information without consulting a reference. The price we pay is that we compromise the essence of the DBMS (the optimal organization of data).