Performance optimization strategies of last resort

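There are plenty of performance questions on this site already, but it occurs to me that almost all of them are very problem-specific and fairly narrow. And almost all of them repeat the advice to avoid premature optimization.

Let's assume:

the code is already working correctly

the algorithms chosen are already optimal for the circumstances of the problem

the code has been measured, and the offending routines have been isolated

all attempts to optimize will also be measured, to ensure they do not make matters worse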

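What I am looking for here is strategies and tricks to squeeze out the last few percent in a critical algorithm, when there is nothing else left to do but whatever it takes.

Ideally, try to make answers language agnostic, and indicate any downsides to the suggested strategies where applicable.

I'll add a reply with my own initial suggestions, and I look forward to whatever else the Stack Overflow community can think of.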

Current answer

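Although I like Mike Dunlavey's answer (it is in fact a great answer, with a supporting example), I think it could be expressed very simply:

Find out what takes the largest amounts of time first, and understand why.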

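It is the process of identifying the time hogs that helps you understand where you must refine your algorithm. This is the only all-encompassing, language-agnostic answer I can find to a problem that is already supposed to be fully optimised. It also presumes that you want to stay architecture independent in your quest for speed.

So while the algorithm may be optimised, its implementation may not be. The identification lets you know which part is which: algorithm or implementation. Whichever hogs the most time is your prime candidate for review. But since you say you want to squeeze out the last few percent, you might also want to examine the lesser parts, the ones you did not look at closely the first time.

Lastly, a bit of trial and error with performance figures for different ways of implementing the same solution, or possibly for different algorithms, can bring insights that help identify time wasters and time savers.

HPH, asoudmove.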

2011-01-26 04:35:20

Other answers

OK, you're defining the problem to where it would seem there is not much room for improvement. That is fairly rare, in my experience. I tried to explain this in a Dr. Dobbs article in November 1993, by starting from a conventionally well-designed non-trivial program with no obvious waste and taking it through a series of optimizations until its wall-clock time was reduced from 48 seconds to 1.1 seconds, and the source code size was reduced by a factor of 4. My diagnostic tool was this. The sequence of changes was this:

The first problem found was use of list clusters (now called "iterators" and "container classes") accounting for over half the time. Those were replaced with fairly simple code, bringing the time down to 20 seconds.

Now the largest time-taker is more list-building. As a percentage, it was not so big before, but now it is because the bigger problem was removed. I find a way to speed it up, and the time drops to 17 seconds.

Now it is harder to find obvious culprits, but there are a few smaller ones that I can do something about, and the time drops to 13 seconds.

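Now I seem to have hit a wall. The samples are telling me exactly what it is doing, but I can't seem to find anything that I can improve. Then I reflect on the basic design of the program, on its transaction-driven structure, and ask whether all the list-searching it is doing is actually mandated by the requirements of the problem.

Then I hit upon a redesign, in which the program code is actually generated (via preprocessor macros) from a smaller set of source, and in which the program is not constantly figuring out things that the programmer knows are fairly predictable. In other words, don't "interpret" the sequence of things to do, "compile" it.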
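The "compile it, don't interpret it" idea can be sketched with an X-macro. This is not the author's actual redesign, whose source is not shown here. It is only a minimal illustration, and every name in it (`STEP_SEQUENCE`, `step_twice`, `run_interpreted`, `run_compiled`, and so on) is hypothetical. The same list of steps is expanded once into a table that is walked at run time and once into straight-line calls that the compiler can see through:

```c
#include <assert.h>  /* only needed by the usage line shown after the block */
#include <stddef.h>

/* Hypothetical steps; each one transforms a running value. */
static int step_twice(int v)    { return v * 2; }
static int step_plus_one(int v) { return v + 1; }
static int step_squared(int v)  { return v * v; }

/* The order of the steps is written exactly once, as an X-macro list. */
#define STEP_SEQUENCE(X) X(twice) X(plus_one) X(squared)

typedef int (*step_fn)(int);

/* "Interpreted": the order lives in a table that is walked at run time,
   with one indirect call per step. */
#define AS_TABLE_ENTRY(name) step_##name,
static const step_fn step_table[] = { STEP_SEQUENCE(AS_TABLE_ENTRY) };

int run_interpreted(int v) {
    for (size_t i = 0; i < sizeof step_table / sizeof step_table[0]; i++) {
        v = step_table[i](v);
    }
    return v;
}

/* "Compiled": the same list expands to straight-line direct calls,
   so nothing about the order is decided at run time. */
#define AS_CALL(name) v = step_##name(v);

int run_compiled(int v) {
    STEP_SEQUENCE(AS_CALL)
    return v;
}
```

Both versions compute the same result, so a check such as `assert(run_interpreted(3) == run_compiled(3));` holds. Whether the straight-line form is actually faster depends on the compiler and the workload, so it would need to be measured like any other change.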

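That redesign is done, shrinking the source code by a factor of 4, and the time is reduced to 10 seconds.

Now, because it is getting so quick, it is hard to sample, so I give it 10 times as much work to do, but the following times are based on the original workload.

More diagnosis reveals that it is spending time in queue management. In-lining these reduces the time to 7 seconds. Now a big time-taker is the diagnostic printing I had been doing. Flush that - 4 seconds. Now the biggest time-takers are calls to malloc and free. Recycle objects - 2.6 seconds. Continuing to sample, I still find operations that are not strictly necessary - 1.1 seconds.

Total speedup factor: 43.6 (48 seconds down to 1.1 seconds)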
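The "recycle objects" step above can be sketched as a simple free list. The answer does not show how the original program did it, so this is only one common way, with a hypothetical `task` type and hypothetical function names. It also assumes single-threaded use:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical object type; the original program's types are not shown. */
typedef struct task {
    int id;
    struct task *next_free;  /* link used only while the object is on the free list */
} task;

static task *free_list = NULL;  /* recycled objects waiting to be reused */

/* Reuse a recycled object if one is available, otherwise fall back to malloc. */
task *task_acquire(int id) {
    task *t = free_list;
    if (t != NULL) {
        free_list = t->next_free;
    } else {
        t = malloc(sizeof *t);
        if (t == NULL) return NULL;
    }
    t->id = id;
    t->next_free = NULL;
    return t;
}

/* Push the object onto the free list instead of calling free. */
void task_release(task *t) {
    assert(t != NULL);
    t->next_free = free_list;
    free_list = t;
}

/* Hand everything on the free list back to the allocator, e.g. at shutdown. */
void task_pool_drain(void) {
    while (free_list != NULL) {
        task *t = free_list;
        free_list = t->next_free;
        free(t);
    }
}
```

The downsides are that memory stays reserved until `task_pool_drain` is called, the list is not thread-safe as written, and tools that track malloc/free pairs no longer see use-after-release bugs.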

Now no two programs are alike, but in non-toy software I've always seen a progression like this. First you get the easy stuff, and then the more difficult, until you get to a point of diminishing returns. Then the insight you gain may well lead to a redesign, starting a new round of speedups, until you again hit diminishing returns. Now this is the point at which it might make sense to wonder whether ++i or i++ or for(;;) or while(1) are faster: the kinds of questions I see so often on Stack Overflow.

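P.S. It may be wondered why I didn't use a profiler. The answer is that almost every one of these "problems" was a function call site, which stack samples pinpoint. Profilers, even today, are just barely coming around to the idea that statements and call instructions are more important to locate, and easier to fix, than whole functions.

I actually built a profiler to do this, but for real down-and-dirty intimacy with what the code is doing, there is no substitute for getting your fingers right in it. The small number of samples is not an issue, because none of the problems being found are so tiny that they are easily missed.

ADDED: jerryjvl requested some examples. Here is the first problem. It consists of a small number of separate lines of code, together taking over half the time: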

/* IF ALL TASKS DONE, SEND ITC_ACKOP, AND DELETE OP */
if (ptop->current_task >= ILST_LENGTH(ptop->tasklist)){
. . .
/* FOR EACH OPERATION REQUEST */
for ( ptop = ILST_FIRST(oplist); ptop != NULL; ptop = ILST_NEXT(oplist, ptop)){
. . .
/* GET CURRENT TASK */
ptask = ILST_NTH(ptop->tasklist, ptop->current_task);

These were using the list cluster ILST (similar to a list class). They are implemented in the usual way, with "information hiding" meaning that the users of the class were not supposed to have to care how they were implemented. When these lines were written (out of roughly 800 lines of code) thought was not given to the idea that these could be a "bottleneck" (I hate that word). They are simply the recommended way to do things. It is easy to say in hindsight that these should have been avoided, but in my experience all performance problems are like that. In general, it is good to try to avoid creating performance problems. It is even better to find and fix the ones that are created, even though they "should have been avoided" (in hindsight). I hope that gives a bit of the flavor.

Here is the second problem, in two separate lines:

/* ADD TASK TO TASK LIST */
ILST_APPEND(ptop->tasklist, ptask);
. . .
/* ADD TRANSACTION TO TRANSACTION QUEUE */
ILST_APPEND(trnque, ptrn);

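These are building lists by appending items to their ends. (The fix was to collect the items in arrays and build the lists all at once.) The interesting thing is that these statements only cost (i.e. were on the call stack for) 3/48 of the original time, so they were not in fact a big problem at the beginning. However, after the first problem was removed, they cost 3/20 of the time and so were now a "bigger fish". In general, that's how it goes.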
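To make the "collect in an array, build the list once" fix concrete, here is a small sketch. The ILST implementation is not shown in the answer, so this assumes, purely for illustration, a singly linked list with no tail pointer, where every append has to walk to the end. The `node` type and both function names are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical list node; ILST's real layout is not shown in the answer. */
typedef struct node {
    int value;
    struct node *next;
} node;

/* Append by walking to the end: each call costs O(length),
   so building an n-item list this way costs O(n^2) overall. */
node *list_append_slow(node *head, node *item) {
    assert(item != NULL);
    item->next = NULL;
    if (head == NULL) return item;
    node *p = head;
    while (p->next != NULL) p = p->next;
    p->next = item;
    return head;
}

/* Link an array of n already-collected nodes in a single pass, O(n) in total. */
node *list_from_array(node *items, size_t n) {
    if (n == 0) return NULL;
    for (size_t i = 0; i + 1 < n; i++) {
        items[i].next = &items[i + 1];
    }
    items[n - 1].next = NULL;
    return &items[0];
}
```

Both functions produce the same list. The difference is only in how much pointer chasing happens while building it, which is the kind of cost a stack sample would show as time spent inside the append call.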

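I might add that this project was distilled from a real project I helped on. In that project, the performance problems were far more dramatic (as were the speedups), such as calling a database-access routine within an inner loop to see if a task was finished.

REFERENCE ADDED: The source code, both original and redesigned, can be found at www.ddj.com, for 1993, in file 9311.zip, files slug.asc and slug.zip.

EDIT 2011/11/26: There is now a SourceForge project containing the source code in Visual C++, along with a blow-by-blow description of how it was tuned. It only goes through the first half of the scenario described above, and it does not follow exactly the same sequence, but it still gets a speedup of 2-3 orders of magnitude.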

2009-05-29 19:41:53

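Sometimes changing the layout of your data can help. In C, you might switch from an array of structures to a structure of arrays, or vice versa.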
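As a minimal sketch of the two layouts (the `particle` types, the `PARTICLE_COUNT` constant and the function names are made up for illustration), here is the same data stored as an array of structures and as a structure of arrays, with a loop that touches only one field:

```c
#include <assert.h>
#include <stddef.h>

#define PARTICLE_COUNT 4  /* illustrative size only */

/* Array of structures: the x, y and z of one particle sit next to each other. */
typedef struct {
    float x, y, z;
} particle;

/* Structure of arrays: all x values are contiguous, likewise y and z. */
typedef struct {
    float x[PARTICLE_COUNT];
    float y[PARTICLE_COUNT];
    float z[PARTICLE_COUNT];
} particles;

/* Reads x with a stride of sizeof(particle); y and z are pulled into cache too. */
float sum_x_aos(const particle *p, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) s += p[i].x;
    return s;
}

/* Reads x with unit stride; only the x array is touched. */
float sum_x_soa(const particles *p, size_t n) {
    assert(n <= PARTICLE_COUNT);
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) s += p->x[i];
    return s;
}
```

Which layout wins depends on the access pattern: a loop that uses x, y and z of each element together may well prefer the array of structures, which is why the switch can pay off in either direction and should be measured.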

2009-05-29 22:20:29

I spend most of my life in just this place. The broad strokes are to run your profiler and get it to record:

Cache misses. Data cache is the #1 source of stalls in most programs. Improve cache hit rate by reorganizing offending data structures to have better locality; pack structures and numerical types down to eliminate wasted bytes (and therefore wasted cache fetches); prefetch data wherever possible to reduce stalls.

Load-hit-stores. Compiler assumptions about pointer aliasing, and cases where data is moved between disconnected register sets via memory, can cause a certain pathological behavior that causes the entire CPU pipeline to clear on a load op. Find places where floats, vectors, and ints are being cast to one another and eliminate them. Use __restrict liberally to promise the compiler about aliasing.

Microcoded operations. Most processors have some operations that cannot be pipelined, but instead run a tiny subroutine stored in ROM. Examples on the PowerPC are integer multiply, divide, and shift-by-variable-amount. The problem is that the entire pipeline stops dead while this operation is executing. Try to eliminate use of these operations or at least break them down into their constituent pipelined ops so you can get the benefit of superscalar dispatch on whatever the rest of your program is doing.

Branch mispredicts. These too empty the pipeline. Find cases where the CPU is spending a lot of time refilling the pipe after a branch, and use branch hinting if available to get it to predict correctly more often. Or better yet, replace branches with conditional-moves wherever possible, especially after floating point operations because their pipe is usually deeper and reading the condition flags after fcmp can cause a stall.

Sequential floating-point ops. Make these SIMD.
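As a small illustration of the aliasing promise mentioned under load-hit-stores, here is a sketch using the standard C99 `restrict` qualifier (`__restrict` is the spelling several compilers accept as an extension). The function name `add_scaled` and its parameters are made up for this example:

```c
#include <assert.h>
#include <stddef.h>

/* restrict is a promise from the caller that dst, a and b never overlap.
   Without it, the compiler must assume a store to dst[i] could change a[] or
   b[], which limits how far it can reorder or vectorize the loads. */
void add_scaled(float *restrict dst, const float *restrict a,
                const float *restrict b, float k, size_t n) {
    assert(dst != NULL && a != NULL && b != NULL);
    for (size_t i = 0; i < n; i++) {
        dst[i] = a[i] + k * b[i];
    }
}
```

The downside is that the promise is not checked: calling this with overlapping buffers is undefined behavior, so it only belongs on interfaces where the caller really can guarantee distinct memory.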

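One more thing I like to do:

Set your compiler to output assembly listings and look at what it emits for the hotspot functions in your code. All those clever optimizations that "a good compiler should be able to do for you automatically"? Chances are your actual compiler doesn't do them. I've seen GCC emit truly WTF code.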

2009-05-29 22:19:44

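Since many performance problems involve database issues, I'll give you some specific things to look at when tuning queries and stored procedures.

Avoid cursors in most databases. Avoid looping as well. Most of the time, data access should be set-based, not record-by-record processing. This includes not reusing a single-record stored procedure when you want to insert 1,000,000 records at once.

Never use select *; return only the fields you actually need. This is especially true if there are any joins, because the join fields will be repeated and thus cause unnecessary load on both the server and the network.

Avoid correlated subqueries. Use joins instead (including joins to derived tables where possible). I know this is true for Microsoft SQL Server, but test the advice when using a different backend.

Index, index, index. And get those statistics updated, if that applies to your database.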

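Make the query sargable. That means avoiding things that make it impossible to use the indexes, such as a wildcard as the first character of a like clause, or a function in the join or on the left side of a where condition.

Use correct data types. It is faster to do date math on a date field than to have to convert a string datatype to a date datatype and then do the calculation.

Never put a loop of any kind into a trigger!

Most databases have a way to check how the query will be executed. In Microsoft SQL Server this is called an execution plan. Check those first to see where the problem areas lie.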

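Consider how often the query runs, as well as how long it takes to run, when determining what needs to be optimized. Sometimes you can gain more performance from a slight tweak to a query that runs millions of times a day than from shaving time off a long-running query that only runs once a month.

Use some sort of profiler tool to find out what is really being sent to and from the database. I can remember one time when we couldn't figure out why a page was so slow to load when the stored procedure was fast, and found out through profiling that the web page was asking for the query many, many times instead of once.

The profiler will also help you find who is blocking whom. Some queries that execute quickly while running alone may become really slow due to locks from other queries.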

2009-05-29 21:41:20