When is assembly faster than C?

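One of the stated reasons for knowing assembler is that, on occasion, it can be used to write code that performs better than the same code written in a higher-level language, C in particular. However, I've also heard it said many times that, although this is not entirely false, the cases where assembler can actually be used to produce faster code are extremely rare and require expert knowledge of and experience with assembly.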

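This question doesn't even get into the fact that assembler instructions are machine-specific and non-portable, or any of the other aspects of assembler. There are plenty of good reasons for knowing assembly besides this one, of course, but this is meant to be a specific question asking for examples and data, not an extended discourse on assembler versus higher-level languages.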

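Can anyone provide specific examples of cases where assembly is faster than well-written C code compiled with a modern compiler, and can you support that claim with profiling evidence? I am pretty confident these cases exist, but I really want to know exactly how esoteric they are, since it seems to be a point of some contention.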

Current answer

I needed to perform a shift operation on 192 or 256 bits on every interrupt, and the interrupt fired every 50 microseconds.

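The shift goes through a fixed map (a hardware constraint). In C, it took around 10 microseconds. When I translated it to assembler, taking into account the specific features of this map, caching values in specific registers, and using bit-oriented operations, it took less than 3.5 microseconds.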

2009-05-24 15:28:46

Other answers

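Tight loops, such as when working with images, since an image may consist of millions of pixels. Sitting down and figuring out how to make the best use of the limited number of processor registers can make a big difference. Here's a real-life example: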

http://danbystrom.se/2008/12/22/optimizing-away-ii/

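Processors also often have some esoteric instructions that are too specialized for a compiler to bother with, but that an assembler programmer can occasionally put to good use. Take the XLAT instruction, for example. It is really great if you need to do table lookups in a loop and the table is limited to 256 bytes!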
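For reference, the kind of loop XLAT targets looks roughly like this in C. This is only a sketch: the function name `translate_bytes` is made up for illustration and does not come from the linked article. On x86, XLAT loads AL from the table at (E)BX + AL, which is what the loop body does for each byte.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: replace every byte in buf with table[byte].
 * The index is a uint8_t, so it can never run past a 256-entry table,
 * which is the same constraint XLAT relies on. */
void translate_bytes(uint8_t *buf, size_t len, const uint8_t table[256])
{
    for (size_t i = 0; i < len; ++i) {
        buf[i] = table[buf[i]];
    }
}
```

Whether a hand-written XLAT loop actually beats what a compiler emits for this would have to be measured on the target CPU.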

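Update: Oh, I just thought of what matters most when we talk about loops in general: the compiler often has no clue how many iterations the common case will have! Only the programmer knows whether a loop will be iterated MANY times, so that it pays to prepare for the loop with some extra work, or whether it will be iterated so few times that the setup would take longer than the expected iterations themselves.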
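The trade-off can be sketched in C as well. The example below is illustrative only: `xor_checksum` and `SETUP_THRESHOLD` are hypothetical names, and the cut-over value of 64 is an assumption that would need measuring. Long buffers take a path that processes 8 bytes per iteration and then pays a small folding cost; short buffers skip that extra work and use the plain byte loop.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed cut-over point; the right value depends on the target. */
#define SETUP_THRESHOLD 64

/* Hypothetical example: XOR of all bytes in a buffer. */
uint8_t xor_checksum(const uint8_t *p, size_t n)
{
    uint8_t acc = 0;

    if (n >= SETUP_THRESHOLD) {
        /* Many iterations expected: work on 8 bytes at a time. */
        uint64_t wide = 0;
        while (n >= 8) {
            uint64_t w;
            memcpy(&w, p, 8);   /* alignment-safe load */
            wide ^= w;
            p += 8;
            n -= 8;
        }
        /* Extra work that only pays off for long inputs:
         * fold the eight byte lanes down into one byte. */
        wide ^= wide >> 32;
        wide ^= wide >> 16;
        wide ^= wide >> 8;
        acc = (uint8_t)wide;
    }

    /* Short inputs, and the tail of long ones: simple byte loop. */
    while (n > 0) {
        acc ^= *p++;
        --n;
    }
    return acc;
}
```

A modern compiler may well vectorize the simple loop on its own; the point is only that the programmer, not the compiler, knows which branch is the common case.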

2009-02-23 16:07:28

Only when using special-purpose instruction sets that the compiler doesn't support.

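To get the most out of a modern CPU with multiple pipelines and branch prediction, you need to structure the assembly program in a way that makes it a) almost impossible for a human to write and b) even more impossible to maintain.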

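Also, better algorithms, data structures and memory management will give you at least an order of magnitude more performance than the micro-optimizations you can do in assembly.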

2009-02-23 13:11:37

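I once worked with someone who said, "if the compiler is too dumb to figure out what you're trying to do and can't optimize it, your compiler is broken and it's time to get a new one". I'm sure there are edge cases where assembler will beat your C code, but if you find yourself regularly using assembler to "win" against the compiler, your compiler is busted.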

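The same goes for writing "optimized" SQL that tries to coerce the query planner into doing things. If you find yourself rearranging queries to get the planner to do what you want, your query planner is busted; get a new one.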

2009-03-03 04:26:08

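I think the general case where assembler is faster is when a smart assembly programmer looks at the compiler's output and says "this is a critical path for performance and I can write this more efficiently", and then that person tweaks the assembler or rewrites it from scratch.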

2009-02-23 13:11:08

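Although C is "close" to the low-level manipulation of 8-bit, 16-bit, 32-bit and 64-bit data, there are a few mathematical operations not supported by C that can often be performed elegantly in certain assembly instruction sets: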

1. Fixed-point multiplication: The product of two 16-bit numbers is a 32-bit number. But the rules in C say that the product of two 16-bit numbers is a 16-bit number, and the product of two 32-bit numbers is a 32-bit number -- the bottom half in both cases. If you want the top half of a 16x16 multiply or a 32x32 multiply, you have to play games with the compiler. The general method is to cast to a larger-than-necessary bit width, multiply, shift down, and cast back:

       int16_t x, y;
       // int16_t is a typedef for "short"
       // set x and y to something
       int16_t prod = (int16_t)(((int32_t)x*y)>>16);

   In this case the compiler may be smart enough to know that you're really just trying to get the top half of a 16x16 multiply and do the right thing with the machine's native 16x16 multiply. Or it may be stupid and require a library call to do the 32x32 multiply, which is way overkill because you only need 16 bits of the product -- but the C standard doesn't give you any way to express yourself.

2. Certain bitshifting operations (rotation/carries):

       // 256-bit array shifted right in its entirety:
       uint8_t x[32];
       for (int i = 32; --i > 0; )
       {
          x[i] = (x[i] >> 1) | (x[i-1] << 7);
       }
       x[0] >>= 1;

   This is not too inelegant in C, but again, unless the compiler is smart enough to realize what you are doing, it's going to do a lot of "unnecessary" work. Many assembly instruction sets allow you to rotate or shift left/right with the result in the carry flag, so you could accomplish the above in 34 instructions: load a pointer to the beginning of the array, clear the carry, and perform 32 8-bit right-shifts, using auto-increment on the pointer.

   For another example, there are linear feedback shift registers (LFSR) that are elegantly performed in assembly: take a chunk of N bits (8, 16, 32, 64, 128, etc.), shift the whole thing right by 1 (see the algorithm above), then if the resulting carry is 1, XOR in a bit pattern that represents the polynomial.
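For comparison, here is roughly what the LFSR step described above looks like in C for a single 16-bit word. This is a sketch, not the answer's own code: the function name `lfsr16_step` is made up, and the tap mask 0xB400 (the polynomial x^16 + x^14 + x^13 + x^11 + 1) is simply a commonly used maximal-length choice picked for illustration. In assembly the "carry" is the flag left behind by the shift; in C it has to be saved by hand before shifting.

```c
#include <stdint.h>

/* One step of a 16-bit Galois LFSR.
 * The tap mask 0xB400 is an assumed example polynomial. */
uint16_t lfsr16_step(uint16_t state)
{
    uint16_t carry = state & 1u;   /* the bit about to be shifted out */
    state >>= 1;                   /* shift the whole register right by 1 */
    if (carry) {
        state ^= 0xB400u;          /* XOR in the polynomial bit pattern */
    }
    return state;
}
```

For wider registers (128 bits and up), the shift itself becomes the multi-word loop shown in item 2, which is where carry-aware assembly instructions help most.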

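Having said that, I wouldn't resort to these techniques unless I had serious performance constraints. As others have said, assembly is much harder to document/debug/test/maintain than C code: the performance gain comes with some serious costs.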

Edit: 3. Overflow detection is possible in assembly (you can't really do it in C), and this makes some algorithms much easier.
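To illustrate the gap: in assembly an add sets the overflow flag and a single conditional jump can test it, whereas standard C has no access to that flag and signed overflow is undefined behavior, so the check has to happen before the addition. The following is a minimal sketch of that workaround; `add_i32_checked` is a hypothetical name. GCC and Clang also offer `__builtin_add_overflow` as a compiler extension, which is not part of the C standard this answer refers to.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: add two int32_t values without ever overflowing.
 * Returns false (and leaves *sum untouched) if a + b does not fit. */
bool add_i32_checked(int32_t a, int32_t b, int32_t *sum)
{
    /* The range test must come first, because performing the
     * overflowing addition itself would be undefined behavior. */
    if ((b > 0 && a > INT32_MAX - b) ||
        (b < 0 && a < INT32_MIN - b)) {
        return false;
    }
    *sum = a + b;
    return true;
}
```

The two comparisons and the subtraction are the extra work that a flag-testing jump avoids in assembly.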

2009-02-23 14:34:56
