When is assembly faster than C?

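One of the stated reasons for knowing assembler is that, on occasion, it can be used to write code that performs better than the same code written in a higher-level language, C in particular. However, I have also heard it said many times that, although this is not entirely false, the cases where assembler can actually be used to generate more performant code are extremely rare and require expert knowledge of, and experience with, assembly.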

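This question does not even get into the fact that assembler instructions are machine-specific and non-portable, or any of the other aspects of assembler. There are plenty of good reasons for knowing assembly besides this one, of course, but this is meant to be a specific question soliciting examples and data, not an extended discourse on assembler versus higher-level languages.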

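Can anyone provide some specific examples of cases where assembly is faster than well-written C code compiled with a modern compiler, and can you support that claim with profiling evidence? I am fairly confident these cases exist, but I really want to know exactly how esoteric they are, since this seems to be a point of some contention.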

Current answer

In my job, there are three reasons for me to know and use assembly. In order of importance:

1. Debugging - I often get library code that has bugs or incomplete documentation. I figure out what it is doing by stepping in at the assembly level. I have to do this about once a week. I also use it as a tool to debug problems in which my eyes don't spot the idiomatic error in C/C++/C#. Looking at the assembly gets past that.

2. Optimizing - the compiler does fairly well at optimizing, but I play in a different ballpark than most. I write image processing code that usually starts with code that looks like this:

       for (int y = 0; y < imageHeight; y++) {
           for (int x = 0; x < imageWidth; x++) {
               // do something
           }
       }

   The "do something" part typically happens on the order of several million times (i.e., between 3 and 30 million). By scraping cycles in that "do something" phase, the performance gains are hugely magnified. I don't usually start there. I usually start by writing the code to work first, then do my best to refactor the C to be naturally better (better algorithm, less load in the loop, etc.). I usually need to read assembly to see what is going on and rarely need to write it. I do this maybe every two or three months.

3. Doing something the language won't let me. This includes getting the processor architecture and specific processor features, accessing flags in the CPU (man, I really wish C gave you access to the carry flag), etc. I do this maybe once every year or two.
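To illustrate the remark about the carry flag, here is a minimal sketch of a two-word addition in portable C. The `u128_parts` type and the `add_u128` function are hypothetical names made up for this example. Because C exposes no carry flag, the carry out of the low half has to be recovered with a comparison, where many instruction sets offer an add-with-carry instruction instead.

```c
#include <stdint.h>

/* Hypothetical 128-bit unsigned value split into two 64-bit halves. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} u128_parts;

/* Add two 128-bit values. C has no carry flag, so the carry out of the
   low half is recovered by checking whether the unsigned sum wrapped. */
static u128_parts add_u128(u128_parts a, u128_parts b)
{
    u128_parts r;
    r.lo = a.lo + b.lo;              /* wraps modulo 2^64 */
    uint64_t carry = (r.lo < a.lo);  /* 1 if the low half wrapped */
    r.hi = a.hi + b.hi + carry;
    return r;
}
```

A good compiler may well turn the comparison back into an add-with-carry, but that is something to confirm by reading the generated assembly rather than to assume.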

2009-02-23 16:22:00

Other answers

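Longpoke, there is just one limitation: time. When you don't have the resources to optimize every single change to the code and to spend your time allocating registers, optimizing a few spills away and so on, the compiler will win every single time. You make your modification to the code, recompile and measure. Repeat if necessary.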

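Also, you can do a lot on the high-level side. Inspecting the resulting assembly may give the impression that the code is garbage, but in practice it may run faster than you think. Example: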

    int y = data[i];
    // do some stuff here..
    call_function(y, ...);

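The compiler will read the data, push it to the stack (a spill), and later read it back from the stack and pass it as an argument. Sounds bad? It might actually be very effective latency compensation and result in a faster runtime.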

    // optimized version
    call_function(data[i], ...); // not so optimized after all..

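The idea with the optimized version was that we reduce register pressure and avoid spilling. But in truth, the "garbage" version was faster!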

Looking at the assembly code, counting the instructions, and concluding "more instructions, therefore slower" would be a misjudgment.

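The thing to pay attention to here is that many assembly experts think they know a lot but know very little. The rules also change from architecture to architecture. There is no silver-bullet x86 code, for example, that is always the fastest. These days it is better to go by rules of thumb: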

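- Memory is slow.
- Cache is fast.
- Try to use the cache better.
- How often are you going to miss? Do you have a latency compensation strategy?
- You can execute 10-100 ALU/FPU/SSE instructions for one single cache miss.
- Application architecture is important...
- ...but it does not help when the problem is not in the architecture.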
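The cache rules of thumb above can be sketched with two traversals of the same array. This is an illustrative example, not something from the answer: the array size and the function names `sum_row_major` and `sum_col_major` are assumptions. Both functions compute the same sum, but the first touches adjacent memory on every step while the second jumps a whole row ahead each time, so it is likely to miss the cache far more often on large arrays.

```c
#include <stddef.h>

#define ROWS 256
#define COLS 256

/* Row-major walk: consecutive accesses touch adjacent memory. */
static long sum_row_major(int grid[ROWS][COLS])
{
    long total = 0;
    for (size_t y = 0; y < ROWS; y++)
        for (size_t x = 0; x < COLS; x++)
            total += grid[y][x];
    return total;
}

/* Column-major walk: each access jumps COLS * sizeof(int) bytes ahead. */
static long sum_col_major(int grid[ROWS][COLS])
{
    long total = 0;
    for (size_t x = 0; x < COLS; x++)
        for (size_t y = 0; y < ROWS; y++)
            total += grid[y][x];
    return total;
}
```

How large the difference is depends on the cache sizes and the compiler, so it has to be measured on the target machine.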

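Also, trusting too much in the compiler to magically transform poorly thought-out C/C++ code into "theoretically optimal" code is wishful thinking. You have to know the compiler and toolchain you use if you care about "performance" at this low level.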

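C/C++ compilers are generally not very good at reordering sub-expressions because, for starters, functions have side effects. Functional languages do not suffer from this caveat but do not fit the current ecosystem that well. There are compiler options that relax the precision rules and so allow the compiler/linker/code generator to change the order of operations.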
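One reason those relaxed precision options exist is that floating-point addition is not associative, so a compiler that follows the strict rules may not regroup a sum. The sketch below is an illustration under that assumption. The function names are made up for this example, and GCC and Clang's `-ffast-math` is named only as one example of such an option.

```c
/* Sum three doubles grouped from the left: (a + b) + c. */
static double sum_left(double a, double b, double c)
{
    return (a + b) + c;
}

/* Sum the same three doubles grouped from the right: a + (b + c). */
static double sum_right(double a, double b, double c)
{
    return a + (b + c);
}
```

With a = 1e20, b = -1e20 and c = 1.0, the left grouping cancels the large terms first and keeps the 1.0, while the right grouping loses the 1.0 when it is added to -1e20. Under the strict rules the two functions are therefore not interchangeable.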

This topic is a bit of a dead end; for most people it is not relevant, and the rest already know what they are doing anyway.

It all boils down to this: "understand what you are doing", which is a bit different from knowing what you are doing.

2010-09-17 13:12:59

C often has to do things that look unnecessary from an assembly coder's point of view, simply because the C standard says so.

Integer promotion, for example. If you want to shift a char variable in C, you would usually expect the code to do just that: a single shift.

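The standard, however, requires the compiler to sign-extend the value to int before the shift and to truncate the result to char afterwards, which might complicate the code depending on the target processor's architecture.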
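The promotion described above can be observed directly in C. The sketch below uses `uint8_t` rather than plain `char`, so the widening is a zero extension rather than a sign extension. The function names are hypothetical.

```c
#include <stdint.h>

/* Shift an 8-bit value left by one. The operand is promoted to int,
   shifted at int width, then truncated back to 8 bits by the cast. */
static uint8_t shl8(uint8_t v)
{
    return (uint8_t)(v << 1);
}

/* Same shift, but returning the promoted int result so that the bit
   shifted past the 8-bit boundary is still visible. */
static int shl8_promoted(uint8_t v)
{
    return v << 1;
}
```

For an input of 0x80, `shl8` yields 0 while `shl8_promoted` yields 0x100, which shows that the shift itself happens at int width. Whether the widening and truncation cost anything in the generated code depends on the target and the compiler.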

2014-03-15 13:41:19

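Although C is "close" to the low-level manipulation of 8-bit, 16-bit, 32-bit and 64-bit data, there are a few mathematical operations not supported by C that can often be performed elegantly in certain assembly instruction sets: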

1. Fixed-point multiplication: The product of two 16-bit numbers is a 32-bit number. But the rules in C say that the product of two 16-bit numbers is a 16-bit number, and the product of two 32-bit numbers is a 32-bit number: the bottom half in both cases. If you want the top half of a 16x16 multiply or a 32x32 multiply, you have to play games with the compiler. The general method is to cast to a larger-than-necessary bit width, multiply, shift down, and cast back:

       int16_t x, y;   // int16_t is a typedef for "short"
       // set x and y to something
       int16_t prod = (int16_t)(((int32_t)x*y)>>16);

   In this case the compiler may be smart enough to know that you are really just trying to get the top half of a 16x16 multiply and do the right thing with the machine's native 16x16 multiply. Or it may be stupid and require a library call to do a 32x32 multiply, which is overkill because you only need 16 bits of the product, but the C standard does not give you any way to express yourself.

2. Certain bitshifting operations (rotation/carries):

       // 256-bit array shifted right in its entirety:
       uint8_t x[32];
       for (int i = 32; --i > 0; )
       {
           x[i] = (x[i] >> 1) | (x[i-1] << 7);
       }
       x[0] >>= 1;

   This is not too inelegant in C, but again, unless the compiler is smart enough to realize what you are doing, it is going to do a lot of "unnecessary" work. Many assembly instruction sets allow you to rotate or shift left/right with the result in the carry flag, so you could accomplish the above in 34 instructions: load a pointer to the beginning of the array, clear the carry, and perform 32 8-bit right-shifts, using auto-increment on the pointer.

   For another example, there are linear feedback shift registers (LFSRs) that are elegantly performed in assembly: take a chunk of N bits (8, 16, 32, 64, 128, etc.), shift the whole thing right by 1 (see the algorithm above), then if the resulting carry is 1, XOR in a bit pattern that represents the polynomial.
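For comparison, the LFSR step described above can be written in C roughly as follows. This is a sketch for a single 16-bit word. The tap mask 0xB400 is an assumed example polynomial, not one given in the answer, and `lfsr_step` is a hypothetical name. The bit that assembly would find in the carry flag has to be saved by hand before the shift.

```c
#include <stdint.h>

/* One step of a 16-bit Galois LFSR.
   0xB400 is an assumed example tap mask (x^16 + x^14 + x^13 + x^11 + 1). */
static uint16_t lfsr_step(uint16_t state)
{
    uint16_t carry = state & 1u;   /* the bit about to be shifted out */
    state >>= 1;                   /* shift the whole register right by 1 */
    if (carry)
        state ^= 0xB400u;          /* XOR in the polynomial bit pattern */
    return state;
}
```

For registers wider than the largest native integer type, the shift has to be chained across words as in the 256-bit loop above, which is where a shift-through-carry instruction saves the most work.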

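Having said that, I would not resort to these techniques unless I had serious performance constraints. As others have said, assembly is much harder to document/debug/test/maintain than C code: the performance gain comes with some serious costs.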

Edit: 3. Overflow detection is possible in assembly (you can't really do it in C), and this makes some algorithms much easier.
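To show what this looks like on the C side, here is a minimal sketch of a checked signed addition in portable C. The name `checked_add` is hypothetical. Signed overflow is undefined behavior in C, so the test has to be done with comparisons before the addition, instead of reading an overflow flag after it.

```c
#include <limits.h>
#include <stdbool.h>

/* Returns true and stores a + b in *out when the sum fits in an int.
   Returns false and leaves *out untouched when it would overflow. */
static bool checked_add(int a, int b, int *out)
{
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b))
        return false;              /* the sum would not fit in an int */
    *out = a + b;                  /* safe: the range was checked first */
    return true;
}
```

This works, but it costs extra comparisons and branches per operation unless the compiler recognizes the pattern.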

2009-02-23 14:34:56

Matrix operations written with SIMD instructions are probably faster than compiler-generated code.

2009-02-23 13:06:09

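One of the possibilities in the CP/M-86 version of PolyPascal (a sibling to Turbo Pascal) was to replace the "use the BIOS to output characters to the screen" facility with a machine-language routine which, in essence, was given the x, the y, and the string to put there.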

This made it possible to update the screen much, much faster than before!

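There was room in the binary to embed machine code (a few hundred bytes), and other things had to go there too, so it was essential to squeeze as much as possible into it.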

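It turns out that since the screen was 80x25, each coordinate could fit in one byte, so both could fit in a two-byte word. This allowed the necessary calculations to be done in fewer bytes, since a single add could manipulate both values simultaneously.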

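To my knowledge there are no C compilers that can merge multiple values into one register, do SIMD-style instructions on them, and split them out again later (and I don't think the machine instructions would be any shorter anyway).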
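The packed-coordinate trick can be written by hand in C. The sketch below is an illustration, not the original routine: the byte layout (x in the low byte, y in the high byte) and all of the function names are assumptions. A single 16-bit addition updates both coordinates, provided the x byte does not overflow into the y byte.

```c
#include <stdint.h>

/* Pack column x (low byte) and row y (high byte) into one 16-bit word.
   The layout is an assumption made for this example. */
static uint16_t pack_xy(uint8_t x, uint8_t y)
{
    return (uint16_t)(((uint16_t)y << 8) | x);
}

static uint8_t unpack_x(uint16_t p) { return (uint8_t)(p & 0xFFu); }
static uint8_t unpack_y(uint16_t p) { return (uint8_t)(p >> 8); }

/* Move both coordinates with one addition. This is only valid while
   x + dx stays below 256, so that no carry leaks into the y byte. */
static uint16_t move_xy(uint16_t pos, uint16_t delta)
{
    return (uint16_t)(pos + delta);
}
```

On an 80x25 screen both coordinates stay far below 256, so the carry condition holds for any on-screen position.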

2009-02-23 14:15:01
