When is assembly faster than C?

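One of the stated reasons for knowing assembler is that, on occasion, it can be used to write code that performs better than the same code written in a higher-level language, C in particular. However, I have also heard it said many times that, although this is not entirely false, the cases where assembler can actually be used to produce faster code are extremely rare and require expert knowledge of and experience with assembly.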

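This question does not even touch on the fact that assembler instructions are machine-specific and non-portable, or on any other aspect of assembler. There are plenty of good reasons for knowing assembly besides this one, of course, but this is meant to be a specific question asking for examples and data, not an extended discourse on assembler versus higher-level languages.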

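Can anyone provide specific examples of cases where assembly is faster than well-written C code compiled with a modern compiler, and can you support that claim with profiling evidence? I am fairly confident these cases exist, but I really want to know exactly how esoteric they are, since this seems to be a point of some contention.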

Current answer

Here are a few examples from my personal experience:

Access to instructions that are not accessible from C. For instance, many architectures (like x86-64, IA-64, DEC Alpha, and 64-bit MIPS or PowerPC) support a 64 bit by 64 bit multiplication producing a 128 bit result. GCC recently added an extension providing access to such instructions, but before that assembly was required. And access to this instruction can make a huge difference on 64-bit CPUs when implementing something like RSA - sometimes as much as a factor of 4 improvement in performance.

Access to CPU-specific flags. The one that has bitten me a lot is the carry flag; when doing a multiple-precision addition, if you don't have access to the CPU carry bit one must instead compare the result to see if it overflowed, which takes 3-5 more instructions per limb; and worse, which are quite serial in terms of data accesses, which kills performance on modern superscalar processors. When processing thousands of such integers in a row, being able to use addc is a huge win (there are superscalar issues with contention on the carry bit as well, but modern CPUs deal pretty well with it).

SIMD. Even autovectorizing compilers can only do relatively simple cases, so if you want good SIMD performance it's unfortunately often necessary to write the code directly. Of course you can use intrinsics instead of assembly but once you're at the intrinsics level you're basically writing assembly anyway, just using the compiler as a register allocator and (nominally) instruction scheduler. (I tend to use intrinsics for SIMD simply because the compiler can generate the function prologues and whatnot for me so I can use the same code on Linux, OS X, and Windows without having to deal with ABI issues like function calling conventions, but other than that the SSE intrinsics really aren't very nice - the Altivec ones seem better though I don't have much experience with them).
As examples of things a (current day) vectorizing compiler can't figure out, read about bitslicing AES or SIMD error correction - one could imagine a compiler that could analyze algorithms and generate such code, but it feels to me like such a smart compiler is at least 30 years away from existing (at best).
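To make the first two points concrete, here is a minimal sketch in portable C of what one is left writing without access to a widening multiply or the carry flag. The function names mul64x64 and mp_add are hypothetical, chosen only for illustration; this is not code from the answer above. The multiply is assembled from four 32-bit partial products, and the multi-precision add recovers each carry by comparing results, which is the extra per-limb work the answer describes.

```c
#include <stdint.h>
#include <stddef.h>

/* 64x64 -> 128-bit multiply built from four 32x32 -> 64 partial products.
   A single widening multiply instruction does this in one step. */
static void mul64x64(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t p0 = a_lo * b_lo;   /* contributes to bits   0..63  */
    uint64_t p1 = a_lo * b_hi;   /* contributes to bits  32..95  */
    uint64_t p2 = a_hi * b_lo;   /* contributes to bits  32..95  */
    uint64_t p3 = a_hi * b_hi;   /* contributes to bits  64..127 */

    /* Sum of three values below 2^32 each: cannot overflow 64 bits. */
    uint64_t mid = (p0 >> 32) + (uint32_t)p1 + (uint32_t)p2;

    *lo = (mid << 32) | (uint32_t)p0;
    *hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
}

/* r = a + b over n limbs, least significant limb first.
   Returns the carry out of the top limb. With no carry flag available,
   each carry is detected by checking whether the sum wrapped around. */
static unsigned mp_add(uint64_t *r, const uint64_t *a, const uint64_t *b, size_t n)
{
    unsigned carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t s = a[i] + carry;
        unsigned c1 = s < carry;     /* wrapped while adding the carry in */
        uint64_t t = s + b[i];
        unsigned c2 = t < s;         /* wrapped while adding b[i] */
        r[i] = t;
        carry = c1 | c2;             /* at most one of the two can be set */
    }
    return carry;
}
```

Each iteration of mp_add depends on the carry computed by the previous one through a chain of compares, which is the serial dependency mentioned above. On compilers that provide a 128-bit integer extension, mul64x64 can usually be replaced by a single cast and multiply.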

On the other hand, multicore machines and distributed systems have shifted many of the biggest performance wins in the other direction - get an extra 20% speedup writing your inner loops in assembly, or 300% by running them across multiple cores, or 10000% by running them across a cluster of machines. And of course high level optimizations (things like futures, memoization, etc) are often much easier to do in a higher level language like ML or Scala than C or asm, and often can provide a much bigger performance win. So, as always, there are tradeoffs to be made.

2009-10-15 17:07:57

Other answers

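Point one, which is not the answer. Even if you never program in it, I find it useful to know at least one assembler instruction set. This is part of the programmer's never-ending quest to know more and therefore be better. It is also useful when stepping into frameworks you don't have the source code for, so that you have at least a rough idea of what is going on. It also helps you understand Java bytecode and .NET IL, as they are both similar to assembler.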

To answer the question: when you have a small amount of code or a large amount of time. It is most useful on embedded chips, where low chip complexity and poor competition among compilers targeting these chips can tip the balance in favour of humans. Also, on restricted devices you are often trading off code size, memory size, and performance in a way that would be hard to instruct a compiler to do. For example: I know this user action is not called often, so I will accept small code size and poor performance, but this other function that looks similar is used every second, so I will accept larger code size in exchange for faster performance. That is the sort of trade-off a skilled assembly programmer can make.

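I would also like to add that there is a lot of middle ground, where you can write code in C, compile it, examine the assembly produced, and then either change your C code or tweak the output and maintain it as assembly.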

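My friend works on microcontrollers, currently chips for controlling small electric motors. He works in a combination of low-level C and assembly. He once told me of a good day at work when he reduced the main loop from 48 instructions to 43. He is also faced with choices like this: the code has grown to fill the 256k chip and the business wants a new feature, so do you remove an existing feature; reduce the size of some or all of the existing features, perhaps at the cost of performance; or advocate moving to a larger chip with a higher cost, higher power consumption, and a larger form factor?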

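I would like to add that, as a commercial developer with quite a portfolio of languages, platforms, and types of applications, I have never once felt the need to dive into writing assembly. I have, however, always appreciated the knowledge I gained about it, and I have sometimes debugged into it.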

I know I have mostly answered the question "why should I learn assembler", but I feel that is a more important question than when it is faster.

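So let's try once more. You should be thinking about assembly when you are working on low-level operating system functions, working on a compiler, or working on an extremely limited chip, embedded system, and so on.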

Remember to compare your assembly with the compiler-generated code to see which is faster, smaller, or better.

David.

2009-02-23 13:44:14

Chiming in with some history.

When I was much younger (the 1970s), assembly mattered, in my experience, more for the size of the code than for its speed.

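If a module in a higher-level language was, say, 1300 bytes of code, but an assembly version of the same module was 300 bytes, that 1K bytes was very important when you were trying to fit the application into 16K or 32K of memory.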

Compilers were not great at that time.

In old-time Fortran:

X = (Y - Z)
IF (X .LT. 0) THEN
 ... do something
ENDIF

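The compiler at the time emitted a SUBTRACT instruction for X, followed by a TEST instruction. In assembler, you would simply check the condition code (less than zero, zero, greater than zero) after the subtract.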

For modern systems and compilers, none of that is a concern.

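I do think it is still important to understand what the compiler is doing. When you code in a higher-level language, you should understand what allows or prevents the compiler from performing loop unrolling.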

The same goes for pipelining and look-ahead computation involving conditionals, when the compiler emits a "branch-likely".

Assembler is still needed for things a higher-level language does not allow, such as reading or writing processor-specific registers.

But for the most part, the general programmer no longer needs it, beyond a basic understanding of how the code is compiled and executed.

2019-10-20 16:38:19

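Without looking at the disassembly of what the compiler produces, you cannot actually know whether your well-written C code is really fast. Many times you look at it and find that "well-written" was subjective.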

So it is not necessary to write in assembler to get the fastest code, but it is certainly worth knowing assembler for that very same reason.

2009-02-23 13:09:46

One of the more famous snippets of assembly is from Michael Abrash's texture mapping loop (explained in detail here):

add edx,[DeltaVFrac] ; add in dVFrac
sbb ebp,ebp ; store carry
mov [edi],al ; write pixel n
mov al,[esi] ; fetch pixel n+1
add ecx,ebx ; add in dUFrac
adc esi,[4*ebp + UVStepVCarry]; add in steps
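For readers more comfortable with C, the carry handling in the snippet above can be sketched roughly as follows. This is an illustration under stated assumptions, not Abrash's code: the names TexState and tex_step are hypothetical, the pixel write and fetch are left out, and the two step values stand in for the two-entry table that the assembly indexes with ebp (0 or -1 after sbb ebp,ebp).

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical state for one texture-mapping inner loop. */
typedef struct {
    uint32_t u_frac;   /* fractional part of U (ecx in the snippet)  */
    uint32_t v_frac;   /* fractional part of V (edx in the snippet)  */
    size_t   src;      /* index of the current texel (esi)           */
} TexState;

/* Advance the texture position by one pixel.
   step_no_carry / step_v_carry: whole-texel advance to use when the V
   fraction does not / does wrap (the two table entries, by assumption). */
static void tex_step(TexState *s, uint32_t du_frac, uint32_t dv_frac,
                     size_t step_no_carry, size_t step_v_carry)
{
    /* add edx,[DeltaVFrac] ; sbb ebp,ebp  -> remember the V carry */
    uint32_t old_v = s->v_frac;
    s->v_frac += dv_frac;
    int v_carry = s->v_frac < old_v;

    /* add ecx,ebx  -> the U carry lives in the CPU carry flag */
    uint32_t old_u = s->u_frac;
    s->u_frac += du_frac;
    int u_carry = s->u_frac < old_u;

    /* adc esi,[4*ebp + UVStepVCarry]  -> table entry plus the U carry */
    s->src += (v_carry ? step_v_carry : step_no_carry) + (size_t)u_carry;
}
```

In C each carry has to be recomputed with a comparison and the table selection becomes a conditional, whereas the assembly gets both carries for free from the flags and folds the selection and the final add into one adc.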

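Nowadays most compilers expose advanced CPU-specific instructions as intrinsics, i.e. functions that compile down to the actual instruction. MS Visual C++ supports intrinsics for MMX, SSE, SSE2, SSE3, and SSE4, so you have to worry less about dropping down to assembly to take advantage of platform-specific instructions. Visual C++ can also take advantage of the actual architecture you are targeting with the appropriate /arch setting.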
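As a small illustration of what intrinsics look like in practice, the sketch below adds four 32-bit integers with a single SSE2 instruction. The function name add4_i32 and the HAVE_SSE2 macro are hypothetical; the intrinsics themselves (_mm_loadu_si128, _mm_add_epi32, _mm_storeu_si128) come from the standard emmintrin.h header. A scalar fallback is included on the assumption that the code may also be built for targets without SSE2.

```c
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>   /* SSE2 intrinsics */
#define HAVE_SSE2 1
#endif

/* out[i] = a[i] + b[i] for i = 0..3, with wraparound on overflow. */
static void add4_i32(const int32_t *a, const int32_t *b, int32_t *out)
{
#ifdef HAVE_SSE2
    /* Unaligned loads, one packed add, one unaligned store. */
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb));
#else
    /* Scalar fallback; unsigned arithmetic gives the same wraparound. */
    for (int i = 0; i < 4; i++)
        out[i] = (int32_t)((uint32_t)a[i] + (uint32_t)b[i]);
#endif
}
```

The compiler still handles register allocation and the calling convention here, which is the main practical difference from writing the same packed add in an assembly file.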

2009-02-23 16:17:19

It all depends on your workload.

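For day-to-day operations, C and C++ are just fine, but certain workloads (any transforms involving video, such as compression, decompression, and image effects) pretty much require assembly to be performant.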

They also usually involve CPU-specific chipset extensions (MME/MMX/SSE/etc.) that are tuned for those kinds of operations.

2009-02-24 04:58:27
