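Is multiplication and division using shift operators in C actually faster?

Multiplication and division can be achieved using bit operators, for example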

i*2 = i<<1
i*3 = (i<<1) + i;
i*10 = (i<<3) + (i<<1)

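and so on.

Is it actually faster to use, say, (i<<3)+(i<<1) to multiply by 10 than to use i*10 directly? Is there any sort of input that can't be multiplied or divided in this way?

Current answer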

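Shift and integer multiply instructions have similar performance on most modern CPUs. Integer multiply instructions were relatively slow back in the 1980s, but in general this is no longer true. Integer multiply instructions may have higher latency, so there may still be cases where a shift is preferable. The same goes for cases where you can keep more execution units busy (although this can cut both ways).

Integer division is still relatively slow though, so using a shift instead of division by a power of 2 is still a win, and most compilers will implement this as an optimisation. Note however that for this optimisation to be valid the dividend needs to be either unsigned or known to be positive. For a negative dividend the shift and the divide are not equivalent!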

#include <stdio.h>

int main(void)
{
    int i;

    for (i = 5; i >= -5; --i)
    {
        printf("%d / 2 = %d, %d >> 1 = %d\n", i, i / 2, i, i >> 1);
    }
    return 0;
}

Output:

5 / 2 = 2, 5 >> 1 = 2
4 / 2 = 2, 4 >> 1 = 2
3 / 2 = 1, 3 >> 1 = 1
2 / 2 = 1, 2 >> 1 = 1
1 / 2 = 0, 1 >> 1 = 0
0 / 2 = 0, 0 >> 1 = 0
-1 / 2 = 0, -1 >> 1 = -1
-2 / 2 = -1, -2 >> 1 = -1
-3 / 2 = -1, -3 >> 1 = -2
-4 / 2 = -2, -4 >> 1 = -2
-5 / 2 = -2, -5 >> 1 = -3

So if you want to help the compiler, make sure the variable or expression in the dividend is explicitly unsigned.

2011-06-15 11:42:54

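Other answers

Shifting is generally a lot faster than multiplying at an instruction level, but you may well be wasting your time doing premature optimisations. The compiler may well perform these optimisations at compile time. Doing it yourself will affect readability and possibly have no effect on performance. It's probably only worth doing things like this if you have profiled and found this to be a bottleneck.

Actually, the division trick known as "magic division" can yield huge payoffs. Again, you should profile first to see if it's needed. But if you do use it, there are useful programs around to help you figure out what instructions are needed for the same division semantics. Here is an example: http://www.masm32.com/board/index.php?topic=12421.0

An example which I have lifted from the OP's thread on MASM32: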

include ConstDiv.inc
...
mov eax,9999999
; divide eax by 100000
cdiv 100000
; edx = quotient

Would generate:

mov eax,9999999
mov edx,0A7C5AC47h
add eax,1
.if !CARRY?
    mul edx
.endif
shr edx,16

2011-06-15 11:35:10

Is it actually faster to use, say, (i<<3)+(i<<1) to multiply by 10 than to use i*10 directly?

It might or might not be on your machine. If you care, measure in your actual usage.

A case study - from 486 to core i7

Benchmarking is very difficult to do meaningfully, but we can look at a few facts. From http://www.penguin.cz/~literakl/intel/s.html#SAL and http://www.penguin.cz/~literakl/intel/i.html#IMUL we get an idea of x86 clock cycles needed for arithmetic shift and multiplication. Say we stick to "486" (the newest one listed), 32 bit registers and immediates, IMUL takes 13-42 cycles and IDIV 44. Each SAL takes 2, and adding 1, so even with a few of those together shifting superficially looks like a winner.

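Nowadays, with the core i7:

(from http://software.intel.com/en-us/forums/showthread.php?t=61481)

The latency is 1 cycle for an integer addition and 3 cycles for an integer multiplication. You can find the latencies and throughput in Appendix C of the "Intel® 64 and IA-32 Architectures Optimization Reference Manual", which is located on http://www.intel.com/products/processor/manuals/.

(from some Intel blurb)

Using SSE, the Core i7 can issue simultaneous add and multiply instructions, resulting in a peak rate of 8 floating-point operations (FLOP) per clock cycle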

That gives you an idea of how far things have come. The optimisation trivia - like bit shifting versus * - that was being taken seriously even into the 90s is just obsolete now. Bit-shifting is still faster, but for non-power-of-two mul/div by the time you do all your shifts and add the results it's slower again. Then, more instructions means more cache faults, more potential issues in pipelining, more use of temporary registers may mean more saving and restoring of register content from the stack... it quickly gets too complicated to quantify all the impacts definitively but they're predominantly negative.

Functionality in source code vs implementation

More generally, your question is tagged C and C++. As 3rd generation languages, they're specifically designed to hide the details of the underlying CPU instruction set. To satisfy their language Standards, they must support multiplication and shifting operations (and many others) even if the underlying hardware doesn't. In such cases, they must synthesize the required result using many other instructions. Similarly, they must provide software support for floating point operations if the CPU lacks it and there's no FPU. Modern CPUs all support * and <<, so this might seem absurdly theoretical and historical, but the significant thing is that the freedom to choose implementation goes both ways: even if the CPU has an instruction that implements the operation requested in the source code in the general case, the compiler's free to choose something else that it prefers because it's better for the specific case the compiler's faced with.

Examples (with a hypothetical assembly language)

source           literal approach         optimised approach
#define N 0
int x;           .word x                xor registerA, registerA
x *= N;          move x -> registerA
                 move x -> registerB
                 A = B * immediate(0)
                 store registerA -> x
  ...............do something more with x...............

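Instructions like exclusive or (xor) have no relationship to the source code, but xor-ing anything with itself clears all the bits, so it can be used to set something to 0. Source code that implies memory addresses may not entail any being used.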

These kinds of hacks have been used for as long as computers have been around. In the early days of 3GLs, to secure developer uptake the compiler output had to satisfy the existing hardcore hand-optimising assembly-language dev. community that the produced code wasn't slower, more verbose or otherwise worse. Compilers quickly adopted lots of great optimisations - they became a better centralised store of it than any individual assembly language programmer could possibly be, though there's always the chance that they miss a specific optimisation that happens to be crucial in a specific case - humans can sometimes nut it out and grope for something better while compilers just do as they've been told until someone feeds that experience back into them.

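So, even if shifting and adding is still faster on some particular hardware, the compiler writer is likely to have worked out exactly when it's both safe and beneficial.

Maintainability

If your hardware changes you can recompile, and the compiler will look at the target CPU and make another best choice, whereas you're unlikely to ever want to revisit your "optimisations" or list which compilation environments should use multiplication and which should shift. Think of all the non-power-of-two bit-shifted "optimisations" written 10+ years ago that are now slowing down the code they're in as it runs on modern processors...!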

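Thankfully, good compilers like GCC can typically replace a series of bitshifts and arithmetic with a direct multiplication when any optimisation is enabled (i.e. ...main(...) { return (argc << 4) + (argc << 2) + argc; } -> imull $21, 8(%ebp), %eax), so a recompilation may help even without fixing the code, but that's not guaranteed.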

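Strange bitshifting code implementing multiplication or division is far less expressive of what you were conceptually trying to achieve, so other developers will be confused by it, and a confused programmer is more likely to introduce bugs or remove something essential in an effort to restore seeming sanity. If you only do non-obvious things when they're really tangibly beneficial, and then document them well (but don't document other stuff that's intuitive anyway), everyone will be happier.

General solutions versus partial solutions

If you have some extra knowledge, such as that your int will really only be storing values x, y and z, then you may be able to work out some instructions that work for those values and get you your result more quickly than when the compiler doesn't have that insight and needs an implementation that works for all int values. For example, consider your question:

Multiplication and division can be achieved using bit operators...

You illustrate multiplication, but how about division?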

int x;
x >> 1;   // divide by 2?

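According to the C++ Standard 5.8:

-3- The value of E1 >> E2 is E1 right-shifted E2 bit positions. If E1 has an unsigned type or if E1 has a signed type and a nonnegative value, the value of the result is the integral part of the quotient of E1 divided by the quantity 2 raised to the power E2. If E1 has a signed type and a negative value, the resulting value is implementation-defined.

So, your bit shift has an implementation-defined result when x is negative: it may not work the same way on different machines. But / works far more predictably. (It may not be perfectly consistent either, as different machines may have different representations of negative numbers, and hence different ranges even when there are the same number of bits making up the representation.)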

You may say "I don't care... that int is storing the age of the employee, it can never be negative". If you have that kind of special insight, then yes - your >> safe optimisation might be passed over by the compiler unless you explicitly do it in your code. But, it's risky and rarely useful as much of the time you won't have this kind of insight, and other programmers working on the same code won't know that you've bet the house on some unusual expectations of the data you'll be handling... what seems a totally safe change to them might backfire because of your "optimisation".

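Is there any sort of input that can't be multiplied or divided in this way?

Yes... as mentioned above, negative numbers have implementation-defined behaviour when "divided" by bit-shifting.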

2011-06-17 10:28:54

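It depends on the processor and the compiler. Some compilers already optimize code this way, others don't. So you need to check every time your code needs to be optimized this way.

Unless you desperately need to optimize, I would not scramble my source code just to save an assembly instruction or a processor cycle.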

2011-06-15 11:34:48

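I too wanted to see if I could beat the house. This is a more general bitwise multiplication of any number by any number. The macros I made are about 25% to 100% slower than normal * multiplication. As others have said, if the constant is close to a power of 2 or made up of a few powers of 2, you might win. For example, X*23 made up of (X<<4)+(X<<2)+(X<<1)+X is going to be slower than X*65 made up of (X<<6)+X.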

#include <stdio.h>
#include <stdlib.h>  /* rand, srand */
#include <time.h>

#define MULTIPLYINTBYMINUS(X,Y) (-((X >> 30) & 1)&(Y<<30))+(-((X >> 29) & 1)&(Y<<29))+(-((X >> 28) & 1)&(Y<<28))+(-((X >> 27) & 1)&(Y<<27))+(-((X >> 26) & 1)&(Y<<26))+(-((X >> 25) & 1)&(Y<<25))+(-((X >> 24) & 1)&(Y<<24))+(-((X >> 23) & 1)&(Y<<23))+(-((X >> 22) & 1)&(Y<<22))+(-((X >> 21) & 1)&(Y<<21))+(-((X >> 20) & 1)&(Y<<20))+(-((X >> 19) & 1)&(Y<<19))+(-((X >> 18) & 1)&(Y<<18))+(-((X >> 17) & 1)&(Y<<17))+(-((X >> 16) & 1)&(Y<<16))+(-((X >> 15) & 1)&(Y<<15))+(-((X >> 14) & 1)&(Y<<14))+(-((X >> 13) & 1)&(Y<<13))+(-((X >> 12) & 1)&(Y<<12))+(-((X >> 11) & 1)&(Y<<11))+(-((X >> 10) & 1)&(Y<<10))+(-((X >> 9) & 1)&(Y<<9))+(-((X >> 8) & 1)&(Y<<8))+(-((X >> 7) & 1)&(Y<<7))+(-((X >> 6) & 1)&(Y<<6))+(-((X >> 5) & 1)&(Y<<5))+(-((X >> 4) & 1)&(Y<<4))+(-((X >> 3) & 1)&(Y<<3))+(-((X >> 2) & 1)&(Y<<2))+(-((X >> 1) & 1)&(Y<<1))+(-((X >> 0) & 1)&(Y<<0))
#define MULTIPLYINTBYSHIFT(X,Y) (((((X >> 30) & 1)<<31)>>31)&(Y<<30))+(((((X >> 29) & 1)<<31)>>31)&(Y<<29))+(((((X >> 28) & 1)<<31)>>31)&(Y<<28))+(((((X >> 27) & 1)<<31)>>31)&(Y<<27))+(((((X >> 26) & 1)<<31)>>31)&(Y<<26))+(((((X >> 25) & 1)<<31)>>31)&(Y<<25))+(((((X >> 24) & 1)<<31)>>31)&(Y<<24))+(((((X >> 23) & 1)<<31)>>31)&(Y<<23))+(((((X >> 22) & 1)<<31)>>31)&(Y<<22))+(((((X >> 21) & 1)<<31)>>31)&(Y<<21))+(((((X >> 20) & 1)<<31)>>31)&(Y<<20))+(((((X >> 19) & 1)<<31)>>31)&(Y<<19))+(((((X >> 18) & 1)<<31)>>31)&(Y<<18))+(((((X >> 17) & 1)<<31)>>31)&(Y<<17))+(((((X >> 16) & 1)<<31)>>31)&(Y<<16))+(((((X >> 15) & 1)<<31)>>31)&(Y<<15))+(((((X >> 14) & 1)<<31)>>31)&(Y<<14))+(((((X >> 13) & 1)<<31)>>31)&(Y<<13))+(((((X >> 12) & 1)<<31)>>31)&(Y<<12))+(((((X >> 11) & 1)<<31)>>31)&(Y<<11))+(((((X >> 10) & 1)<<31)>>31)&(Y<<10))+(((((X >> 9) & 1)<<31)>>31)&(Y<<9))+(((((X >> 8) & 1)<<31)>>31)&(Y<<8))+(((((X >> 7) & 1)<<31)>>31)&(Y<<7))+(((((X >> 6) & 1)<<31)>>31)&(Y<<6))+(((((X >> 5) & 1)<<31)>>31)&(Y<<5))+(((((X >> 4) & 1)<<31)>>31)&(Y<<4))+(((((X >> 3) & 1)<<31)>>31)&(Y<<3))+(((((X >> 2) & 1)<<31)>>31)&(Y<<2))+(((((X >> 1) & 1)<<31)>>31)&(Y<<1))+(((((X >> 0) & 1)<<31)>>31)&(Y<<0))
int main()
{
    int randomnumber=23;
    int randomnumber2=23;
    int checknum=23;
    clock_t start, diff;
    srand(time(0));
    start = clock();
    for(int i=0;i<1000000;i++)
    {
        randomnumber = rand() % 10000;
        randomnumber2 = rand() % 10000;
        checknum=MULTIPLYINTBYMINUS(randomnumber,randomnumber2);
        if (checknum!=randomnumber*randomnumber2)
        {
            printf("s %i and %i and %i",checknum,randomnumber,randomnumber2);
        }
    }
    diff = clock() - start;
    int msec = diff * 1000 / CLOCKS_PER_SEC;
    printf("MULTIPLYINTBYMINUS Time %d milliseconds\n", msec);
    start = clock();
    for(int i=0;i<1000000;i++)
    {
        randomnumber = rand() % 10000;
        randomnumber2 = rand() % 10000;
        checknum=MULTIPLYINTBYSHIFT(randomnumber,randomnumber2);
        if (checknum!=randomnumber*randomnumber2)
        {
            printf("s %i and %i and %i",checknum,randomnumber,randomnumber2);
        }
    }
    diff = clock() - start;
    msec = diff * 1000 / CLOCKS_PER_SEC;
    printf("MULTIPLYINTBYSHIFT Time %d milliseconds\n", msec);
    start = clock();
    for(int i=0;i<1000000;i++)
    {
        randomnumber = rand() % 10000;
        randomnumber2 = rand() % 10000;
        checknum= randomnumber*randomnumber2;
        if (checknum!=randomnumber*randomnumber2)
        {
            printf("s %i and %i and %i",checknum,randomnumber,randomnumber2);
        }
    }
    diff = clock() - start;
    msec = diff * 1000 / CLOCKS_PER_SEC;
    printf("normal * Time %d milliseconds\n", msec);
    return 0;
}

2020-01-12 07:51:49

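I agree with the marked answer by Drew Hall. The answer could use some additional notes though.

For the vast majority of software developers the processor and compiler are no longer relevant to the question. Most of us are far beyond the 8088 and MS-DOS. It is perhaps only relevant for those who are still developing for embedded processors...

At my software company, math (add/sub/mul/div) should be used for all mathematics, while shift should be used when converting between data types, for example getting a byte as n>>8 and not n/256.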

2012-12-03 19:24:31
