For example, multiplication and division can be implemented using bitwise operators:
i*2 = i<<1
i*3 = (i<<1) + i
i*10 = (i<<3) + (i<<1)
and so on.
Is it actually faster to multiply by 10 using, say, (i<<3)+(i<<1) than to use i*10 directly? Is there any kind of input that cannot be multiplied or divided this way?
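As a quick illustrative sketch (my own, not part of any answer below), the following self-contained C++ program checks those identities and shows one class of inputs where the shift form does not behave like plain division: signed negative operands on typical two's-complement machines.

#include <cassert>
#include <cstdio>

int main()
{
    // The shift-based multiplication identities from the question.
    for (int i = 0; i < 1000; ++i) {
        assert(i * 2  == (i << 1));
        assert(i * 3  == (i << 1) + i);
        assert(i * 10 == (i << 3) + (i << 1));
    }

    // One kind of input where the shift "equivalent" differs: signed negative
    // division. Integer division truncates toward zero, while an arithmetic
    // right shift (the usual two's-complement behavior) rounds toward
    // negative infinity.
    printf("-3 / 2  = %d\n", -3 / 2);   // prints -1
    printf("-3 >> 1 = %d\n", -3 >> 1);  // typically prints -2
    return 0;
}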
Current answer
Some optimizations cannot be done by the compiler because they only work for a restricted set of inputs.
Below is C++ sample code that performs a faster division by doing a 64-bit "multiplication by the reciprocal". Both the numerator and the denominator must be below a certain threshold. Note that it must be compiled to use 64-bit instructions to actually be faster than normal division.
#include <stdio.h>
#include <chrono>

// Fixed-point "multiply by reciprocal" parameters: 32 fractional bits.
static const unsigned s_bc = 32;
static const unsigned long long s_p = 1ULL << s_bc;
static const unsigned long long s_hp = s_p / 2;
static unsigned long long s_f;   // floor(2^32 / d)
static unsigned long long s_fr;  // s_f * (2^32 mod d): corrects the truncation error of s_f

static void fastDivInitialize(const unsigned d)
{
    s_f = s_p / d;
    s_fr = s_f * (s_p - (s_f * d));
}

static unsigned fastDiv(const unsigned n)
{
    return (s_f * n + ((s_fr * n + s_hp) >> s_bc)) >> s_bc;
}

static bool fastDivCheck(const unsigned n, const unsigned d)
{
    // 32 to 64 cycles latency on modern cpus
    const unsigned expected = n / d;

    // At least 10 cycles latency on modern cpus
    const unsigned result = fastDiv(n);

    if (result != expected)
    {
        printf("Failed for: %u/%u != %u\n", n, d, expected);
        return false;
    }

    return true;
}

int main()
{
    unsigned result = 0;

    // Make sure to verify it works for your expected set of inputs
    const unsigned MAX_N = 65535;
    const unsigned MAX_D = 40000;

    const double ONE_SECOND_COUNT = 1000000000.0;

    auto t0 = std::chrono::steady_clock::now();
    unsigned count = 0;
    printf("Verifying...\n");
    for (unsigned d = 1; d <= MAX_D; ++d)
    {
        fastDivInitialize(d);
        for (unsigned n = 0; n <= MAX_N; ++n)
        {
            count += !fastDivCheck(n, d);
        }
    }
    auto t1 = std::chrono::steady_clock::now();
    printf("Errors: %u / %u (%.4fs)\n", count, MAX_D * (MAX_N + 1), (t1 - t0).count() / ONE_SECOND_COUNT);

    t0 = t1;
    for (unsigned d = 1; d <= MAX_D; ++d)
    {
        fastDivInitialize(d);
        for (unsigned n = 0; n <= MAX_N; ++n)
        {
            result += fastDiv(n);
        }
    }
    t1 = std::chrono::steady_clock::now();
    printf("Fast division time: %.4fs\n", (t1 - t0).count() / ONE_SECOND_COUNT);

    t0 = t1;
    count = 0;
    for (unsigned d = 1; d <= MAX_D; ++d)
    {
        for (unsigned n = 0; n <= MAX_N; ++n)
        {
            result += n / d;
        }
    }
    t1 = std::chrono::steady_clock::now();
    printf("Normal division time: %.4fs\n", (t1 - t0).count() / ONE_SECOND_COUNT);

    getchar();
    return result;
}
Other answers
I think in the one case that you want to multiply or divide by a power of two, you can't go wrong with using bitshift operators, even if the compiler converts them to a MUL/DIV, because some processors microcode (really, a macro) them anyway, so for those cases you will achieve an improvement, especially if the shift is more than 1. Or more explicitly, if the CPU has no bitshift operators, it will be a MUL/DIV anyway, but if the CPU has bitshift operators, you avoid a microcode branch and this is a few instructions less.
I am writing some code right now that requires a lot of doubling/halving operations because it is working on a dense binary tree, and there is one more operation that I suspect might be more optimal than an addition - a left (power of two multiply) shift with an addition. This can be replaced with a left shift and an xor if the shift is wider than the number of bits you want to add, easy example is (i<<1)^1, which adds one to a doubled value. This does not of course apply to a right shift (power of two divide) because only a left (little endian) shift fills the gap with zeros.
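As a minimal self-contained sketch of that shift-plus-xor trick (my own illustration, not the poster's actual tree code), this checks that xor-ing in a bit which cannot collide with the shifted value is the same as adding it:

#include <cassert>

int main()
{
    // (i << 1) always has bit 0 clear, so xor-ing in 1 cannot carry and is
    // therefore identical to adding 1 to the doubled value.
    for (unsigned i = 0; i < 1000000u; ++i) {
        assert(((i << 1) ^ 1u) == (i << 1) + 1u);
    }
    return 0;
}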
In my code, these multiply/divide-by-two and power-of-two operations are used very heavily, and because the formulas are already quite short, every instruction that can be eliminated is a substantial gain. If the processor does not support these bitshift operators, there is no gain, but there is also no loss.
Also, in the algorithms I am writing, they visually represent the movements that occur so in that sense they are in fact more clear. The left hand side of a binary tree is bigger, and the right is smaller. As well as that, in my code, odd and even numbers have a special significance, and all left-hand children in the tree are odd and all right hand children, and the root, are even. In some cases, which I haven't encountered yet, but may, oh, actually, I didn't even think of this, x&1 may be a more optimal operation compared to x%2. x&1 on an even number will produce zero, but will produce 1 for an odd number.
Going a little deeper, if x & 3 is zero, I know 4 is a factor of that number, x & 7 for 8, and so on. I know these cases probably have limited utility, but it's nice to know that you can avoid a modulo operation and use bitwise logic instead, because bitwise operations are almost always the fastest and are least likely to be ambiguous to the compiler.
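A short sketch of that mask-versus-modulo equivalence, for unsigned values (illustrative only):

#include <cassert>

int main()
{
    // For unsigned integers, x & (2^k - 1) equals x % 2^k, so a zero mask
    // result means 2^k divides x.
    for (unsigned x = 0; x < 1000000u; ++x) {
        assert((x & 1u) == x % 2u);                  // odd/even test
        assert(((x & 3u) == 0u) == (x % 4u == 0u));  // divisible by 4
        assert(((x & 7u) == 0u) == (x % 8u == 0u));  // divisible by 8
    }
    return 0;
}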
I am to a large extent inventing the field of dense binary trees, so I expect people may not grasp the value of this comment, since very rarely do people want to factorize only by powers of two, or multiply/divide only by powers of two.
As far as I know, on some machines multiplication can take up to 16 to 32 machine cycles. So yes, depending on the machine type, bitshift operators are faster than multiplication/division.
However, certain machines do have a math coprocessor that contains special instructions for multiplication/division.
Whether it is actually faster depends on the hardware and the compiler actually being used.
Just one concrete point of measurement: many years ago, I benchmarked two versions of my hashing algorithm:
unsigned
hash( char const* s )
{
    unsigned h = 0;
    while ( *s != '\0' ) {
        h = 127 * h + (unsigned char)*s;
        ++ s;
    }
    return h;
}
and
unsigned
hash( char const* s )
{
    unsigned h = 0;
    while ( *s != '\0' ) {
        h = (h << 7) - h + (unsigned char)*s;
        ++ s;
    }
    return h;
}
On every machine I benchmarked it on, the first was at least as fast as the second. Somewhat surprisingly, it was sometimes faster (e.g. on a Sun Sparc). When the hardware didn't support fast multiplication (and most didn't back then), the compiler would convert the multiplication into the appropriate combination of shifts and add/subtract operations. And because it knew the final goal, it could sometimes do so in fewer instructions than when you explicitly wrote out the shifts and the adds/subtracts.
Note that this was something like 15 years ago. Hopefully, compilers have only gotten better since then, so you can pretty much count on the compiler doing the right thing, probably better than you could. (Also, the reason the code looks so C'ish is that it was over 15 years ago; I'd obviously use std::string and iterators today.)
It depends on the processor and the compiler. Some compilers already optimize code this way, others don't. So you need to check each time your code needs to be optimized this way.
Unless you desperately need to optimize, I would not scramble my source code just to save an assembly instruction or a processor cycle.