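Count the number of set bits in a 32-bit integer

The 8 bits representing the number 7 look like this:

00000111

Three bits are set.

What algorithms determine the number of set bits in a 32-bit integer?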

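Current answer

Around 1990 I wrote a fast bit-count macro for RISC machines. It does not use advanced arithmetic (multiplication, division, %), memory fetches (way too slow), or branches (way too slow), but it does assume the CPU has a 32-bit barrel shifter (in other words, >> 1 and >> 32 take the same number of cycles). It also assumes that small constants (such as 6, 12, 24) cost nothing to load into registers, or are stored in temporaries and reused over and over.

With these assumptions, it counts 32 bits in about 16 cycles/instructions on most RISC machines. Note that 15 instructions/cycles is close to a lower bound on the number of cycles or instructions, because it seems to take at least 3 instructions (mask, shift, operator) to cut the number of addends in half, so with log_2(32) = 5, 5 x 3 = 15 instructions is a quasi-lower bound.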

#define BitCount(X,Y)           \
                Y = X - ((X >> 1) & 033333333333) - ((X >> 2) & 011111111111); \
                Y = ((Y + (Y >> 3)) & 030707070707); \
                Y =  (Y + (Y >> 6)); \
                Y = (Y + (Y >> 12) + (Y >> 24)) & 077;

Here is the secret to the first and most complex step:

input output
AB    CD             Note
00    00             = AB
01    01             = AB
10    01             = AB - (A >> 1) & 0x1
11    10             = AB - (A >> 1) & 0x1

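So if I take the first column (A) above, shift it right 1 bit, and subtract it from AB, I get the output (CD). The extension to 3 bits is similar; you can check it with an 8-row boolean table if you wish.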
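As a rough illustration, the macro above can be wrapped in a function and the 3-bit step checked on its own. This is only a sketch: the names `bitcount_octal` and `group3_count` are made up here, and it assumes `uint32_t` input so that the right shifts never drag in sign bits.

```c
#include <stdint.h>

/* Count of set bits in one 3-bit group g (0..7):
   g = 4a + 2b + c, so g - (g >> 1) - (g >> 2) = a + b + c. */
static unsigned group3_count(unsigned g)
{
    return g - (g >> 1) - (g >> 2);
}

/* Function form of the BitCount macro (hypothetical name). */
static unsigned bitcount_octal(uint32_t x)
{
    uint32_t y;
    /* Each 3-bit group now holds its own bit count. */
    y = x - ((x >> 1) & 033333333333u) - ((x >> 2) & 011111111111u);
    /* Add neighbouring groups: one count per 6-bit field. */
    y = (y + (y >> 3)) & 030707070707u;
    /* Add neighbouring 6-bit fields. */
    y = y + (y >> 6);
    /* Add the three remaining partial sums and keep the low 6 bits. */
    y = (y + (y >> 12) + (y >> 24)) & 077u;
    return y;
}
```

Running `group3_count` over 0 through 7 is the 8-row table mentioned above; each result should equal the number of ones in the 3-bit input.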


2010-06-11 21:40:40

Other answers

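I use the function below. I haven't benchmarked it, but it works. (Note that, as written, msb() returns the index of the highest set bit rather than the number of set bits.)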

int msb(int num)
{
    int m = 0;
    for (int i = 16; i > 0; i = i>>1)
    {
        // debug(i, num, m);
        if(num>>i)
        {
            m += i;
            num>>=i;
        }
    }
    return m;
}

2013-05-29 09:16:30

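This is one of those questions where it helps to know your micro-architecture. I just timed two variants under gcc 4.3.3 compiled with -O3, using C++ inlines to eliminate function call overhead, one billion iterations, keeping a running sum of all counts to ensure the compiler doesn't remove anything important, and using rdtsc for timing (clock-cycle precise).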

inline int pop2(unsigned x, unsigned y)
{
    x = x - ((x >> 1) & 0x55555555);
    y = y - ((y >> 1) & 0x55555555);
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
    y = (y & 0x33333333) + ((y >> 2) & 0x33333333);
    x = (x + (x >> 4)) & 0x0F0F0F0F;
    y = (y + (y >> 4)) & 0x0F0F0F0F;
    x = x + (x >> 8);
    y = y + (y >> 8);
    x = x + (x >> 16);
    y = y + (y >> 16);
    return (x+y) & 0x000000FF;
}

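The unmodified Hacker's Delight version took 12.2 gigacycles. My parallel version (counting twice as many bits) runs in 13.0 gigacycles. Both together took 10.5 s total on a 2.4 GHz Core Duo. At this clock frequency 25 gigacycles is just over 10 seconds, so I'm confident my timings are right.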
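For reference, the single-operand routine that pop2() interleaves looks roughly like this. It is a sketch of the well-known Hacker's Delight style reduction, not a copy of the code that was actually timed; the name `pop32` is an assumption.

```c
#include <stdint.h>

/* Single-word population count by successive pairwise sums. */
static unsigned pop32(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);                 /* 2-bit sums */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u); /* 4-bit sums */
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                 /* 8-bit sums */
    x = x + (x >> 8);                                 /* 16-bit sums */
    x = x + (x >> 16);                                /* total in the low byte */
    return x & 0x3Fu;                                 /* result is at most 32 */
}
```

Every line depends on the one before it, which is the dependency chain that the two-operand version tries to hide.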

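This has to do with instruction dependency chains, which are very bad for this algorithm. I could nearly double the speed again by using a pair of 64-bit registers. In fact, if I were clever and added x+y a little sooner, I could shave off some shifts. The 64-bit version with some small tweaks would come out about even, but would again count twice as many bits.

With 128-bit SIMD registers there is yet another factor of two, and the SSE instruction sets often have clever shortcuts too.

There's no reason for the code to be especially transparent. The interface is simple, the algorithm can be referenced online in many places, and it's amenable to comprehensive unit testing. The programmer who stumbles upon it might even learn something. These bit operations are extremely natural at the machine level.

OK, I decided to benchmark the tweaked 64-bit version. For this one, sizeof(unsigned long) == 8.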

inline int pop2(unsigned long x, unsigned long y)
{
    x = x - ((x >> 1) & 0x5555555555555555);
    y = y - ((y >> 1) & 0x5555555555555555);
    x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333);
    y = (y & 0x3333333333333333) + ((y >> 2) & 0x3333333333333333);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0F;
    y = (y + (y >> 4)) & 0x0F0F0F0F0F0F0F0F;
    x = x + y; 
    x = x + (x >> 8);
    x = x + (x >> 16);
    x = x + (x >> 32); 
    return x & 0xFF;
}

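That looks about right (I'm not testing carefully, though). Now the timings come out at 10.70 gigacycles / 14.1 gigacycles. The latter number summed 128 billion bits and corresponds to 5.9 s elapsed on this machine. The non-parallel version speeds up a tiny bit because I'm running in 64-bit mode, and it likes 64-bit registers slightly better than 32-bit registers.

Let's see if there's a bit more OOO pipelining to be had here. This was a bit more involved, so I actually tested it a little. Each term alone sums to 64, and all four combined sum to 256.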

inline int pop4(unsigned long x, unsigned long y, 
                unsigned long u, unsigned long v)
{
  enum { m1 = 0x5555555555555555, 
         m2 = 0x3333333333333333, 
         m3 = 0x0F0F0F0F0F0F0F0F, 
         m4 = 0x000000FF000000FF };

    x = x - ((x >> 1) & m1);
    y = y - ((y >> 1) & m1);
    u = u - ((u >> 1) & m1);
    v = v - ((v >> 1) & m1);
    x = (x & m2) + ((x >> 2) & m2);
    y = (y & m2) + ((y >> 2) & m2);
    u = (u & m2) + ((u >> 2) & m2);
    v = (v & m2) + ((v >> 2) & m2);
    x = x + y; 
    u = u + v; 
    x = (x & m3) + ((x >> 4) & m3);
    u = (u & m3) + ((u >> 4) & m3);
    x = x + u; 
    x = x + (x >> 8);
    x = x + (x >> 16);
    x = x & m4; 
    x = x + (x >> 32);
    return x & 0x000001FF;
}

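I was excited for a moment, but it turns out gcc is playing inlining tricks at -O3 even though I'm not using the inline keyword in some tests. When I let gcc play tricks, a billion calls to pop4() take 12.56 gigacycles, but I determined it was folding the arguments as constant expressions. A more realistic number appears to be 19.6 gc, for another 30% speed-up. My test loop now looks like this, making sure each argument is different enough to stop gcc from playing tricks.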

   hitime b4 = rdtsc(); 
   for (unsigned long i = 10L * 1000*1000*1000; i < 11L * 1000*1000*1000; ++i) 
      sum += pop4 (i,  i^1, ~i, i|1); 
   hitime e4 = rdtsc();

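256 billion bits summed in 8.17 s elapsed. That works out to 1.02 s for 32 billion bits (8.17 s / 8), the amount benchmarked in the 16-bit table lookup. The two can't be compared directly, because the other benchmark doesn't give a clock speed, but it looks like I've slapped the snot out of the 64 KB table edition, which is a tragic use of L1 cache in the first place.

Update: I decided to do the obvious and create pop6() by adding four more duplicated lines. It came out to 22.8 gc, with 384 billion bits summed in 9.5 s elapsed. So there's another 20%, now at 800 ms for 32 billion bits.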
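The statement earlier in this answer that each operand alone sums to 64 and all four together sum to 256 can be checked with a small C port of pop4(). This is a sketch under assumptions: `pop4_c` is a made-up name, `uint64_t` replaces `unsigned long`, and the masks are macros because C enumerators before C23 cannot hold 64-bit values.

```c
#include <stdint.h>

#define M1 UINT64_C(0x5555555555555555)
#define M2 UINT64_C(0x3333333333333333)
#define M3 UINT64_C(0x0F0F0F0F0F0F0F0F)
#define M4 UINT64_C(0x000000FF000000FF)

/* C port of pop4(): counts the set bits of four 64-bit words at once. */
static int pop4_c(uint64_t x, uint64_t y, uint64_t u, uint64_t v)
{
    x = x - ((x >> 1) & M1);
    y = y - ((y >> 1) & M1);
    u = u - ((u >> 1) & M1);
    v = v - ((v >> 1) & M1);
    x = (x & M2) + ((x >> 2) & M2);
    y = (y & M2) + ((y >> 2) & M2);
    u = (u & M2) + ((u >> 2) & M2);
    v = (v & M2) + ((v >> 2) & M2);
    x = x + y;                      /* nibbles hold at most 8 */
    u = u + v;
    x = (x & M3) + ((x >> 4) & M3); /* bytes hold at most 16 */
    u = (u & M3) + ((u >> 4) & M3);
    x = x + u;                      /* bytes hold at most 32 */
    x = x + (x >> 8);
    x = x + (x >> 16);
    x = x & M4;                     /* one partial sum per 32-bit half */
    x = x + (x >> 32);
    return (int)(x & 0x1FF);        /* total is at most 256 */
}
```

Passing `UINT64_MAX` for one operand and zero for the others should give 64, and `UINT64_MAX` for all four should give 256.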

2009-10-02 21:34:54

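32-bit or not? I just came up with this method in Java after reading "Cracking the Coding Interview", 4th edition, exercise 5.5 (chapter 5: Bit Manipulation). If the least significant bit is 1, increment the count, then right-shift the integer.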

public static int bitCount( int n){
    int count = 0;
    for (int i = n; i != 0; i = i >>> 1){  // >>> (unsigned shift) so the loop also terminates for negative n
        count += i & 1;
    }
    return count;
}

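I think this one is more intuitive than the solutions with the constant 0x33333333, no matter how fast they are. It depends on your definition of "best algorithm".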
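The same one-bit-at-a-time idea can be sketched in C. The function name `bitcount_loop` is made up here, and the argument is taken as `uint32_t` so that the right shift always brings in zeros and the loop ends after at most 32 iterations.

```c
#include <stdint.h>

/* Test the lowest bit, then shift right, until no set bits remain. */
static int bitcount_loop(uint32_t n)
{
    int count = 0;
    for (; n != 0; n >>= 1)
        count += (int)(n & 1u);
    return count;
}
```

The loop runs once per bit position up to the highest set bit, so its cost depends on the value, unlike the constant-time mask-and-add versions.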

2011-11-15 23:52:50

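Some languages portably expose the operation in a way that can use efficient hardware support if available, and otherwise fall back to some library code that is hopefully decent.

For example (from a table by language):

C++ has std::bitset<>::count(), or C++20 std::popcount(T x)
Java has java.lang.Integer.bitCount() (also for Long or BigInteger)
C# has System.Numerics.BitOperations.PopCount()
Python has int.bit_count() (since 3.10)

Not all compilers / libraries actually manage to use HW support when it's available, though. (Notably MSVC: even with options that make std::popcount inline as x86 popcnt, its std::bitset::count still always uses a lookup table. This will hopefully change in future versions.)

Also consider the built-in functions of your compiler when the portable language doesn't have this basic bit operation. In GNU C, for example: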

int __builtin_popcount (unsigned int x);
int __builtin_popcountll (unsigned long long x);

In the worst case (no single-instruction HW support) the compiler will generate a call to a function (which in current GCC uses a shift/and bit-hack like this answer, at least for x86). In the best case the compiler will emit a cpu instruction to do the job. (Just like a * or / operator - GCC will use a hardware multiply or divide instruction if available, otherwise will call a libgcc helper function.) Or even better, if the operand is a compile-time constant after inlining, it can do constant-propagation to get a compile-time-constant popcount result.
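A minimal sketch of how the builtin might be wrapped so the same source still builds on compilers without it. The wrapper name `popcount_u32` and the fallback are assumptions for illustration; the sketch also assumes `unsigned int` is at least 32 bits wide.

```c
#include <stdint.h>

/* Use the GNU C builtin when available, otherwise a portable bit-hack. */
static int popcount_u32(uint32_t x)
{
#if defined(__GNUC__) || defined(__clang__)
    /* Becomes a single popcnt when the target allows it (e.g. -mpopcnt). */
    return __builtin_popcount(x);
#else
    /* Fallback: the mask-and-add reduction. */
    x = x - ((x >> 1) & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;
    x = x + (x >> 8);
    x = x + (x >> 16);
    return (int)(x & 0x3Fu);
#endif
}
```

Callers only see `popcount_u32`, so the choice between the hardware instruction and the fallback stays a build-time detail.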

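GCC's builtins even work across multiple platforms. Popcount has almost become mainstream in the x86 architecture, so it makes sense to start using the builtin now, so that you can recompile and let it inline a hardware instruction when you compile with -mpopcnt or something that includes it (e.g. https://godbolt.org/z/Ma5e5a). Other architectures have had popcount for years, but in the x86 world there are still some ancient Core 2 and similar-vintage AMD CPUs in use.

On x86, you can tell the compiler that it can assume support for the popcnt instruction with -mpopcnt (also implied by -msse4.2). See the GCC x86 options. -march=nehalem -mtune=skylake (or -march= whatever CPU you want your code to assume and tune for) could be a good choice. Running the resulting binary on an older CPU will result in an illegal-instruction fault.

To make binaries optimized for the machine you build them on, use -march=native (with gcc, clang, or ICC).

MSVC provides an intrinsic for the x86 popcnt instruction, but unlike gcc's builtin it really is an intrinsic for the hardware instruction and requires hardware support.

Using std::bitset<>::count() instead of a built-in

In theory, any compiler that knows how to popcount efficiently for the target CPU should expose that functionality through ISO C++ std::bitset<>. In practice, you might be better off with the bit-hack AND/shift/ADD in some cases for some target CPUs.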

For target architectures where hardware popcount is an optional extension (like x86), not all compilers have a std::bitset that takes advantage of it when available. For example, MSVC has no way to enable popcnt support at compile time, and its std::bitset<>::count always uses a table lookup, even with /Ox /arch:AVX (which implies SSE4.2, which in turn implies the popcnt feature). (Update: see below; that does get MSVC's C++20 std::popcount to use x86 popcnt, but still not its bitset<>::count. MSVC could fix that by updating their standard library headers to use std::popcount when available.)

But at least you get something portable that works everywhere, and with gcc/clang and the right target options, you get hardware popcount on architectures that support it.

#include <bitset>
#include <limits>
#include <type_traits>

template<typename T>
//static inline  // static if you want to compile with -mpopcnt in one compilation unit but not others
typename std::enable_if<std::is_integral<T>::value,  unsigned >::type 
popcount(T x)
{
    static_assert(std::numeric_limits<T>::radix == 2, "non-binary type");

    // sizeof(x)*CHAR_BIT
    constexpr int bitwidth = std::numeric_limits<T>::digits + std::numeric_limits<T>::is_signed;
    // std::bitset constructor was only unsigned long before C++11.  Beware if porting to C++03
    static_assert(bitwidth <= std::numeric_limits<unsigned long long>::digits, "arg too wide for std::bitset() constructor");

    typedef typename std::make_unsigned<T>::type UT;        // probably not needed, bitset width chops after sign-extension

    std::bitset<bitwidth> bs( static_cast<UT>(x) );
    return bs.count();
}

See the asm from gcc, clang, icc, and MSVC on the Godbolt compiler explorer.

x86-64 gcc -O3 -std=gnu++11 -mpopcnt emits this:

unsigned test_short(short a) { return popcount(a); }
    movzx   eax, di      # note zero-extension, not sign-extension
    popcnt  rax, rax
    ret

unsigned test_int(int a) { return popcount(a); }
    mov     eax, edi
    popcnt  rax, rax        # unnecessary 64-bit operand size
    ret

unsigned test_u64(unsigned long long a) { return popcount(a); }
    xor     eax, eax     # gcc avoids false dependencies for Intel CPUs
    popcnt  rax, rdi
    ret

PowerPC64 gcc -O3 -std=gnu++11 emits (for the int arg version):

    rldicl 3,3,0,32     # zero-extend from 32 to 64-bit
    popcntd 3,3         # popcount
    blr

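This source isn't x86-specific or GNU-specific at all, but it only compiles well with gcc/clang/icc, at least when targeting x86 (including x86-64).

Also note that gcc's fallback for architectures without a single-instruction popcount is a byte-at-a-time table lookup. This isn't wonderful for ARM, for example.

C++20 has std::popcount(T)

Unfortunately, current libstdc++ headers define it with a special case, if(x==0) return 0;, at the start, which clang doesn't optimize away when compiling for x86: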

#include <bit>
int bar(unsigned x) {
    return std::popcount(x);
}

clang 11.0.1 -O3 -std=gnu++20 -march=nehalem (https://godbolt.org/z/arMe5a)

# clang 11
    bar(unsigned int):                                # @bar(unsigned int)
        popcnt  eax, edi
        cmove   eax, edi         # redundant: if popcnt result is 0, return the original 0 instead of the popcnt-generated 0...
        ret

But GCC compiles nicely:

# gcc 10
        xor     eax, eax         # break false dependency on Intel SnB-family before Ice Lake.
        popcnt  eax, edi
        ret

Even MSVC does well with it, as long as you use -arch:AVX or later (and enable C++20 with -std:c++latest). https://godbolt.org/z/7K4Gef

int bar(unsigned int) PROC                                 ; bar, COMDAT
        popcnt  eax, ecx
        ret     0
int bar(unsigned int) ENDP                                 ; bar

2008-09-20 19:23:05

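I got bored and timed a billion iterations of three approaches. The compiler is gcc -O3. The CPU is whatever they put in the first-generation MacBook Pro.

The fastest is the following, at 3.7 seconds: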

static unsigned char wordbits[65536]; /* bit counts of the integers 0 through 65535; must be filled in before use */
static int popcount( unsigned int i )
{
    return( wordbits[i&0xFFFF] + wordbits[i>>16] );
}

Second place goes to the same code, but looking up 4 bytes instead of 2 halfwords. That took around 5.5 seconds.
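The answer does not show how the table is filled. One possible way, sketched here under assumptions (the helper name `init_wordbits` and the recurrence used are not from the answer), is to build it once at startup from the fact that the count of i equals the count of i >> 1 plus the lowest bit of i.

```c
#include <stdint.h>

static unsigned char wordbits[65536];

/* Fill the 16-bit table once; entry i reuses the already computed entry i >> 1. */
static void init_wordbits(void)
{
    unsigned i;
    wordbits[0] = 0;
    for (i = 1; i < 65536u; ++i)
        wordbits[i] = (unsigned char)(wordbits[i >> 1] + (i & 1u));
}

/* Two lookups: low halfword and high halfword. */
static int popcount_table16(uint32_t x)
{
    return wordbits[x & 0xFFFFu] + wordbits[x >> 16];
}
```

`init_wordbits()` has to run before the first lookup. A 256-entry table indexed by each of the four bytes would follow the same pattern with a smaller footprint.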

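Third place goes to the bit-twiddling "sideways addition" approach, which took 8.6 seconds.

Fourth place goes to GCC's __builtin_popcount(), at a shameful 11 seconds.

The counting one-bit-at-a-time approach was much slower, and I got bored of waiting for it to finish.

So if you care about performance above all else, use the first approach. If you care, but not enough to spend 64 KB of RAM on it, use the second approach. Otherwise use the readable (but slow) one-bit-at-a-time approach.

It's hard to think of a situation where you'd want to use the bit-twiddling approach.

Edit: Similar results here.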

2008-09-25 02:46:32
