我一直认为std::vector是“作为数组实现的”,等等等等。今天我去测试了一下,结果似乎不是这样:

以下是一些测试结果:

UseArray completed in 2.619 seconds
UseVector completed in 9.284 seconds
UseVectorPushBack completed in 14.669 seconds
The whole thing completed in 26.591 seconds

这大约要慢3 - 4倍!这并不能证明“向量可能会慢几纳秒”的评论是正确的。

我使用的代码是:

#include <cstdlib>
#include <vector>

#include <iostream>
#include <string>

#include <boost/date_time/posix_time/ptime.hpp>
#include <boost/date_time/microsec_time_clock.hpp>

class TestTimer
{
    public:
        TestTimer(const std::string & name) : name(name),
            start(boost::date_time::microsec_clock<boost::posix_time::ptime>::local_time())
        {
        }

        ~TestTimer()
        {
            using namespace std;
            using namespace boost;

            posix_time::ptime now(date_time::microsec_clock<posix_time::ptime>::local_time());
            posix_time::time_duration d = now - start;

            cout << name << " completed in " << d.total_milliseconds() / 1000.0 <<
                " seconds" << endl;
        }

    private:
        std::string name;
        boost::posix_time::ptime start;
};

struct Pixel
{
    Pixel()
    {
    }

    Pixel(unsigned char r, unsigned char g, unsigned char b) : r(r), g(g), b(b)
    {
    }

    unsigned char r, g, b;
};

void UseVector()
{
    TestTimer t("UseVector");

    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;

        std::vector<Pixel> pixels;
        pixels.resize(dimension * dimension);

        for(int i = 0; i < dimension * dimension; ++i)
        {
            pixels[i].r = 255;
            pixels[i].g = 0;
            pixels[i].b = 0;
        }
    }
}

void UseVectorPushBack()
{
    TestTimer t("UseVectorPushBack");

    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;

        std::vector<Pixel> pixels;
            pixels.reserve(dimension * dimension);

        for(int i = 0; i < dimension * dimension; ++i)
            pixels.push_back(Pixel(255, 0, 0));
    }
}

void UseArray()
{
    TestTimer t("UseArray");

    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;

        Pixel * pixels = (Pixel *)malloc(sizeof(Pixel) * dimension * dimension);

        for(int i = 0 ; i < dimension * dimension; ++i)
        {
            pixels[i].r = 255;
            pixels[i].g = 0;
            pixels[i].b = 0;
        }

        free(pixels);
    }
}

int main()
{
    TestTimer t1("The whole thing");

    UseArray();
    UseVector();
    UseVectorPushBack();

    return 0;
}

我做错了吗?还是我刚刚打破了这个性能神话?

我使用Visual Studio 2005中的发布模式。


在Visual c++中,#define _SECURE_SCL 0将UseVector减少了一半(减少到4秒)。在我看来,这真的是件大事。


当前回答

好问题。我来这里是希望能找到一些简单的方法来加快矢量测试的速度。结果跟我想象的不太一样!

优化有帮助,但这还不够。通过优化,我仍然看到UseArray和UseVector之间的2X性能差异。有趣的是,UseVector明显比没有优化的UseVectorPushBack慢。

# g++ -Wall -Wextra -pedantic -o vector vector.cpp
# ./vector
UseArray completed in 20.68 seconds
UseVector completed in 120.509 seconds
UseVectorPushBack completed in 37.654 seconds
The whole thing completed in 178.845 seconds
# g++ -Wall -Wextra -pedantic -O3 -o vector vector.cpp
# ./vector
UseArray completed in 3.09 seconds
UseVector completed in 6.09 seconds
UseVectorPushBack completed in 9.847 seconds
The whole thing completed in 19.028 seconds

想法1 -使用new[]代替malloc

我尝试在UseArray中将malloc()更改为new[],以便构造对象。从单个字段分配到分配一个Pixel实例。哦,重命名内循环变量为j。

void UseArray()
{
    TestTimer t("UseArray");

    for(int i = 0; i < 1000; ++i)
    {   
        int dimension = 999;

        // Same speed as malloc().
        Pixel * pixels = new Pixel[dimension * dimension];

        for(int j = 0 ; j < dimension * dimension; ++j)
            pixels[j] = Pixel(255, 0, 0);

        delete[] pixels;
    }
}

令人惊讶的是(对我来说),这些变化没有任何不同。甚至没有更改为new[],这将默认构造所有的像素。看起来gcc在使用new[]时可以优化默认构造函数调用,但在使用vector时就不行。

想法#2 -删除重复的操作符[]调用

我还尝试摆脱三重运算符[]查找,并缓存对像素[j]的引用。这实际上降低了UseVector的速度!哦。

for(int j = 0; j < dimension * dimension; ++j)
{
    // Slower than accessing pixels[j] three times.
    Pixel &pixel = pixels[j];
    pixel.r = 255;
    pixel.g = 0;
    pixel.b = 0;
}

# ./vector 
UseArray completed in 3.226 seconds
UseVector completed in 7.54 seconds
UseVectorPushBack completed in 9.859 seconds
The whole thing completed in 20.626 seconds

想法#3 -删除构造函数

如果完全删除构造函数呢?然后,也许gcc可以在创建向量时优化所有对象的结构。如果我们把像素改为:

struct Pixel
{
    unsigned char r, g, b;
};

结果:大约快10%。还是比数组慢。嗯。

# ./vector 
UseArray completed in 3.239 seconds
UseVector completed in 5.567 seconds

想法4 -使用迭代器而不是循环索引

如何使用vector<Pixel>::iterator代替循环索引?

for (std::vector<Pixel>::iterator j = pixels.begin(); j != pixels.end(); ++j)
{
    j->r = 255;
    j->g = 0;
    j->b = 0;
}

结果:

# ./vector 
UseArray completed in 3.264 seconds
UseVector completed in 5.443 seconds

没有什么不同。至少没有变慢。我认为这将具有类似于#2的性能,其中我使用了Pixel&引用。

结论

即使一些聪明的cookie找到了如何使vector循环和数组循环一样快的方法,这也不能说明std::vector的默认行为。编译器足够聪明,可以优化所有c++特性,并使STL容器像原始数组一样快。

底线是,当使用std::vector时,编译器无法优化掉无操作的默认构造函数调用。如果你使用普通的new[],它就能很好地优化它们。但不是std::vector。即使你可以重写你的代码,以消除构造函数调用,在这里的咒语:“编译器比你聪明。STL和普通c一样快,不用担心。”

其他回答

Martin York的回答让我很困扰,因为他似乎试图掩盖初始化问题。但他将冗余的默认构造确定为性能问题的根源是正确的。

[编辑:Martin的回答不再建议更改默认构造函数。]

对于眼前的问题,你当然可以调用2参数版本的向量<Pixel> ctor:

std::vector<Pixel> pixels(dimension * dimension, Pixel(255, 0, 0));

如果你想用一个常数值初始化,这是一种常见的情况。但更普遍的问题是:如何有效地初始化比常数值更复杂的东西?

为此,您可以使用back_insert_iterator,这是一个迭代器适配器。这里有一个int类型的向量的例子,尽管一般的思想也适用于像素:

#include <iterator>
// Simple functor return a list of squares: 1, 4, 9, 16...
struct squares {
    squares() { i = 0; }
    int operator()() const { ++i; return i * i; }

private:
    int i;
};

...

std::vector<int> v;
v.reserve(someSize);     // To make insertions efficient
std::generate_n(std::back_inserter(v), someSize, squares());

或者,您可以使用copy()或transform()来代替generate_n()。

缺点是,构造初始值的逻辑需要移动到一个单独的类中,这比将其放在原位更不方便(尽管c++ 1x中的lambdas使这更好)。此外,我希望这仍然不会像基于malloc()的非stl版本那样快,但我希望它会接近,因为它只对每个元素进行一次构造。

我做了一些长期以来一直想做的广泛测试。不妨分享一下。

这是我的双启动机i7-3770, 16GB Ram, x86_64, Windows 8.1和Ubuntu 16.04。更多信息和结论,备注如下。测试了MSVS 2017和g++(在Windows和Linux上)。

测试程序

#include <iostream>
#include <chrono>
//#include <algorithm>
#include <array>
#include <locale>
#include <vector>
#include <queue>
#include <deque>

// Note: total size of array must not exceed 0x7fffffff B = 2,147,483,647B
//  which means that largest int array size is 536,870,911
// Also image size cannot be larger than 80,000,000B
constexpr int long g_size = 100000;
int g_A[g_size];


int main()
{
    std::locale loc("");
    std::cout.imbue(loc);
    constexpr int long size = 100000;  // largest array stack size

    // stack allocated c array
    std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
    int A[size];
    for (int i = 0; i < size; i++)
        A[i] = i;

    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::steady_clock::now() - start).count();
    std::cout << "c-style stack array duration=" << duration / 1000.0 << "ms\n";
    std::cout << "c-style stack array size=" << sizeof(A) << "B\n\n";

    // global stack c array
    start = std::chrono::steady_clock::now();
    for (int i = 0; i < g_size; i++)
        g_A[i] = i;

    duration = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::steady_clock::now() - start).count();
    std::cout << "global c-style stack array duration=" << duration / 1000.0 << "ms\n";
    std::cout << "global c-style stack array size=" << sizeof(g_A) << "B\n\n";

    // raw c array heap array
    start = std::chrono::steady_clock::now();
    int* AA = new int[size];    // bad_alloc() if it goes higher than 1,000,000,000
    for (int i = 0; i < size; i++)
        AA[i] = i;

    duration = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::steady_clock::now() - start).count();
    std::cout << "c-style heap array duration=" << duration / 1000.0 << "ms\n";
    std::cout << "c-style heap array size=" << sizeof(AA) << "B\n\n";
    delete[] AA;

    // std::array<>
    start = std::chrono::steady_clock::now();
    std::array<int, size> AAA;
    for (int i = 0; i < size; i++)
        AAA[i] = i;
    //std::sort(AAA.begin(), AAA.end());

    duration = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::steady_clock::now() - start).count();
    std::cout << "std::array duration=" << duration / 1000.0 << "ms\n";
    std::cout << "std::array size=" << sizeof(AAA) << "B\n\n";

    // std::vector<>
    start = std::chrono::steady_clock::now();
    std::vector<int> v;
    for (int i = 0; i < size; i++)
        v.push_back(i);
    //std::sort(v.begin(), v.end());

    duration = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::steady_clock::now() - start).count();
    std::cout << "std::vector duration=" << duration / 1000.0 << "ms\n";
    std::cout << "std::vector size=" << v.size() * sizeof(v.back()) << "B\n\n";

    // std::deque<>
    start = std::chrono::steady_clock::now();
    std::deque<int> dq;
    for (int i = 0; i < size; i++)
        dq.push_back(i);
    //std::sort(dq.begin(), dq.end());

    duration = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::steady_clock::now() - start).count();
    std::cout << "std::deque duration=" << duration / 1000.0 << "ms\n";
    std::cout << "std::deque size=" << dq.size() * sizeof(dq.back()) << "B\n\n";

    // std::queue<>
    start = std::chrono::steady_clock::now();
    std::queue<int> q;
    for (int i = 0; i < size; i++)
        q.push(i);

    duration = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::steady_clock::now() - start).count();
    std::cout << "std::queue duration=" << duration / 1000.0 << "ms\n";
    std::cout << "std::queue size=" << q.size() * sizeof(q.front()) << "B\n\n";
}

结果

//////////////////////////////////////////////////////////////////////////////////////////
// with MSVS 2017:
// >> cl /std:c++14 /Wall -O2 array_bench.cpp
//
// c-style stack array duration=0.15ms
// c-style stack array size=400,000B
//
// global c-style stack array duration=0.130ms
// global c-style stack array size=400,000B
//
// c-style heap array duration=0.90ms
// c-style heap array size=4B
//
// std::array duration=0.20ms
// std::array size=400,000B
//
// std::vector duration=0.544ms
// std::vector size=400,000B
//
// std::deque duration=1.375ms
// std::deque size=400,000B
//
// std::queue duration=1.491ms
// std::queue size=400,000B
//
//////////////////////////////////////////////////////////////////////////////////////////
//
// with g++ version:
//      - (tdm64-1) 5.1.0 on Windows
//      - (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609 on Ubuntu 16.04
// >> g++ -std=c++14 -Wall -march=native -O2 array_bench.cpp -o array_bench
//
// c-style stack array duration=0ms
// c-style stack array size=400,000B
//
// global c-style stack array duration=0.124ms
// global c-style stack array size=400,000B
//
// c-style heap array duration=0.648ms
// c-style heap array size=8B
//
// std::array duration=1ms
// std::array size=400,000B
//
// std::vector duration=0.402ms
// std::vector size=400,000B
//
// std::deque duration=0.234ms
// std::deque size=400,000B
//
// std::queue duration=0.304ms
// std::queue size=400,000
//
//////////////////////////////////////////////////////////////////////////////////////////

笔记

平均10次组装。 我最初也使用std::sort()执行测试(您可以看到它被注释掉了),但后来删除了它们,因为没有显著的相对差异。

我的结论和评论

notice how global c-style array takes almost as much time as the heap c-style array Out of all tests I noticed a remarkable stability in std::array's time variations between consecutive runs, while others especially std:: data structs varied wildly in comparison O3 optimization didn't show any noteworthy time differences Removing optimization on Windows cl (no -O2) and on g++ (Win/Linux no -O2, no -march=native) increases times SIGNIFICANTLY. Particularly for std::data structs. Overall higher times on MSVS than g++, but std::array and c-style arrays faster on Windows without optimization g++ produces faster code than microsoft's compiler (apparently it runs faster even on Windows).

判决

当然,这是用于优化构建的代码。既然问题是关于std::vector,那么是的,它是!比普通数组(优化/未优化)慢。但是当您进行基准测试时,您自然希望生成优化的代码。

对我来说,这个节目的明星是std::array。

我不得不说我不是c++方面的专家。但要补充一些实验结果:

编译: gcc-6.2.0/bin/g++ -O3 -std=c++14 vector.cpp

机:

Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz 

OS:

2.6.32-642.13.1.el6.x86_64

输出:

UseArray completed in 0.167821 seconds
UseVector completed in 0.134402 seconds
UseConstructor completed in 0.134806 seconds
UseFillConstructor completed in 1.00279 seconds
UseVectorPushBack completed in 6.6887 seconds
The whole thing completed in 8.12888 seconds

这里我唯一感到奇怪的是“UseFillConstructor”的性能与“UseConstructor”相比。

代码:

void UseConstructor()
{
    TestTimer t("UseConstructor");

    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;

        std::vector<Pixel> pixels(dimension*dimension);
        for(int i = 0; i < dimension * dimension; ++i)
        {
            pixels[i].r = 255;
            pixels[i].g = 0;
            pixels[i].b = 0;
        }
    }
}


void UseFillConstructor()
{
    TestTimer t("UseFillConstructor");

    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;

        std::vector<Pixel> pixels(dimension*dimension, Pixel(255,0,0));
    }
}

因此提供的额外“值”大大降低了性能,我认为这是由于多次调用复制构造函数造成的。但是…

编译:

gcc-6.2.0/bin/g++ -std=c++14 -O vector.cpp

输出:

UseArray completed in 1.02464 seconds
UseVector completed in 1.31056 seconds
UseConstructor completed in 1.47413 seconds
UseFillConstructor completed in 1.01555 seconds
UseVectorPushBack completed in 6.9597 seconds
The whole thing completed in 11.7851 seconds

因此,在这种情况下,gcc优化非常重要,但当一个值作为默认值提供时,它帮不了你太多。这,其实是对我的学费。希望它能帮助新程序员选择哪种矢量初始化格式。

一些分析器数据(像素对齐为32位):

g++ -msse3 -O3 -ftree-vectorize -g test.cpp -DNDEBUG && ./a.out
UseVector completed in 3.123 seconds
UseArray completed in 1.847 seconds
UseVectorPushBack completed in 9.186 seconds
The whole thing completed in 14.159 seconds

Blah

andrey@nv:~$ opannotate --source libcchem/src/a.out  | grep "Total samples for file" -A3
Overflow stats not available
 * Total samples for file : "/usr/include/c++/4.4/ext/new_allocator.h"
 *
 * 141008 52.5367
 */
--
 * Total samples for file : "/home/andrey/libcchem/src/test.cpp"
 *
 *  61556 22.9345
 */
--
 * Total samples for file : "/usr/include/c++/4.4/bits/stl_vector.h"
 *
 *  41956 15.6320
 */
--
 * Total samples for file : "/usr/include/c++/4.4/bits/stl_uninitialized.h"
 *
 *  20956  7.8078
 */
--
 * Total samples for file : "/usr/include/c++/4.4/bits/stl_construct.h"
 *
 *   2923  1.0891
 */

在分配器:

               :      // _GLIBCXX_RESOLVE_LIB_DEFECTS
               :      // 402. wrong new expression in [some_] allocator::construct
               :      void
               :      construct(pointer __p, const _Tp& __val)
141008 52.5367 :      { ::new((void *)__p) _Tp(__val); }

向量:

               :void UseVector()
               :{ /* UseVector() total:  60121 22.3999 */
...
               :
               :
 10790  4.0201 :        for (int i = 0; i < dimension * dimension; ++i) {
               :
   495  0.1844 :            pixels[i].r = 255;
               :
 12618  4.7012 :            pixels[i].g = 0;
               :
  2253  0.8394 :            pixels[i].b = 0;
               :
               :        }

数组

               :void UseArray()
               :{ /* UseArray() total:  35191 13.1114 */
               :
...
               :
   136  0.0507 :        for (int i = 0; i < dimension * dimension; ++i) {
               :
  9897  3.6874 :            pixels[i].r = 255;
               :
  3511  1.3081 :            pixels[i].g = 0;
               :
 21647  8.0652 :            pixels[i].b = 0;

大部分开销都在复制构造函数中。例如,

    std::vector < Pixel > pixels;//(dimension * dimension, Pixel());

    pixels.reserve(dimension * dimension);

    for (int i = 0; i < dimension * dimension; ++i) {

        pixels[i].r = 255;

        pixels[i].g = 0;

        pixels[i].b = 0;
    }

它具有与数组相同的性能。

尝试禁用检查迭代器并在发布模式下构建。您应该不会看到太大的性能差异。