为什么在 C + + 中拆分字符串比 Python 慢?

我正在尝试将一些 Python 代码转换成 C + + ,以便获得一点速度,并提高我生疏的 C + + 技能。昨天,当一个从 stdin 读取代码行的幼稚实现在 Python 中比 C + + 快得多时,我感到非常震惊(参见 这个)。今天,我终于弄明白了如何在 C + + 中使用合并分隔符来分隔字符串(与 python 的 split ()语义相似) ,现在我有种似曾相识的感觉!我的 c + + 代码需要花费更长的时间来完成这项工作(尽管不是像昨天的课程那样多一个数量级)。

Python 代码:

#!/usr/bin/env python
from __future__ import print_function
import time
import sys


count = 0
start_time = time.time()
dummy = None


for line in sys.stdin:
dummy = line.split()
count += 1


delta_sec = int(time.time() - start_time)
print("Python: Saw {0} lines in {1} seconds. ".format(count, delta_sec), end='')
if delta_sec > 0:
lps = int(count/delta_sec)
print("  Crunch Speed: {0}".format(lps))
else:
print('')

C + + 代码:

#include <iostream>
#include <string>
#include <sstream>
#include <time.h>
#include <vector>


using namespace std;


void split1(vector<string> &tokens, const string &str,
const string &delimiters = " ") {
// Skip delimiters at beginning
string::size_type lastPos = str.find_first_not_of(delimiters, 0);


// Find first non-delimiter
string::size_type pos = str.find_first_of(delimiters, lastPos);


while (string::npos != pos || string::npos != lastPos) {
// Found a token, add it to the vector
tokens.push_back(str.substr(lastPos, pos - lastPos));
// Skip delimiters
lastPos = str.find_first_not_of(delimiters, pos);
// Find next non-delimiter
pos = str.find_first_of(delimiters, lastPos);
}
}


void split2(vector<string> &tokens, const string &str, char delim=' ') {
stringstream ss(str); //convert string to stream
string item;
while(getline(ss, item, delim)) {
tokens.push_back(item); //add token to vector
}
}


int main() {
string input_line;
vector<string> spline;
long count = 0;
int sec, lps;
time_t start = time(NULL);


cin.sync_with_stdio(false); //disable synchronous IO


while(cin) {
getline(cin, input_line);
spline.clear(); //empty the vector for the next line to parse


//I'm trying one of the two implementations, per compilation, obviously:
//        split1(spline, input_line);
split2(spline, input_line);


count++;
};


count--; //subtract for final over-read
sec = (int) time(NULL) - start;
cerr << "C++   : Saw " << count << " lines in " << sec << " seconds." ;
if (sec > 0) {
lps = count / sec;
cerr << "  Crunch speed: " << lps << endl;
} else
cerr << endl;
return 0;


//compiled with: g++ -Wall -O3 -o split1 split_1.cpp

请注意,我尝试了两种不同的拆分实现。其中一个(split1)使用字符串方法来搜索令牌,并且能够合并多个令牌以及处理多个令牌(它来自 给你)。第二个(split2)使用 getline 将字符串读取为流,不合并分隔符,并且只支持一个分隔符字符(这个字符是几个 StackOverflow 用户在回答字符串分隔问题时发布的)。

我按不同顺序运行了好几次。我的测试机是一台 Macbook Pro (2011,8 GB,四核) ,这并不重要。我正在测试一个20M 行的文本文件,其中包含三个空格分隔的列,每个列看起来都类似于下面的内容: “ foo.bar 127.0.0.1 home.foo.bar”

结果:

$ /usr/bin/time cat test_lines_double | ./split.py
15.61 real         0.01 user         0.38 sys
Python: Saw 20000000 lines in 15 seconds.   Crunch Speed: 1333333
$ /usr/bin/time cat test_lines_double | ./split1
23.50 real         0.01 user         0.46 sys
C++   : Saw 20000000 lines in 23 seconds.  Crunch speed: 869565
$ /usr/bin/time cat test_lines_double | ./split2
44.69 real         0.02 user         0.62 sys
C++   : Saw 20000000 lines in 45 seconds.  Crunch speed: 444444

我做错了什么?在 C + + 中,有没有一种更好的方法来进行字符串分割,它不依赖于外部库(也就是说,没有提升) ,支持合并分隔符序列(比如 python 分割) ,是线程安全的(所以没有 strtok) ,而且其性能至少与 python 相当?

编辑1/部分解决方案? :

我尝试通过让 python 重置虚拟列表并每次都将其添加到列表中来进行更公平的比较,就像 C + + 所做的那样。这仍然不完全是 C + + 代码正在做的事情,但它更接近。基本上,现在的循环是:

for line in sys.stdin:
dummy = []
dummy += line.split()
count += 1

现在,python 的性能与 split1C + + 实现大致相同。

/usr/bin/time cat test_lines_double | ./split5.py
22.61 real         0.01 user         0.40 sys
Python: Saw 20000000 lines in 22 seconds.   Crunch Speed: 909090

即使 Python 针对字符串处理进行了如此优化(如 Matt Joiner 所建议的) ,这些 C + + 实现也不会更快,这仍然让我感到惊讶。如果有人对如何使用 C + + 以更优化的方式实现这一点有什么想法,请与我们分享您的代码。(我认为我的下一步将尝试用纯 C 语言实现它,尽管我不会牺牲程序员的工作效率来用 C 语言重新实现我的整个项目,所以这将只是一个字符串分解速度的实验。)

感谢大家的帮助。

最终编辑/解决方案:

请看看阿尔夫接受的答案。由于 python 严格地通过引用处理字符串,并且常常复制 STL 字符串,因此普通的 python 实现的性能更好。为了进行比较,我编译并运行了 Alf 的代码,下面是在同一台机器上运行的所有代码的性能,基本上与初始的 python 实现相同(尽管比重置/追加列表的 python 实现更快,如上面的编辑所示) :

$ /usr/bin/time cat test_lines_double | ./split6
15.09 real         0.01 user         0.45 sys
C++   : Saw 20000000 lines in 15 seconds.  Crunch speed: 1333333

我仅有的一点小小的抱怨是关于在这种情况下执行 C + + 所需的代码量。

从这个问题和昨天的标准输入行阅读问题(链接在上面)中得到的教训之一是,人们应该始终进行基准测试,而不是对语言的相对“默认”性能做出幼稚的假设。我很感激你的教育。

再次感谢大家的建议!

17922 次浏览

I suspect that this is related to buffering on sys.stdin in Python, but no buffering in the C++ implementation.

See this post for details on how to change the buffer size, then try the comparison again: Setting smaller buffer size for sys.stdin?

I suspect that this is because of the way std::vector gets resized during the process of a push_back() function call. If you try using std::list or std::vector::reserve() to reserve enough space for the sentences, you should get a much better performance. Or you could use a combination of both like below for split1():

void split1(vector<string> &tokens, const string &str,
const string &delimiters = " ") {
// Skip delimiters at beginning
string::size_type lastPos = str.find_first_not_of(delimiters, 0);


// Find first non-delimiter
string::size_type pos = str.find_first_of(delimiters, lastPos);
list<string> token_list;


while (string::npos != pos || string::npos != lastPos) {
// Found a token, add it to the list
token_list.push_back(str.substr(lastPos, pos - lastPos));
// Skip delimiters
lastPos = str.find_first_not_of(delimiters, pos);
// Find next non-delimiter
pos = str.find_first_of(delimiters, lastPos);
}
tokens.assign(token_list.begin(), token_list.end());
}

EDIT: The other obvious thing I see is that Python variable dummy gets assigned each time but not modified. So it's not a fair comparison against C++. You should try modifying your Python code to be dummy = [] to initialize it and then do dummy += line.split(). Can you report the runtime after this?

EDIT2: To make it even more fair can you modify the while loop in C++ code to be:

    while(cin) {
getline(cin, input_line);
std::vector<string> spline; // create a new vector


//I'm trying one of the two implementations, per compilation, obviously:
//        split1(spline, input_line);
split2(spline, input_line);


count++;
};

You're making the mistaken assumption that your chosen C++ implementation is necessarily faster than Python's. String handling in Python is highly optimized. See this question for more: Why do std::string operations perform poorly?

As a guess, Python strings are reference counted immutable strings, so that no strings are copied around in the Python code, while C++ std::string is a mutable value type, and is copied at the smallest opportunity.

If the goal is fast splitting, then one would use constant time substring operations, which means only referring to parts of the original string, as in Python (and Java, and C#…).

The C++ std::string class has one redeeming feature, though: it is standard, so that it can be used to pass strings safely and portably around where efficiency is not a main consideration. But enough chat. Code -- and on my machine this is of course faster than Python, since Python's string handling is implemented in C which is a subset of C++ (he he):

#include <iostream>
#include <string>
#include <sstream>
#include <time.h>
#include <vector>


using namespace std;


class StringRef
{
private:
char const*     begin_;
int             size_;


public:
int size() const { return size_; }
char const* begin() const { return begin_; }
char const* end() const { return begin_ + size_; }


StringRef( char const* const begin, int const size )
: begin_( begin )
, size_( size )
{}
};


vector<StringRef> split3( string const& str, char delimiter = ' ' )
{
vector<StringRef>   result;


enum State { inSpace, inToken };


State state = inSpace;
char const*     pTokenBegin = 0;    // Init to satisfy compiler.
for( auto it = str.begin(); it != str.end(); ++it )
{
State const newState = (*it == delimiter? inSpace : inToken);
if( newState != state )
{
switch( newState )
{
case inSpace:
result.push_back( StringRef( pTokenBegin, &*it - pTokenBegin ) );
break;
case inToken:
pTokenBegin = &*it;
}
}
state = newState;
}
if( state == inToken )
{
result.push_back( StringRef( pTokenBegin, &*str.end() - pTokenBegin ) );
}
return result;
}


int main() {
string input_line;
vector<string> spline;
long count = 0;
int sec, lps;
time_t start = time(NULL);


cin.sync_with_stdio(false); //disable synchronous IO


while(cin) {
getline(cin, input_line);
//spline.clear(); //empty the vector for the next line to parse


//I'm trying one of the two implementations, per compilation, obviously:
//        split1(spline, input_line);
//split2(spline, input_line);


vector<StringRef> const v = split3( input_line );
count++;
};


count--; //subtract for final over-read
sec = (int) time(NULL) - start;
cerr << "C++   : Saw " << count << " lines in " << sec << " seconds." ;
if (sec > 0) {
lps = count / sec;
cerr << "  Crunch speed: " << lps << endl;
} else
cerr << endl;
return 0;
}


//compiled with: g++ -Wall -O3 -o split1 split_1.cpp -std=c++0x

Disclaimer: I hope there aren't any bugs. I haven't tested the functionality, but only checked the speed. But I think, even if there is a bug or two, correcting that won't significantly affect the speed.

void split5(vector<string> &tokens, const string &str, char delim=' ') {


enum { do_token, do_delim } state = do_delim;
int idx = 0, tok_start = 0;
for (string::const_iterator it = str.begin() ; ; ++it, ++idx) {
switch (state) {
case do_token:
if (it == str.end()) {
tokens.push_back (str.substr(tok_start, idx-tok_start));
return;
}
else if (*it == delim) {
state = do_delim;
tokens.push_back (str.substr(tok_start, idx-tok_start));
}
break;


case do_delim:
if (it == str.end()) {
return;
}
if (*it != delim) {
state = do_token;
tok_start = idx;
}
break;
}
}
}

I'm not providing any better solutions (at least performance-wise), but some additional data that could be interesting.

Using strtok_r (reentrant variant of strtok):

void splitc1(vector<string> &tokens, const string &str,
const string &delimiters = " ") {
char *saveptr;
char *cpy, *token;


cpy = (char*)malloc(str.size() + 1);
strcpy(cpy, str.c_str());


for(token = strtok_r(cpy, delimiters.c_str(), &saveptr);
token != NULL;
token = strtok_r(NULL, delimiters.c_str(), &saveptr)) {
tokens.push_back(string(token));
}


free(cpy);
}

Additionally using character strings for parameters, and fgets for input:

void splitc2(vector<string> &tokens, const char *str,
const char *delimiters) {
char *saveptr;
char *cpy, *token;


cpy = (char*)malloc(strlen(str) + 1);
strcpy(cpy, str);


for(token = strtok_r(cpy, delimiters, &saveptr);
token != NULL;
token = strtok_r(NULL, delimiters, &saveptr)) {
tokens.push_back(string(token));
}


free(cpy);
}

And, in some cases, where destroying the input string is acceptable:

void splitc3(vector<string> &tokens, char *str,
const char *delimiters) {
char *saveptr;
char *token;


for(token = strtok_r(str, delimiters, &saveptr);
token != NULL;
token = strtok_r(NULL, delimiters, &saveptr)) {
tokens.push_back(string(token));
}
}

The timings for these are as follows (including my results for the other variants from the question and the accepted answer):

split1.cpp:  C++   : Saw 20000000 lines in 31 seconds.  Crunch speed: 645161
split2.cpp:  C++   : Saw 20000000 lines in 45 seconds.  Crunch speed: 444444
split.py:    Python: Saw 20000000 lines in 33 seconds.  Crunch Speed: 606060
split5.py:   Python: Saw 20000000 lines in 35 seconds.  Crunch Speed: 571428
split6.cpp:  C++   : Saw 20000000 lines in 18 seconds.  Crunch speed: 1111111


splitc1.cpp: C++   : Saw 20000000 lines in 27 seconds.  Crunch speed: 740740
splitc2.cpp: C++   : Saw 20000000 lines in 22 seconds.  Crunch speed: 909090
splitc3.cpp: C++   : Saw 20000000 lines in 20 seconds.  Crunch speed: 1000000

As we can see, the solution from the accepted answer is still fastest.

For anyone who would want to do further tests, I also put up a Github repo with all the programs from the question, the accepted answer, this answer, and additionally a Makefile and a script to generate test data: https://github.com/tobbez/string-splitting.

If you take the split1 implementaion and change the signature to more closely match that of split2, by changing this:

void split1(vector<string> &tokens, const string &str, const string &delimiters = " ")

to this:

void split1(vector<string> &tokens, const string &str, const char delimiters = ' ')

You get a more dramatic difference between split1 and split2, and a fairer comparison:

split1  C++   : Saw 10000000 lines in 41 seconds.  Crunch speed: 243902
split2  C++   : Saw 10000000 lines in 144 seconds.  Crunch speed: 69444
split1' C++   : Saw 10000000 lines in 33 seconds.  Crunch speed: 303030

I think the following code is better, using some C++17 and C++14 features:

// These codes are un-tested when I write this post, but I'll test it
// When I'm free, and I sincerely welcome others to test and modify this
// code.


// C++17
#include <istream>     // For std::istream.
#include <string_view> // new feature in C++17, sizeof(std::string_view) == 16 in libc++ on my x86-64 debian 9.4 computer.
#include <string>
#include <utility>     // C++14 feature std::move.


template <template <class...> class Container, class Allocator>
void split1(Container<std::string_view, Allocator> &tokens,
std::string_view str,
std::string_view delimiter = " ")
{
/*
* The model of the input string:
*
* (optional) delimiter | content | delimiter | content | delimiter|
* ... | delimiter | content
*
* Using std::string::find_first_not_of or
* std::string_view::find_first_not_of is a bad idea, because it
* actually does the following thing:
*
*     Finds the first character not equal to any of the characters
*     in the given character sequence.
*
* Which means it does not treeat your delimiters as a whole, but as
* a group of characters.
*
* This has 2 effects:
*
*  1. When your delimiters is not a single character, this function
*  won't behave as you predicted.
*
*  2. When your delimiters is just a single character, the function
*  may have an additional overhead due to the fact that it has to
*  check every character with a range of characters, although
* there's only one, but in order to assure the correctness, it still
* has an inner loop, which adds to the overhead.
*
* So, as a solution, I wrote the following code.
*
* The code below will skip the first delimiter prefix.
* However, if there's nothing between 2 delimiter, this code'll
* still treat as if there's sth. there.
*
* Note:
* Here I use C++ std version of substring search algorithm, but u
* can change it to Boyer-Moore, KMP(takes additional memory),
* Rabin-Karp and other algorithm to speed your code.
*
*/


// Establish the loop invariant 1.
typename std::string_view::size_type
next,
delimiter_size = delimiter.size(),
pos = str.find(delimiter) ? 0 : delimiter_size;


// The loop invariant:
//  1. At pos, it is the content that should be saved.
//  2. The next pos of delimiter is stored in next, which could be 0
//  or std::string_view::npos.


do {
// Find the next delimiter, maintain loop invariant 2.
next = str.find(delimiter, pos);


// Found a token, add it to the vector
tokens.push_back(str.substr(pos, next));


// Skip delimiters, maintain the loop invariant 1.
//
// @ next is the size of the just pushed token.
// Because when next == std::string_view::npos, the loop will
// terminate, so it doesn't matter even if the following
// expression have undefined behavior due to the overflow of
// argument.
pos = next + delimiter_size;
} while(next != std::string_view::npos);
}


template <template <class...> class Container, class traits, class Allocator2, class Allocator>
void split2(Container<std::basic_string<char, traits, Allocator2>, Allocator> &tokens,
std::istream &stream,
char delimiter = ' ')
{
std::string<char, traits, Allocator2> item;


// Unfortunately, std::getline can only accept a single-character
// delimiter.
while(std::getline(stream, item, delimiter))
// Move item into token. I haven't checked whether item can be
// reused after being moved.
tokens.push_back(std::move(item));
}

The choice of container:

  1. std::vector.

    Assuming the initial size of allocated internal array is 1, and the ultimate size is N, you will allocate and deallocate for log2(N) times, and you will copy the (2 ^ (log2(N) + 1) - 1) = (2N - 1) times. As pointed out in Is the poor performance of std::vector due to not calling realloc a logarithmic number of times?, this can have a poor performance when the size of vector is unpredictable and could be very large. But, if you can estimate the size of it, this'll be less a problem.

  2. std::list.

    For every push_back, the time it consumed is a constant, but it'll probably takes more time than std::vector on individual push_back. Using a per-thread memory pool and a custom allocator can ease this problem.

  3. std::forward_list.

    Same as std::list, but occupy less memory per element. Require a wrapper class to work due to the lack of API push_back.

  4. std::array.

    If you can know the limit of growth, then you can use std::array. Of cause, you can't use it directly, since it doesn't have the API push_back. But you can define a wrapper, and I think it's the fastest way here and can save some memory if your estimation is quite accurate.

  5. std::deque.

    This option allows you to trade memory for performance. There'll be no (2 ^ (N + 1) - 1) times copy of element, just N times allocation, and no deallocation. Also, you'll has constant random access time, and the ability to add new elements at both ends.

According to std::deque-cppreference

On the other hand, deques typically have large minimal memory cost; a deque holding just one element has to allocate its full internal array (e.g. 8 times the object size on 64-bit libstdc++; 16 times the object size or 4096 bytes, whichever is larger, on 64-bit libc++)

or you can use combo of these:

  1. std::vector< std::array<T, 2 ^ M> >

    This is similar to std::deque, the difference is just this container doesn't support to add element at the front. But it is still faster in performance, due to the fact that it won't copy the underlying std::array for (2 ^ (N + 1) - 1) times, it'll just copy the pointer array for (2 ^ (N - M + 1) - 1) times, and allocating new array only when the current is full and doesn't need to deallocate anything. By the way, you can get constant random access time.

  2. std::list< std::array<T, ...> >

    Greatly ease the pressure of memory framentation. It will only allocate new array when the current is full, and does not need to copy anything. You will still have to pay the price for an additional pointer conpared to combo 1.

  3. std::forward_list< std::array<T, ...> >

    Same as 2, but cost the same memory as combo 1.