Java has a convenient split method:

String str = "The quick brown fox";
String[] results = str.split(" ");

Is there an easy way to do this in C++?


Current answer

You can use streams, iterators, and the copy algorithm to do this fairly directly.

#include <string>
#include <vector>
#include <iostream>
#include <istream>
#include <ostream>
#include <iterator>
#include <sstream>
#include <algorithm>

int main()
{
  std::string str = "The quick brown fox";

  // construct a stream from the string
  std::stringstream strstr(str);

  // use stream iterators to copy the stream to the vector as whitespace separated strings
  std::istream_iterator<std::string> it(strstr);
  std::istream_iterator<std::string> end;
  std::vector<std::string> results(it, end);

  // send the vector to stdout.
  std::ostream_iterator<std::string> oit(std::cout);
  std::copy(results.begin(), results.end(), oit);
}

Other answers

I think that's what the >> operator on stringstreams is for:

string word; sin >> word;
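For instance, a minimal sketch of that idea, assuming sin is an istringstream built from the input:

#include <iostream>
#include <sstream>
#include <string>
#include <vector>

int main()
{
    std::istringstream sin("The quick brown fox");
    std::vector<std::string> words;
    std::string word;

    // operator>> skips whitespace and extracts one whitespace-delimited token at a time
    while (sin >> word)
        words.push_back(word);

    for (const auto& w : words)
        std::cout << w << '\n';
}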

I've previously written a lexer/tokenizer using only the standard library. Here's the code:

#include <iostream>
#include <string>
#include <vector>
#include <sstream>
#include <cstdlib> // for system()

using namespace std;

// Return a copy of s with '|' inserted between every pair of characters.
string seps(string& s) {
    if (!s.size()) return "";
    stringstream ss;
    ss << s[0];
    for (string::size_type i = 1; i < s.size(); i++) {
        ss << '|' << s[i];
    }
    return ss.str();
}

void Tokenize(string& str, vector<string>& tokens, const string& delimiters = " ")
{
    // Put '|' between every character so the "|" delimiter below splits the
    // input into single-character tokens.
    str = seps(str);

    // Skip delimiters at beginning.
    string::size_type lastPos = str.find_first_not_of(delimiters, 0);
    // Find first "non-delimiter".
    string::size_type pos = str.find_first_of(delimiters, lastPos);

    while (string::npos != pos || string::npos != lastPos)
    {
        // Found a token, add it to the vector.
        tokens.push_back(str.substr(lastPos, pos - lastPos));
        // Skip delimiters.  Note the "not_of"
        lastPos = str.find_first_not_of(delimiters, pos);
        // Find next "non-delimiter"
        pos = str.find_first_of(delimiters, lastPos);
    }
}

int main(int argc, char *argv[])
{
    vector<string> t;
    string s = "Tokens for everyone!";

    Tokenize(s, t, "|");

    for (const auto& c : t)
        cout << c << endl;

    system("pause");

    return 0;
}

Adam Pierce's answer provides a hand-spun tokenizer taking a const char*. It's a bit more problematic to do with iterators, because incrementing a string's end iterator is undefined. That said, given string str{ "The quick brown fox" }, we can certainly accomplish this:

auto start = find(cbegin(str), cend(str), ' ');
vector<string> tokens{ string(cbegin(str), start) };

while (start != cend(str)) {
    const auto finish = find(++start, cend(str), ' ');

    tokens.push_back(string(start, finish));
    start = finish;
}

Live example


If you wish to abstract away this complexity by using standard functionality, On Freund suggests strtok is a simple option:

vector<string> tokens;

for (auto i = strtok(data(str), " "); i != nullptr; i = strtok(nullptr, " ")) tokens.push_back(i);

If you don't have access to C++17, you'll need to substitute data(str), as in this example: http://ideone.com/8kAGoa
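For instance, a minimal pre-C++17 sketch where &str[0] stands in for data(str); both yield a modifiable, null-terminated buffer as of C++11:

#include <cstring>
#include <iostream>
#include <string>
#include <vector>

int main()
{
    std::string str = "The quick brown fox";
    std::vector<std::string> tokens;

    // &str[0] replaces data(str): strtok needs a writable buffer because it
    // overwrites each delimiter it finds with '\0'
    for (auto i = std::strtok(&str[0], " "); i != nullptr; i = std::strtok(nullptr, " "))
        tokens.push_back(i);

    for (const auto& t : tokens)
        std::cout << t << '\n';
}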

Though not demonstrated in the example, strtok need not use the same delimiter for each token. Along with this advantage, though, there are several drawbacks:

- strtok cannot be used on multiple strings at the same time: either a nullptr must be passed to continue tokenizing the current string, or a new char* to tokenize must be passed (there are some non-standard implementations which do support this, however, such as strtok_s).
- For the same reason, strtok cannot be used on multiple threads simultaneously (this may however be implementation defined; for example, Visual Studio's implementation is thread safe).
- Calling strtok modifies the string it is operating on, so it cannot be used on const strings, const char*s, or literal strings. To tokenize any of these with strtok, or to operate on a string whose contents need to be preserved, str would have to be copied and the copy operated on (see the sketch below).
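A minimal sketch of that last point, using a hypothetical helper that tokenizes a const string by letting strtok chew on a local copy:

#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: strtok mutates its argument, so work on a copy and
// leave the caller's (possibly const) string untouched.
std::vector<std::string> tokenize_copy(const std::string& input, const char* delimiters)
{
    std::string copy = input;  // sacrificial buffer for strtok
    std::vector<std::string> tokens;
    for (auto i = std::strtok(&copy[0], delimiters); i != nullptr; i = std::strtok(nullptr, delimiters))
        tokens.push_back(i);
    return tokens;
}

Usage would then be, for example: const auto tokens = tokenize_copy("The quick brown fox", " ");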


C++20 gives us split_view to tokenize strings in a non-destructive manner: https://topanswers.xyz/cplusplus?q=749#a874
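A minimal sketch, assuming a standard library that implements the reworked C++20 std::views::split (P2210R2):

#include <iostream>
#include <ranges>
#include <string>
#include <string_view>

int main()
{
    const std::string str = "The quick brown fox";

    // views::split never modifies the input; it yields subranges pointing into it
    for (auto word : std::views::split(str, ' '))
        std::cout << std::string_view(word.begin(), word.end()) << '\n';
}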


The previous methods cannot generate a tokenized vector in place, meaning that without abstracting them into a helper function they cannot initialize a const vector<string> tokens. That functionality, and the ability to accept any whitespace delimiter, can be harnessed using an istream_iterator. For example, given const string str{ "The quick \tbrown \nfox" } we can do this:

istringstream is{ str };
const vector<string> tokens{ istream_iterator<string>(is), istream_iterator<string>() };

Live example

The required construction of an istringstream makes this option far more expensive than the previous two, but that cost is typically hidden in the expense of string allocation.


If none of the above options are flexible enough for your tokenization needs, the most flexible option is a regex_token_iterator. Of course this flexibility comes at greater expense, but again that is likely hidden in string allocation costs. Say, for example, we want to tokenize based on non-escaped commas, also eating whitespace; then, given the input const string str{ " the,qu\\,ick,\tbrown, fox" }, we can do this:

const regex re{ "\\s*((?:[^\\\\,]|\\\\.)*?)\\s*(?:,|$)" };
const vector<string> tokens{ sregex_token_iterator(cbegin(str), cend(str), re, 1), sregex_token_iterator() };

Live example

I know you asked for a C++ solution, but you might find this helpful:

Qt

#include <QString>

...

QString str = "The quick brown fox"; 
QStringList results = str.split(" "); 

The advantage over Boost in this example is that it maps directly, one to one, to the code in your post.

See the Qt documentation for more details.

Boost::tokenizer is your friend, but consider making your code portable with respect to internationalization (i18n) issues by using wstring/wchar_t instead of the legacy string/char types.

#include <iostream>
#include <boost/tokenizer.hpp>
#include <string>

using namespace std;
using namespace boost;

typedef tokenizer<char_separator<wchar_t>,
                  wstring::const_iterator, wstring> Tok;

int main()
{
  wstring s;
  while (getline(wcin, s)) {
    char_separator<wchar_t> sep(L" "); // list of separator characters
    Tok tok(s, sep);
    for (Tok::iterator beg = tok.begin(); beg != tok.end(); ++beg) {
      wcout << *beg << L"\t"; // output (or store in vector)
    }
    wcout << L"\n";
  }
  return 0;
}