Java有一个方便的分割方法:

String str = "The quick brown fox";
String[] results = str.split(" ");

在c++中有简单的方法来做到这一点吗?


当前回答

下面是一种方法,允许您控制是否包含空标记(如strsep)或排除空标记(如strtok)。

#include <string.h> // for strchr and strlen

/*
 * want_empty_tokens==true  : include empty tokens, like strsep()
 * want_empty_tokens==false : exclude empty tokens, like strtok()
 */
std::vector<std::string> tokenize(const char* src,
                                  char delim,
                                  bool want_empty_tokens)
{
  std::vector<std::string> tokens;

  if (src and *src != '\0') // defensive
    while( true )  {
      const char* d = strchr(src, delim);
      size_t len = (d)? d-src : strlen(src);

      if (len or want_empty_tokens)
        tokens.push_back( std::string(src, len) ); // capture token

      if (d) src += len+1; else break;
    }

  return tokens;
}

其他回答

Adam Pierce的回答提供了一个采用const char*的手工标记器。使用迭代器会有一些问题,因为对字符串的结束迭代器进行递增是未定义的。也就是说,给定字符串str{"The quick brown fox"},我们当然可以做到:

auto start = find(cbegin(str), cend(str), ' ');
vector<string> tokens{ string(cbegin(str), start) };

while (start != cend(str)) {
    const auto finish = find(++start, cend(str), ' ');

    tokens.push_back(string(start, finish));
    start = finish;
}

生活的例子


如果你想通过使用标准功能来抽象复杂性,On Freund建议strtok是一个简单的选择:

vector<string> tokens;

for (auto i = strtok(data(str), " "); i != nullptr; i = strtok(nullptr, " ")) tokens.push_back(i);

如果你不能访问c++ 17,你需要像这个例子一样替换data(str): http://ideone.com/8kAGoa

虽然在示例中没有演示,但strtok不需要为每个标记使用相同的分隔符。除了这个优势,还有几个缺点:

strtok cannot be used on multiple strings at the same time: Either a nullptr must be passed to continue tokenizing the current string or a new char* to tokenize must be passed (there are some non-standard implementations which do support this however, such as: strtok_s) For the same reason strtok cannot be used on multiple threads simultaneously (this may however be implementation defined, for example: Visual Studio's implementation is thread safe) Calling strtok modifies the string it is operating on, so it cannot be used on const strings, const char*s, or literal strings, to tokenize any of these with strtok or to operate on a string who's contents need to be preserved, str would have to be copied, then the copy could be operated on


c++20为我们提供了split_view来以非破坏性的方式标记字符串:https://topanswers.xyz/cplusplus?q=749#a874


前面的方法不能就地生成标记化的向量,这意味着如果不将它们抽象为辅助函数,它们就不能初始化const vector<string>令牌。该功能和接受任何空白分隔符的能力可以使用istream_iterator来利用。例如,给定const string str{"The quick \tbrown \nfox"},我们可以这样做:

istringstream is{ str };
const vector<string> tokens{ istream_iterator<string>(is), istream_iterator<string>() };

生活的例子

对于这个选项,需要构造一个istringstream的代价比前面两个选项要大得多,但是这个代价通常隐藏在字符串分配的代价中。


如果上面的选项都不够灵活,不能满足您的标记化需求,那么最灵活的选项是使用regex_token_iterator,当然这种灵活性会带来更大的开销,但同样,这可能隐藏在字符串分配成本中。例如,我们想要基于非转义的逗号进行标记化,也吃空白,给定以下输入:const string str{" the,qu\\,ick,\tbrown, fox"}我们可以这样做:

const regex re{ "\\s*((?:[^\\\\,]|\\\\.)*?)\\s*(?:,|$)" };
const vector<string> tokens{ sregex_token_iterator(cbegin(str), cend(str), re, 1), sregex_token_iterator() };

生活的例子

另一种快速方法是使用getline。喜欢的东西:

stringstream ss("bla bla");
string s;

while (getline(ss, s, ' ')) {
 cout << s << endl;
}

如果需要,可以创建一个简单的split()方法,返回vector<string>,即 真的有用。

我为自己编写了一个https://stackoverflow.com/a/50247503/3976739的简化版本(可能有一点效率)。我希望这能有所帮助。

void StrTokenizer(string& source, const char* delimiter, vector<string>& Tokens)
{   
   size_t new_index = 0;
   size_t old_index = 0;

   while (new_index != std::string::npos)   
   {
      new_index = source.find(delimiter, old_index);
      Tokens.emplace_back(source.substr(old_index, new_index-old_index));

      if (new_index != std::string::npos)
          old_index = ++new_index;
   }
}

我只是看了所有的答案,找不到下一个前提条件的解决方案:

没有动态内存分配 不使用boost 不使用正则表达式 c++17标准

这就是我的解

#include <iomanip>
#include <iostream>
#include <iterator>
#include <string_view>
#include <utility>

struct split_by_spaces
{
    std::string_view      text;
    static constexpr char delim = ' ';

    struct iterator
    {
        const std::string_view& text;
        std::size_t             cur_pos;
        std::size_t             end_pos;

        std::string_view operator*() const
        {
            return { &text[cur_pos], end_pos - cur_pos };
        }
        bool operator==(const iterator& other) const
        {
            return cur_pos == other.cur_pos && end_pos == other.end_pos;
        }
        bool operator!=(const iterator& other) const
        {
            return !(*this == other);
        }
        iterator& operator++()
        {
            cur_pos = text.find_first_not_of(delim, end_pos);

            if (cur_pos == std::string_view::npos)
            {
                cur_pos = text.size();
                end_pos = cur_pos;
                return *this;
            }

            end_pos = text.find(delim, cur_pos);

            if (cur_pos == std::string_view::npos)
            {
                end_pos = text.size();
            }

            return *this;
        }
    };

    [[nodiscard]] iterator begin() const
    {
        auto start = text.find_first_not_of(delim);
        if (start == std::string_view::npos)
        {
            return iterator{ text, text.size(), text.size() };
        }
        auto end_word = text.find(delim, start);
        if (end_word == std::string_view::npos)
        {
            end_word = text.size();
        }
        return iterator{ text, start, end_word };
    }
    [[nodiscard]] iterator end() const
    {
        return iterator{ text, text.size(), text.size() };
    }
};

int main(int argc, char** argv)
{
    using namespace std::literals;
    auto str = " there should be no memory allocation during parsing"
               "  into words this line and you   should'n create any"
               "  contaner                  for intermediate words  "sv;

    auto comma = "";
    for (std::string_view word : split_by_spaces{ str })
    {
        std::cout << std::exchange(comma, ",") << std::quoted(word);
    }

    auto only_spaces = "                   "sv;
    for (std::string_view word : split_by_spaces{ only_spaces })
    {
        std::cout << "you will not see this line in output" << std::endl;
    }
}

你可以利用boost::make_find_iterator。类似于这个:

template<typename CH>
inline vector< basic_string<CH> > tokenize(
    const basic_string<CH> &Input,
    const basic_string<CH> &Delimiter,
    bool remove_empty_token
    ) {

    typedef typename basic_string<CH>::const_iterator string_iterator_t;
    typedef boost::find_iterator< string_iterator_t > string_find_iterator_t;

    vector< basic_string<CH> > Result;
    string_iterator_t it = Input.begin();
    string_iterator_t it_end = Input.end();
    for(string_find_iterator_t i = boost::make_find_iterator(Input, boost::first_finder(Delimiter, boost::is_equal()));
        i != string_find_iterator_t();
        ++i) {
        if(remove_empty_token){
            if(it != i->begin())
                Result.push_back(basic_string<CH>(it,i->begin()));
        }
        else
            Result.push_back(basic_string<CH>(it,i->begin()));
        it = i->end();
    }
    if(it != it_end)
        Result.push_back(basic_string<CH>(it,it_end));

    return Result;
}