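How do I iterate over the words of a string?

How do I iterate over the words of a string composed of words separated by whitespace?

Note that I'm not interested in C string functions or that kind of character manipulation/access. I prefer elegance over efficiency. My current solution: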

#include <iostream>
#include <sstream>
#include <string>

using namespace std;

int main() {
    string s = "Somewhere down the road";
    istringstream iss(s);

    do {
        string subs;
        iss >> subs;
        cout << "Substring: " << subs << endl;
    } while (iss);
}
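
One detail of the loop above: the stream state is only tested after the extraction and the print, so the last pass prints an extra, empty "Substring: " line. A minimal sketch that tests the extraction itself is shown here; the helper name words_of is made up for illustration, and collecting into a vector is just one way to use the words.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: collects the whitespace-separated words of s.
std::vector<std::string> words_of(const std::string &s) {
    std::istringstream iss(s);
    std::vector<std::string> words;
    std::string word;
    // operator>> skips leading whitespace; the loop ends when no word is left.
    while (iss >> word) {
        words.push_back(word);
    }
    return words;
}
```

Because the condition is the extraction, the body never sees a failed read.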

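Current answer

I have a very different approach from the other solutions. It offers a lot of value that the others lack, but of course it has its own downsides too. Here is a working implementation; the example puts <tag></tag> around words.

For a start, this problem can be solved with one loop, no additional memory, and by considering just four logical cases. Conceptually, we are interested in boundaries. Our code should reflect that: let's iterate through the string and look at two characters at a time, bearing in mind that the start and end of the string are special cases.

The downside is that we have to write the implementation, which is somewhat verbose, but mostly convenient boilerplate.

The upside is that we wrote the implementation, so it is easy to customize for specific needs, such as distinguishing left and right word boundaries, using any set of delimiters, or handling other cases such as non-boundary or erroneous positions.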

#include <cctype>
#include <iostream>
#include <string>

using namespace std;

typedef enum boundary_type_e {
    E_BOUNDARY_TYPE_ERROR = -1,
    E_BOUNDARY_TYPE_NONE,
    E_BOUNDARY_TYPE_LEFT,
    E_BOUNDARY_TYPE_RIGHT,
} boundary_type_t;

typedef struct boundary_s {
    boundary_type_t type;
    int pos;
} boundary_t;

bool is_delim_char(int c) {
    return isspace(c); // also compare against any other chars you want to use as delimiters
}

bool is_word_char(int c) {
    return ' ' <= c && c <= '~' && !is_delim_char(c);
}

boundary_t maybe_word_boundary(const string &str, int pos) {
    int len = static_cast<int>(str.length());
    // Brace initialization is used below because C-style compound literals
    // with designated initializers are not standard C++.
    if (pos < 0 || pos >= len) {
        return {E_BOUNDARY_TYPE_ERROR, -1}; // pos is unused for this type
    } else {
        if (pos == 0 && is_word_char(str[pos])) {
            // if the first character is word-y, we have a left boundary at the beginning
            return {E_BOUNDARY_TYPE_LEFT, pos};
        } else if (pos == len - 1 && is_word_char(str[pos])) {
            // if the last character is word-y, we have a right boundary left of the null terminator
            return {E_BOUNDARY_TYPE_RIGHT, pos + 1};
        } else if (!is_word_char(str[pos]) && is_word_char(str[pos + 1])) {
            // if we have a delimiter followed by a word char, we have a left boundary left of the word char
            // (str[len] is valid on a const string and yields '\0', which is not a word char)
            return {E_BOUNDARY_TYPE_LEFT, pos + 1};
        } else if (is_word_char(str[pos]) && !is_word_char(str[pos + 1])) {
            // if we have a word char followed by a delimiter, we have a right boundary right of the word char
            return {E_BOUNDARY_TYPE_RIGHT, pos + 1};
        }
        return {E_BOUNDARY_TYPE_NONE, -1}; // pos is unused for this type
    }
}

int main() {
    string str;
    getline(cin, str);

    int len = str.length();
    for (int i = 0; i < len; i++) {
        boundary_t boundary = maybe_word_boundary(str, i);
        if (boundary.type == E_BOUNDARY_TYPE_LEFT) {
            // whatever
        } else if (boundary.type == E_BOUNDARY_TYPE_RIGHT) {
            // whatever
        }
    }
}
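
The two "// whatever" branches are where the tagging described at the top of this answer would go. As a rough sketch of that idea, and not the author's function: the version below tracks an in-word flag instead of calling maybe_word_boundary, and the names wrap_words and is_word_char_sketch are made up for illustration.

```cpp
#include <string>

// Hypothetical predicate: printable ASCII that is not a space counts as a word char.
static bool is_word_char_sketch(char c) {
    unsigned char u = static_cast<unsigned char>(c);
    return u > ' ' && u <= '~';
}

// Emits <tag> at every left boundary and </tag> at every right boundary.
std::string wrap_words(const std::string &str) {
    std::string out;
    bool in_word = false;
    for (char c : str) {
        bool word = is_word_char_sketch(c);
        if (word && !in_word) out += "<tag>";   // left boundary
        if (!word && in_word) out += "</tag>";  // right boundary
        out += c;
        in_word = word;
    }
    if (in_word) out += "</tag>";               // right boundary at end of string
    return out;
}
```

Unlike the one-boundary-per-position loop above, this variant can close a word and open the next one on consecutive characters, which matters for one-letter words.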

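As you can see, the code is simple to understand and fine-tune, and the actual usage is short. Using C++ should not stop us from writing the simplest and most readily customizable code possible, even if that means not using the STL. I think this is an example of what Linus Torvalds might call "taste": we have eliminated all the logic we don't need, while writing in a style that naturally allows more cases to be handled if and when the need arises.

What could improve this code is using enum class, and having maybe_word_boundary accept a function pointer for the word-character test instead of invoking is_word_char directly, so that a lambda can be passed.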
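
A minimal sketch of what those two improvements might look like, under assumptions: the names BoundaryType, Boundary and word_boundaries are hypothetical, and the function returns all boundaries at once rather than one per call as in the code above.

```cpp
#include <cstddef>
#include <string>
#include <vector>

enum class BoundaryType { Left, Right };

struct Boundary {
    BoundaryType type;
    std::size_t pos;
};

// is_word is supplied by the caller; a captureless lambda converts to this pointer type.
std::vector<Boundary> word_boundaries(const std::string &str, bool (*is_word)(char)) {
    std::vector<Boundary> out;
    bool in_word = false;
    for (std::size_t i = 0; i < str.size(); ++i) {
        bool word = is_word(str[i]);
        if (word && !in_word) out.push_back({BoundaryType::Left, i});
        if (!word && in_word) out.push_back({BoundaryType::Right, i});
        in_word = word;
    }
    if (in_word) out.push_back({BoundaryType::Right, str.size()});
    return out;
}
```

A call could then look like word_boundaries(text, [](char c) { return c != ' ' && c != ','; }), which keeps the delimiter set at the call site.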

2019-01-16 15:14:15

Other answers

Here is a simple solution that uses only the standard regex library:

#include <regex>
#include <string>
#include <vector>

std::vector<std::string> Tokenize( const std::string str, const std::regex regex )
{
    using namespace std;

    std::vector<string> result;

    sregex_token_iterator it( str.begin(), str.end(), regex, -1 );
    sregex_token_iterator reg_end;

    for ( ; it != reg_end; ++it ) {
        if ( !it->str().empty() ) //token could be empty:check
            result.emplace_back( it->str() );
    }

    return result;
}

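The regex argument lets you split on several kinds of delimiter (spaces, commas, etc.).

I usually only split on spaces and commas, so I also have this default function: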

std::vector<std::string> TokenizeDefault( const std::string str )
{
    using namespace std;

    regex re( "[\\s,]+" );

    return Tokenize( str, re );
}

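"[\\s,]+" matches whitespace (\\s) and commas (,).

Note that if you want to split a wstring instead of a string:

- change all std::regex to std::wregex
- change all sregex_token_iterator to wsregex_token_iterator

Note that, depending on your compiler, you might also want to take the string argument by reference.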
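
Applying both substitutions gives something like the following sketch. The name TokenizeWide is made up here, and the arguments are taken by const reference as the last note suggests.

```cpp
#include <regex>
#include <string>
#include <vector>

// Wide-character variant of Tokenize: same logic, wide types throughout.
std::vector<std::wstring> TokenizeWide(const std::wstring &str, const std::wregex &regex) {
    std::vector<std::wstring> result;

    // -1 asks for the text between matches rather than the matches themselves.
    std::wsregex_token_iterator it(str.begin(), str.end(), regex, -1);
    std::wsregex_token_iterator reg_end;

    for (; it != reg_end; ++it) {
        if (!it->str().empty()) // a leading delimiter produces an empty first token
            result.emplace_back(it->str());
    }
    return result;
}
```

The pattern literal needs the L prefix as well, for example std::wregex(L"[\\s,]+").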

2014-05-06 05:49:21

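I use this to split a string by a delimiter. The first overload puts the results into a pre-constructed container, the second returns a new vector.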

#include <string>
#include <sstream>
#include <vector>
#include <iterator>

template <typename Out>
void split(const std::string &s, char delim, Out result) {
    std::istringstream iss(s);
    std::string item;
    while (std::getline(iss, item, delim)) {
        *result++ = item;
    }
}

std::vector<std::string> split(const std::string &s, char delim) {
    std::vector<std::string> elems;
    split(s, delim, std::back_inserter(elems));
    return elems;
}

Note that this solution does not skip empty tokens, so the following will find 4 items, one of which is empty:

std::vector<std::string> x = split("one:two::three", ':');
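
If empty tokens are not wanted, one option is to filter them while reading. This is a sketch under that assumption, not part of the answer's code, and the name split_skip_empty is hypothetical.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Same getline-based approach, but tokens of length zero are dropped.
std::vector<std::string> split_skip_empty(const std::string &s, char delim) {
    std::vector<std::string> elems;
    std::istringstream iss(s);
    std::string item;
    while (std::getline(iss, item, delim)) {
        if (!item.empty()) {
            elems.push_back(item);
        }
    }
    return elems;
}
```

With the same input, "one:two::three", this yields three items instead of four.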

2008-10-25 18:21:27

Here is my version, based on Kev's source:

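#include <string>
#include <vector>

using namespace std;

void split(vector<string> &result, const string &str, char delim) {
  string tmp;
  result.clear();

  for (string::const_iterator i = str.begin(); i != str.end(); ++i) {
    if (*i != delim) {
      tmp += *i;
    } else {
      result.push_back(tmp);
      tmp.clear();
    }
  }
  // Flush the final token here; the loop must stop at end() because
  // dereferencing the end iterator is undefined behavior.
  result.push_back(tmp);
}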

After that, call the function and do something with the result:

vector<string> hosts;
split(hosts, "192.168.1.2,192.168.1.3", ',');
for( size_t i = 0; i < hosts.size(); i++){
  cout <<  "Connecting host : " << hosts.at(i) << "..." << endl;
}

2011-07-28 12:38:50

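The STL does not have such a method available already.

However, you can either use C's strtok() function via the std::string::c_str() member, or write your own. Here is a code sample I found after a quick Google search ("STL string split"):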

void Tokenize(const string& str,
              vector<string>& tokens,
              const string& delimiters = " ")
{
    // Skip delimiters at beginning.
    string::size_type lastPos = str.find_first_not_of(delimiters, 0);
    // Find the first delimiter after it (the end of the first token).
    string::size_type pos     = str.find_first_of(delimiters, lastPos);

    while (string::npos != pos || string::npos != lastPos)
    {
        // Found a token, add it to the vector.
        tokens.push_back(str.substr(lastPos, pos - lastPos));
        // Skip delimiters.  Note the "not_of"
        lastPos = str.find_first_not_of(delimiters, pos);
        // Find the next delimiter (the end of the next token).
        pos = str.find_first_of(delimiters, lastPos);
    }
}

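Taken from: http://oopweb.com/CPP/Documents/CPPHOWTO/Volume/C++Programming-HOWTO-7.html

If you have questions about the code sample, leave a comment and I will explain.

Just because it does not implement a typedef called iterator or overload the << operator does not mean it is bad code. I use C functions quite frequently. For example, printf and scanf are both faster than std::cin and std::cout (significantly), the fopen syntax is a lot friendlier for binary types, and they also tend to produce smaller EXEs.

Don't get sold on this "elegance over performance" deal.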
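
For the strtok() route mentioned above, one caveat is that strtok writes into the buffer it scans, so the const pointer from c_str() cannot be passed to it directly. A minimal sketch that works on a writable copy follows; the name strtok_split is made up for illustration.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Splits str on any character in delims using strtok on a private copy.
std::vector<std::string> strtok_split(const std::string &str, const char *delims) {
    // Copy the characters plus the terminating '\0' into a writable buffer.
    std::vector<char> buf(str.c_str(), str.c_str() + str.size() + 1);
    std::vector<std::string> tokens;
    for (char *tok = std::strtok(buf.data(), delims); tok != nullptr;
         tok = std::strtok(nullptr, delims)) {
        tokens.emplace_back(tok);
    }
    return tokens;
}
```

strtok keeps hidden state between calls, so this sketch should not be interleaved with other strtok users or called from several threads at once.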

2008-10-25 09:08:17

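Here is a split function that:

- is generic
- uses standard C++ (no Boost)
- accepts multiple delimiters
- ignores empty tokens (can easily be changed)

template<typename T>
vector<T> split(const T & str, const T & delimiters) {
    vector<T> v;
    typename T::size_type start = 0;
    auto pos = str.find_first_of(delimiters, start);
    while (pos != T::npos) {
        if (pos != start) // ignore empty tokens
            v.emplace_back(str, start, pos - start);
        start = pos + 1;
        pos = str.find_first_of(delimiters, start);
    }
    if (start < str.length()) // ignore trailing delimiter
        v.emplace_back(str, start, str.length() - start); // add what's left of the string
    return v;
}

Example usage: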

    vector<string> v = split<string>("Hello, there; World", ";,");
    vector<wstring> v = split<wstring>(L"Hello, there; World", L";,");

2012-03-13 00:09:42
