C++ tokenize a string using a regular expression

Anonymous picture Anonymous · Jun 14, 2009 · Viewed 23.5k times · Source

I'm trying to learn myself some C++ from scratch at the moment.
I'm well-versed in python, perl, javascript but have only encountered C++ briefly, in a classroom setting in the past. Please excuse the naivete of my question.

I would like to split a string using a regular expression but have not had much luck finding a clear, definitive, efficient and complete example of how to do this in C++.

In perl this is action is common, and thus can be accomplished in a trivial manner,

/home/me$ cat test.txt
this is  aXstringYwith, some problems
and anotherXY line with   similar issues

/home/me$ cat test.txt | perl -e'
> while(<>){
>   my @toks = split(/[\sXY,]+/);
>   print join(" ",@toks)."\n";
> }'
this is a string with some problems
and another line with similar issues

I'd like to know how best to accomplish the equivalent in C++.

EDIT:
I think I found what I was looking for in the boost library, as mentioned below.

boost regex-token-iterator (why don't underscores work?)

I guess I didn't know what to search for.


#include <iostream>
#include <boost/regex.hpp>

using namespace std;

int main(int argc)
{
  string s;
  do{
    if(argc == 1)
      {
        cout << "Enter text to split (or \"quit\" to exit): ";
        getline(cin, s);
        if(s == "quit") break;
      }
    else
      s = "This is a string of tokens";

    boost::regex re("\\s+");
    boost::sregex_token_iterator i(s.begin(), s.end(), re, -1);
    boost::sregex_token_iterator j;

    unsigned count = 0;
    while(i != j)
      {
        cout << *i++ << endl;
        count++;
      }
    cout << "There were " << count << " tokens found." << endl;

  }while(argc == 1);
  return 0;
}

Answer

sth picture sth · Jun 14, 2009

The boost libraries are usually a good choice, in this case Boost.Regex. There even is an example for splitting a string into tokens that already does what you want. Basically it comes down to something like this:

boost::regex re("[\\sXY]+");
std::string s;

while (std::getline(std::cin, s)) {
  boost::sregex_token_iterator i(s.begin(), s.end(), re, -1);
  boost::sregex_token_iterator j;
  while (i != j) {
     std::cout << *i++ << " ";
  }
  std::cout << std::endl;
}