What's the most elegant way to split a string in C++? The string can be assumed to be composed of words separated by whitespace.

(Note that I'm not interested in C string functions or that kind of character manipulation/access. Also, please give precedence to elegance over efficiency in your answer.)

The best solution I have right now is:

#include <iostream>
#include <sstream>
#include <string>
using namespace std;

int main()
{
    string s("Somewhere down the road");
    istringstream iss(s);

    do
    {
        string sub;
        iss >> sub;
        cout << "Substring: " << sub << endl;
    } while (iss);

    return 0;
}
+1  A: 

The STL does not have such a method available already.

However, you can either use C's strtok function by using the string.c_str() member, or you can write your own. Here is a code sample I found after a quick google search ("STL string split"):

void Tokenize(const string& str,
                      vector<string>& tokens,
                      const string& delimiters = " ")
{
    // Skip delimiters at beginning.
    string::size_type lastPos = str.find_first_not_of(delimiters, 0);
    // Find the first delimiter after the start of the token.
    string::size_type pos     = str.find_first_of(delimiters, lastPos);

    while (string::npos != pos || string::npos != lastPos)
    {
        // Found a token, add it to the vector.
        tokens.push_back(str.substr(lastPos, pos - lastPos));
        // Skip delimiters.  Note the "not_of"
        lastPos = str.find_first_not_of(delimiters, pos);
        // Find the next delimiter.
        pos = str.find_first_of(delimiters, lastPos);
    }
}

Taken from: http://oopweb.com/CPP/Documents/CPPHOWTO/Volume/C++Programming-HOWTO-7.html

If you have questions about the code sample, leave a comment and I will explain.

And just because it does not implement a typedef called iterator or overload the << operator does not mean it is bad code. I use the C functions quite frequently. For example, printf and scanf are both significantly faster than cin and cout, the fopen syntax is a lot more friendly for binary types, and they also tend to produce smaller EXEs.

Don't get sold on this "Elegance over performance" deal.

nlaq
I'm aware of the C string functions and I'm aware of the performance issues too (both of which I've noted in my question). However, for this specific question, I'm looking for an elegant C++ solution.
Ashwin
... and why don't you want to just build an OO wrapper over the C functions?
nlaq
@Nelson LaQuet: Let me guess: Because strtok is not reentrant?
paercebal
Why not use the C++ features that are meant for this job?
graham.reeds
@Nelson don't *ever* pass string.c_str() to strtok! strtok trashes the input string (it inserts '\0' chars to replace each found delimiter) and c_str() returns a non-modifiable string.
Evan Teran
char* ch = new char[str.size()]; strcpy(ch, str.c_str()); ... delete[] ch; // problem solved.
nlaq
@Nelson: That array needs to be of size str.size() + 1 in your last comment. But I agree with your thesis that it's silly to avoid C functions for "aesthetic" reasons.
j_random_hacker
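To make the two points from the comments above concrete (strtok writes into its input, and a copy needs size() + 1 characters), here is a minimal sketch. The helper name `split_with_strtok` is made up for illustration, and a std::vector<char> stands in for the new[]/delete[] pair so nothing has to be freed by hand. It is still not reentrant, since it relies on strtok.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper (not from the thread): tokenizes a private copy of
// `str`, so the caller's string is never modified by strtok.
std::vector<std::string> split_with_strtok(const std::string& str,
                                           const char* delimiters)
{
    // size() + 1 leaves room for the terminating '\0' that c_str() provides.
    std::vector<char> buffer(str.c_str(), str.c_str() + str.size() + 1);

    std::vector<std::string> tokens;
    // strtok overwrites each delimiter it finds in `buffer` with '\0'.
    for (char* tok = std::strtok(&buffer[0], delimiters);
         tok != NULL;
         tok = std::strtok(NULL, delimiters))
    {
        tokens.push_back(tok);
    }
    return tokens;
}
```

Calling it with "Somewhere down the road" and " " should leave the original string intact and return the four words.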
+17  A: 
string word;

istringstream iss(line, istringstream::in);

while( iss >> word )     
{

...

}

This is my favourite way to iterate through a string. You can do what you want per word.

gnomed
Is it possible to declare `word` as a `char`?
abatishchev
Sorry abatishchev, C++ is not my strong point. But I imagine it would not be difficult to add an inner loop to go through every character in each word. Right now I believe the loop depends on spaces for word separation. Unless you know that there is only a single character between every space, in which case you can just cast "word" to a char... Sorry I can't be of more help, I've been meaning to brush up on my C++.
gnomed
If you declare word as a char, it will iterate over every non-whitespace character. It's simple enough to try: `stringstream ss("Hello World, this is*@#"); char c; while(ss >> c) cout << c;`
Wayne Werner
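The char-extraction behaviour described in the last comment can be sketched as a small function. The name `non_whitespace_chars` is hypothetical; the point is only that operator>> skips whitespace before each char by default.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: collects every non-whitespace character of `text`.
std::string non_whitespace_chars(const std::string& text)
{
    std::istringstream ss(text);
    std::string result;
    char c;
    // With the default skipws flag, operator>> discards leading whitespace
    // before extracting each char.
    while (ss >> c)
        result += c;
    return result;
}
```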
+1  A: 

Using stringstream as you have works perfectly fine, and does exactly what you wanted. If you're just looking for a different way of doing things though, you can use find/find_first_of and substr.

#include <iostream>
#include <string>

int main()
{
    std::string s("Somewhere down the road");

    std::string::size_type prev_pos = 0, pos = 0;
    while( (pos = s.find(' ', pos)) != std::string::npos )
    {
        std::string substring( s.substr(prev_pos, pos-prev_pos) );

        std::cout << substring << '\n';

        prev_pos = ++pos;
    }
    std::string substring( s.substr(prev_pos, pos-prev_pos) ); // Last word
    std::cout << substring << '\n';
}
KTC
A: 

For a ridiculously large and probably redundant version, try a lot of for loops.

string stringlist[10];
int count = 0;

for (int i = 0; i < sequence.length(); i++)
{
 if (sequence[i] == ' ')
 {
  stringlist[count] = sequence.substr(0, i);
  sequence.erase(0, i+1);
  i = 0;
  count++;
 }
 else if (i == sequence.length()-1) // Last word
 {
  stringlist[count] = sequence.substr(0, i+1);
 }
}

It isn't pretty, but by and large (Barring punctuation and a slew of other bugs) it works!

Peter C.
I was tempted to +1 this answer for its simple, readable code (which I presume rubbed an elegantophile the wrong way, hence the -1), but then I saw that you allocated a fixed-size array of strings to hold the tokens. Come on, you *know* that's gonna break at the worst possible moment! :)
j_random_hacker
+3  A: 
Shadow2531
Not a perfect answer to his question, but that's exactly what I was looking for. Thanks!
Great answer, elegant code with precisely everything that's needed.
Ilya
+7  A: 

This is similar to this question.

#include <iostream>
#include <string>
#include <boost/foreach.hpp>
#include <boost/tokenizer.hpp>

using namespace std;
using namespace boost;

int main(int argc, char** argv)
{
   string text = "token  test\tstring";

   char_separator<char> sep(" \t");
   tokenizer<char_separator<char>> tokens(text, sep);
   BOOST_FOREACH(string t, tokens)
   {
      cout << t << "." << endl;
   }
}
Ferruccio
Thanks for pointing that out. I didn't know this operation was called tokenizing, so it never occurred to me to search for that term :-)
Ashwin
+50  A: 

I use this to split a string by a delimiter. The first version puts the results in an already constructed vector; the second returns a new vector.

std::vector<std::string> &split(const std::string &s, char delim, std::vector<std::string> &elems) {
    std::stringstream ss(s);
    std::string item;
    while(std::getline(ss, item, delim)) {
        elems.push_back(item);
    }
    return elems;
}


std::vector<std::string> split(const std::string &s, char delim) {
    std::vector<std::string> elems;
    return split(s, delim, elems);
}
Evan Teran
i really <3 that solution. one convenient and one fast-without-compromise :)
Johannes Schaub - litb
Works brilliantly! Don't forget to include `string`, `sstream` and `vector`.
Paul Lammertsma
<3 the snippet. thanks a lot.
huy
This hits the sweet spot for me - standard libraries, short, and lets me specify my delimiters. Thanks!
tfinniga
Elegant solution, I always forget about this particular "getline", though I do not believe it is aware of quotes and escape sequences.
boskom
+1 Short and crisp
Favonius
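For anyone pasting the getline-based split into a fresh file, here is a self-contained sketch of the returning variant with the headers it needs. The function body follows the answer; the comment about empty items is my own note on how std::getline treats adjacent delimiters.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Same approach as the answer above, with the required headers included.
std::vector<std::string> split(const std::string& s, char delim)
{
    std::vector<std::string> elems;
    std::stringstream ss(s);
    std::string item;
    // getline stops at each `delim`; two adjacent delimiters yield an
    // empty item, while a single trailing delimiter adds nothing.
    while (std::getline(ss, item, delim))
        elems.push_back(item);
    return elems;
}
```

So split("a,b,,c", ',') should give four items with an empty third one, and split("a,b,", ',') should give just two. As noted in the comments, quoting and escape sequences are not handled.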
+60  A: 

Since everybody is already using Boost:

#include <boost/algorithm/string.hpp>
std::vector<std::string> strs;
boost::split(strs, "string to split", boost::is_any_of("\t "));

I bet this is much faster than the stringstream solution. And since this is a generic template function it can be used to split other types of strings (wchar, etc. or UTF-8) using all kinds of delimiters.

See the documentation for details.

ididak
This is a good solution too! :-)
Ashwin
Speed is irrelevant here, as both of these cases are much slower than a strtok-like function.
Tom
This is practical and quick enough if you know the line will contain just a few tokens, but if it contains many then you will burn a ton of memory (and time) growing the vector. So no, it's not faster than the stringstream solution -- at least not for large n, which is the only case where speed matters.
j_random_hacker
And for those who don't already have boost... bcp copies over 1,000 files for this :)
romkyns
+56  A: 

FWIW, here's another way to extract tokens from an input string, relying only on Standard Library facilities. It's an example of the power and elegance behind the design of the STL.

#include <iostream>
#include <string>
#include <sstream>
#include <algorithm>
#include <iterator>

int main() {
    using namespace std;
    string sentence = "Something in the way she moves...";
    istringstream iss(sentence);
    copy(istream_iterator<string>(iss),
             istream_iterator<string>(),
             ostream_iterator<string>(cout, "\n"));
}

Instead of copying the extracted tokens to an output stream, one could insert them into a container, using the same generic copy algorithm.

vector<string> tokens;
copy(istream_iterator<string>(iss),
         istream_iterator<string>(),
         back_inserter<vector<string> >(tokens));
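As a variation on the container snippet, the same pair of iterators can be handed straight to the vector's range constructor. This is only a sketch; `words_of` is a hypothetical name, not part of the answer.

```cpp
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: returns the whitespace-separated words of `sentence`.
std::vector<std::string> words_of(const std::string& sentence)
{
    std::istringstream iss(sentence);
    // The range constructor reads words until the stream is exhausted.
    return std::vector<std::string>(std::istream_iterator<std::string>(iss),
                                    std::istream_iterator<std::string>());
}
```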

Best regards.

Zunino
Your solution doesn't even need Boost. Very cool! :-)
Ashwin
Is it possible to specify a delimiter for this? Like for instance splitting on commas?
l3dx
@l3dx: it seems that the parameter "\n" is the delimiter. This code is very nice, but I would like to know better about it. Maybe somebody could explain each line of that snippet?
Jonathan
@Jonathan: \n is not the delimiter in this case, it's the separator written between items when outputting to cout.
huy
So can you split on comma?
graham.reeds
Really nice code, but what about the delimiter? I guess this only works with whitespace.
wok
based on this: http://www.cplusplus.com/reference/algorithm/copy/ no. The whitespace behavior is a function of the `istream_iterator`. It would be more elegant to roll your own.
Wayne Werner
It doesn't work for me for some reason... it crashes while running.
Michael Sync
@graham.reeds, @l3dx: Please don't write another CSV parser which can't handle quoted fields: http://en.wikipedia.org/wiki/Comma-separated_values
Douglas
I wasn't planning on it. Never knew CSV had an RFC for it!
graham.reeds
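On the repeated "can it split on commas?" question: the whitespace skipping comes from the stream's locale, so one possible approach is to imbue the stream with a ctype facet that also classifies ',' as whitespace. The sketch below assumes that approach; `comma_is_space` and `split_on_commas` are made-up names. Adjacent commas collapse into one separator, and quoted fields are not handled, so this is not a CSV parser.

```cpp
#include <iterator>
#include <locale>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical facet: treats ',' as whitespace in addition to the usual
// whitespace characters of the classic "C" locale.
struct comma_is_space : std::ctype<char>
{
    comma_is_space() : std::ctype<char>(get_table()) {}

    static const mask* get_table()
    {
        // Start from the classic table and add ',' to the space class.
        static std::vector<mask> table(classic_table(),
                                       classic_table() + table_size);
        table[','] |= space;
        return &table[0];
    }
};

// Hypothetical helper: splits on commas (and ordinary whitespace).
std::vector<std::string> split_on_commas(const std::string& text)
{
    std::istringstream iss(text);
    // The locale takes ownership of the facet and deletes it when done.
    iss.imbue(std::locale(iss.getloc(), new comma_is_space));
    return std::vector<std::string>(std::istream_iterator<std::string>(iss),
                                    std::istream_iterator<std::string>());
}
```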
+5  A: 

For those with whom it does not sit well to sacrifice all efficiency for code size and see "efficient" as a type of elegance, the following should hit a sweet spot (and I think the template container class is an awesomely elegant addition.):

template < class ContainerT >
void tokenize(const std::string& str, ContainerT& tokens, const std::string& delimiters = " ", const bool trimEmpty = false)
{
   // Dependent type names need "typename" in standard C++.
   typedef typename ContainerT::value_type ValueType;
   typedef typename ValueType::size_type SizeType;

   std::string::size_type pos, lastPos = 0;
   while(true)
   {
      pos = str.find_first_of(delimiters, lastPos);
      if(pos == std::string::npos)
      {
         // No more delimiters: the final token runs to the end of the string.
         pos = str.length();

         if(pos != lastPos || !trimEmpty)
            tokens.push_back(ValueType(str.data()+lastPos, (SizeType)pos-lastPos ));

         break;
      }
      else
      {
         if(pos != lastPos || !trimEmpty)
            tokens.push_back(ValueType(str.data()+lastPos, (SizeType)pos-lastPos ));
      }

      lastPos = pos + 1;
   }
}

I usually choose to use std::vector<std::string> types as my second parameter (ContainerT)... but list<> is way faster than vector<> for when direct access is not needed, and you can even create your own string class and use something like std::list<SubString> where SubString does not do any copies for incredible speed increases.

It's more than twice as fast as the fastest tokenize on this page and almost 5 times faster than some others. Also, with the right parameter types you can eliminate all string and list copies.

Additionally it does not do the (extremely inefficient) return of result, but rather it passes the tokens as a reference, thus also allowing you to build up tokens using multiple calls if you so wished.

Lastly it allows you to specify whether to trim empty tokens from the results via a last optional parameter.

All it needs is std::string... the rest are optional. It does not use streams or the boost library, but is flexible enough to be able to accept some of these foreign types naturally.
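To illustrate the container parameter and the trimEmpty flag, here is a small sketch. It carries a condensed copy of the tokenize template (with `typename` spelled out) so that it compiles on its own; `split_fields` is a hypothetical wrapper added only for the demonstration.

```cpp
#include <list>
#include <string>

// Condensed copy of the answer's tokenize, so this sketch is self-contained.
template <class ContainerT>
void tokenize(const std::string& str, ContainerT& tokens,
              const std::string& delimiters = " ", bool trimEmpty = false)
{
    typedef typename ContainerT::value_type ValueType;
    std::string::size_type lastPos = 0;
    while (true)
    {
        std::string::size_type pos = str.find_first_of(delimiters, lastPos);
        const bool last = (pos == std::string::npos);
        if (last)
            pos = str.length();
        // Keep empty tokens unless the caller asked to trim them.
        if (pos != lastPos || !trimEmpty)
            tokens.push_back(ValueType(str.data() + lastPos, pos - lastPos));
        if (last)
            break;
        lastPos = pos + 1;
    }
}

// Hypothetical wrapper: splits comma-separated fields into a std::list.
std::list<std::string> split_fields(const std::string& line, bool trimEmpty)
{
    std::list<std::string> fields;
    tokenize(line, fields, ",", trimEmpty);
    return fields;
}
```

With the input "a,,b", trimEmpty = false should keep the empty middle field (three tokens), and trimEmpty = true should drop it (two tokens).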

Marius.

Marius
+2  A: 

In case anyone is interested, the minimalist version which relies upon getline, is the fastest on my test machine. (Boost based solution not tested !)

Surprise, surprise

Lesson learned, don't reinvent the wheel !

DamnedYankee
"don't reinvent the wheel !" - unless you're a wheel engineer. Also, never forget the "my wheel is better than yours" argument! ;-)
Johann Gerell
+1  A: 

Here's another way of doing it..

void split_string(string text,vector<string>& words)
{
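  string word;

  // Iterate by index up to size() instead of relying on a terminating '\0'.
  for (string::size_type i = 0; i < text.size(); ++i)
  {
    const char ch = text[i];
    // Cast to unsigned char: passing a negative char to isspace is undefined.
    if (isspace(static_cast<unsigned char>(ch)))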
    {
      if (!word.empty())
      {
        words.push_back(word);
      }
      word = "";
    }
    else
    {
      word += ch;
    }
  }
  if (!word.empty())
  {
    words.push_back(word);
  }
}
Usama S.
+2  A: 

Yet another flexible and fast way

template<typename Operator>
void tokenize(Operator& op, const char* input, const char* delimiters) {
  const char* s = input;
  const char* e = s;
  while (*e != 0) {
    e = s;
    while (*e != 0 && strchr(delimiters, *e) == 0) ++e;
    if (e - s > 0) {
      op(s, e - s);
    }
    s = e + 1;
  }
}

To use it with a vector of strings:

class Appender : public std::vector<std::string> {
public:
  void operator() (const char* s, unsigned length) { 
    this->push_back(std::string(s,length));
  }
};

Appender v;
tokenize(v, "A number of words to be tokenized", " \t");

That's it! And that's just one way to use the tokenizer, like how to just count words:

class WordCounter {
public:
  WordCounter() : noOfWords(0) {}
  void operator() (const char*, unsigned) {
    ++noOfWords;
  }
  unsigned noOfWords;
};

WordCounter wc;
tokenize(wc, "A number of words to be counted", " \t"); 
ASSERT( wc.noOfWords == 7 );

Limited by imagination ;)
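If a C++11 compiler is available (an assumption; the answer itself predates it), a lambda can play the role of the functor classes. The sketch repeats the tokenize template so that it stands alone, and `collect_tokens` is a hypothetical helper name.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Copy of the answer's tokenize, so this sketch is self-contained.
template <typename Operator>
void tokenize(Operator& op, const char* input, const char* delimiters)
{
    const char* s = input;
    const char* e = s;
    while (*e != 0) {
        e = s;
        while (*e != 0 && std::strchr(delimiters, *e) == 0) ++e;
        if (e - s > 0) op(s, e - s);
        s = e + 1;
    }
}

// Hypothetical helper: collects the tokens through a C++11 lambda.
std::vector<std::string> collect_tokens(const char* input,
                                        const char* delimiters)
{
    std::vector<std::string> out;
    // tokenize takes its callback by non-const reference, so the lambda
    // needs a name; a temporary would not bind to that parameter.
    auto append = [&out](const char* s, std::ptrdiff_t length) {
        out.push_back(std::string(s, static_cast<std::size_t>(length)));
    };
    tokenize(append, input, delimiters);
    return out;
}
```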

Robert
A: 

There is a function named strtok.

#include <cstring>   // strtok
#include <string>
#include <vector>
using namespace std;

vector<string> split(char* str,const char* delim)
{
    char* token = strtok(str,delim);

    vector<string> result;

    while(token != NULL)
    {
        result.push_back(token);
        token = strtok(NULL,delim);
    }
    return result;
}
TheMachineCharmer
`strtok` is from the C standard library, not C++. It is not safe to use in multithreaded programs. It modifies the input string.
Kevin Panko
@Kevin Panko: Thanks! Would you please explain why is it not safe to use in multi-threaded programs?
TheMachineCharmer
Because it stores the char pointer from the first call in a static variable, so that on the subsequent calls when NULL is passed, it remembers what pointer should be used. If a second thread calls `strtok` when another thread is still processing, this char pointer will be overwritten, and both threads will then have incorrect results. http://www.mkssoftware.com/docs/man3/strtok.3.asp
Kevin Panko
Thanks @Kevin Panko!! for the eye opener :)
TheMachineCharmer
as mentioned before strtok is unsafe and even in C strtok_r is recommended for use
systemsfault
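The hidden static pointer described above is also why two strtok loops cannot be interleaved, even in a single thread. Here is a sketch of the strtok_r alternative mentioned in the last comment, assuming a POSIX system where strtok_r is declared in <string.h>; `parse_pairs` is a hypothetical helper. Each loop keeps its own state pointer, so the inner tokenizer does not disturb the outer one.

```cpp
#include <string.h>   // strtok_r (POSIX; availability is an assumption)
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: parses "key=value;key=value" with two interleaved
// tokenizers, each carrying its own saved position.
std::vector<std::pair<std::string, std::string> >
parse_pairs(const std::string& text)
{
    // strtok_r modifies its input, so work on a writable copy.
    std::vector<char> buffer(text.c_str(), text.c_str() + text.size() + 1);
    std::vector<std::pair<std::string, std::string> > pairs;

    char* outerState = NULL;
    for (char* item = strtok_r(&buffer[0], ";", &outerState);
         item != NULL;
         item = strtok_r(NULL, ";", &outerState))
    {
        // Separate state for the inner split, so the outer loop can resume.
        char* innerState = NULL;
        char* key = strtok_r(item, "=", &innerState);
        char* value = strtok_r(NULL, "=", &innerState);
        if (key != NULL && value != NULL)
            pairs.push_back(std::make_pair(std::string(key),
                                           std::string(value)));
    }
    return pairs;
}
```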
A: 

See my answer here if you can use Qt.

ShaChris23
A: 

I use this simple function because our String class is "special" (i.e. not standard):

void splitString(const String &s, const String &delim, std::vector<String> &result) {
    const int l = delim.length();
    int f = 0;
    int i = s.indexOf(delim,f);
    while (i>=0) {
        String token( i-f > 0 ? s.substring(f,i-f) : "");
        result.push_back(token);
        f=i+l;
        i = s.indexOf(delim,f);
    }
    String token = s.substring(f);
    result.push_back(token);
}
Abe