ansaurus

Question

How do I tokenize a string in C++?

Answer 1

A:

There is no direct way to do this. Refer to this Code Project source code to see how to build a class for it.

Niyaz 2008-09-10 12:14:59

Answer 2

+6 A:

Here is a sample tokenizer class that might do what you want:

//Header file
#include <cstddef>
#include <string>

class Tokenizer
{
    public:
        static const std::string DELIMITERS;
        Tokenizer(const std::string& str);
        Tokenizer(const std::string& str, const std::string& delimiters);
        bool NextToken();
        bool NextToken(const std::string& delimiters);
        const std::string GetToken() const;
        void Reset();
    protected:
        size_t m_offset;
        const std::string m_string;
        std::string m_token;
        std::string m_delimiters;
};

//CPP file
const std::string Tokenizer::DELIMITERS(" \t\n\r");

Tokenizer::Tokenizer(const std::string& s) :
    m_string(s), 
    m_offset(0), 
    m_delimiters(DELIMITERS) {}

Tokenizer::Tokenizer(const std::string& s, const std::string& delimiters) :
    m_string(s), 
    m_offset(0), 
    m_delimiters(delimiters) {}

bool Tokenizer::NextToken() 
{
    return NextToken(m_delimiters);
}

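bool Tokenizer::NextToken(const std::string& delimiters)
{
    // Skip any leading delimiters; if nothing else is left, we are done.
    size_t i = m_string.find_first_not_of(delimiters, m_offset);
    if (std::string::npos == i)
    {
        m_offset = m_string.length();
        return false;
    }

    // The token runs up to the next delimiter, or to the end of the string.
    size_t j = m_string.find_first_of(delimiters, i);
    if (std::string::npos == j)
    {
        m_token = m_string.substr(i);
        m_offset = m_string.length();
        return true;
    }

    m_token = m_string.substr(i, j - i);
    m_offset = j;
    return true;
}

// The original post declares GetToken and Reset but does not define them;
// these minimal definitions are assumed from the declarations.
const std::string Tokenizer::GetToken() const
{
    return m_token;
}

void Tokenizer::Reset()
{
    m_offset = 0;
}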

Example:

std::vector <std::string> v;
Tokenizer s("split this string", " ");
while (s.NextToken())
{
    v.push_back(s.GetToken());
}

vzczc 2008-09-10 12:18:14

Answer 3

+29 A:

Your simple case can easily be built using the string::find method. However, take a look at Boost.Tokenizer. It's great. Boost generally has some very cool string tools.
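For reference, the find-based approach could look something like the sketch below. It is not from the original answer: the name split_with_find and the choice of a single-character delimiter are assumptions made for illustration, and empty tokens between adjacent delimiters are kept.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not from the original answer): splits `text` on a
// single-character delimiter using std::string::find.
// Empty tokens between adjacent delimiters are kept.
std::vector<std::string> split_with_find(const std::string& text, char delimiter)
{
    std::vector<std::string> tokens;
    std::string::size_type start = 0;
    std::string::size_type end = text.find(delimiter, start);

    while (end != std::string::npos)
    {
        tokens.push_back(text.substr(start, end - start));
        start = end + 1;                       // step past the delimiter
        end = text.find(delimiter, start);
    }

    // Whatever follows the last delimiter (possibly empty) is the final token.
    tokens.push_back(text.substr(start));
    return tokens;
}
```

Calling split_with_find("split this string", ' ') yields the three words; an input such as "a,,b" split on ',' yields three tokens, the middle one empty.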

Konrad Rudolph 2008-09-10 12:18:25

Answer 4

+3 A:

If you're willing to use C, you can use the strtok function. You should pay attention to multi-threading issues when using it.

On Freund 2008-09-10 12:23:33

Note that strtok modifies the string you're checking, so you can't use it on const char * strings without making a copy.

Graeme Perrow 2008-09-10 13:53:20

The multithreading issue is that strtok uses a global variable to keep track of where it is, so if you have two threads that each use strtok, you'll get undefined behavior.
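One way around the shared-state problem, assuming a POSIX platform, is strtok_r, which keeps the parse position in a pointer supplied by the caller. The sketch below is only an illustration: split_reentrant is a made-up name, and strtok_r is not part of standard C++ (MSVC, for example, offers strtok_s instead).

```cpp
#include <string.h>   // strtok_r is POSIX and declared in <string.h>
#include <string>
#include <vector>

// Hypothetical helper: tokenizes `text` without touching strtok's hidden state.
std::vector<std::string> split_reentrant(const std::string& text, const char* delimiters)
{
    // strtok_r writes into its input, so work on a private, null-terminated copy.
    std::vector<char> buffer(text.begin(), text.end());
    buffer.push_back('\0');

    std::vector<std::string> tokens;
    char* state = NULL;  // per-call parse position, owned by this function
    for (char* token = strtok_r(&buffer[0], delimiters, &state);
         token != NULL;
         token = strtok_r(NULL, delimiters, &state))
    {
        tokens.push_back(token);
    }
    return tokens;
}
```

Because each call owns its own buffer and state pointer, two threads can run this at the same time without interfering with each other.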

JohnMcG 2008-09-10 15:09:34

Answer 5

A:

Here's a real simple one:

#include <vector>
#include <string>
using namespace std;

vector<string> split(const char *str, char c = ' ')
{
    vector<string> result;

    while(1)
    {
        const char *begin = str;

        while(*str != c && *str)
            str++;

        result.push_back(string(begin, str));

        if(0 == *str++)
            break;
    }

    return result;
}

Adam Pierce 2008-09-10 12:30:06

Answer 6

A:

I thought that was what the >> operator on string streams was for:

string word << sin;

EDIT: oops! that should have been:

string word; sin >> word;
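A fuller sketch of the same idea, assuming sin is a std::istringstream built from the input (the answer does not say what sin is, and split_on_whitespace is a name invented here). Each sin >> word extraction skips leading whitespace and reads one word.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: collects whitespace-separated words using operator>>.
std::vector<std::string> split_on_whitespace(const std::string& text)
{
    std::istringstream sin(text);   // `sin` mirrors the stream name used in the answer
    std::vector<std::string> words;
    std::string word;
    while (sin >> word)             // stops when no further word can be read
    {
        words.push_back(word);
    }
    return words;
}
```

Runs of spaces, tabs and newlines all count as one separator here, so this only fits input that is delimited by whitespace.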

Daren Thomas 2008-09-10 12:43:56

My fault for giving a bad (too simple) example. As far as I know, that only works when your delimiter is whitespace.

Bill the Lizard 2008-11-25 18:24:35

Now that I've gotten around to using it, the syntax is sin >> word;

Bill the Lizard 2008-12-08 15:17:43

Answer 7

+17 A:

You can use streams, iterators, and the copy algorithm to do this fairly directly.

#include <string>
#include <vector>
#include <iostream>
#include <istream>
#include <ostream>
#include <iterator>
#include <sstream>
#include <algorithm>

int main()
{
  std::string str = "The quick brown fox";

  // construct a stream from the string
  std::stringstream strstr(str);

  // use stream iterators to copy the stream to the vector as whitespace separated strings
  std::istream_iterator<std::string> it(strstr);
  std::istream_iterator<std::string> end;
  std::vector<std::string> results(it, end);

  // send the vector to stdout, one token per line.
  std::ostream_iterator<std::string> oit(std::cout, "\n");
  std::copy(results.begin(), results.end(), oit);
}

KeithB 2008-09-10 12:46:14

I find those std:: irritating to read.. why not use "using" ?

2008-11-28 04:19:27

@pheze: sir, why don't you edit instead of complaining?

Vadi 2009-10-27 14:28:04

@Vadi: because editing someone else's post is quite intrusive. @pheze: I prefer to keep the `std::` prefix; this way I know where my objects come from. It's merely a matter of style.

Matthieu M. 2010-04-02 08:49:00

@KeithB I understand your reason and I think it's actually a good choice if it works for you, but from a pedagogical standpoint I actually agree with pheze. It's easier to read and understand a completely foreign example like this one with a "using namespace std" at the top because it requires less effort to interpret the following lines... especially in this case because everything is from the standard library. You can make it easy to read and obvious where the objects come from by a series of "using std::string;" etc. Especially since the function is so short.
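As a rough illustration of the using-declaration style described in this comment, the stream-iterator technique from the answer could be wrapped as below. The function name whitespace_tokens is hypothetical, and std::istringstream is used here in place of the answer's std::stringstream.

```cpp
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// One using-declaration per name keeps the origin of each type visible
// without repeating std:: on every line.
using std::istream_iterator;
using std::istringstream;
using std::string;
using std::vector;

// Hypothetical wrapper around the answer's stream-iterator technique.
vector<string> whitespace_tokens(const string& str)
{
    istringstream stream(str);
    istream_iterator<string> it(stream);   // reads one whitespace-separated word at a time
    istream_iterator<string> end;          // default-constructed: end-of-stream marker
    return vector<string>(it, end);
}
```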

cheshirekow 2010-07-16 11:27:21

P.S. Thanks... I used this snippet :)

cheshirekow 2010-07-16 11:28:13

Answer 8

+14 A:

Use strtok. In my opinion, there isn't a need to build a class around tokenizing unless strtok doesn't provide you with what you need. It might not, but in 15+ years of writing various parsing code in C and C++, I've always used strtok. Here is an example:

char myString[] = "The quick brown fox";
char *p = strtok(myString, " ");
while (p) {
    printf ("Token: %s\n", p);
    p = strtok(NULL, " ");
}

A few caveats (which might not suit your needs). The string is "destroyed" in the process, meaning that EOS characters are placed inline in the delimiter spots. Correct usage might require you to make a non-const version of the string. You can also change the list of delimiters mid-parse.
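To make the last two caveats concrete, here is a small sketch that copies a std::string into a writable buffer and switches delimiters between calls. The "key=v1,v2,..." input format and the name parse_key_values are made up for illustration, and the thread-safety caveat of strtok still applies.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: parses "key=v1,v2,..." into {key, v1, v2, ...}.
std::vector<std::string> parse_key_values(const std::string& line)
{
    // strtok overwrites delimiters with '\0', so give it a writable copy
    // instead of the caller's (const) string.
    std::vector<char> buffer(line.begin(), line.end());
    buffer.push_back('\0');

    std::vector<std::string> parts;
    char* p = std::strtok(&buffer[0], "=");   // the first token ends at '='
    while (p != NULL)
    {
        parts.push_back(p);
        p = std::strtok(NULL, ",");           // later tokens end at ','
    }
    return parts;
}
```

For an input like "colors=red,green,blue" this gives four entries: the key followed by the three values.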

In my own opinion, the above code is far simpler and easier to use than writing a separate class for it. To me, this is one of those functions that the language provides and it does it well and cleanly. It's simply a "C based" solution. It's appropriate, it's easy, and you don't have to write a lot of extra code :-)

Mark 2008-09-10 13:37:33

Not that I dislike C, however strtok is not thread-safe, and you need to be certain that the string you send it contains a null character to avoid a possible buffer overflow.

tloach 2010-05-10 13:18:33

There is strtok_r, but this was a C++ question.

Amigable Clark Kant 2010-10-06 09:14:03

Answer 9

+28 A:

The boost tokenizer class can make this sort of thing quite simple:

#include <iostream>
#include <string>
#include <boost/foreach.hpp>
#include <boost/tokenizer.hpp>

using namespace std;
using namespace boost;

int main(int argc, char** argv)
{
   string text = "token, test   string";

   char_separator<char> sep(", ");
   tokenizer<char_separator<char>> tokens(text, sep);
   BOOST_FOREACH(string t, tokens)
   {
      cout << t << "." << endl;
   }
}

Ferruccio 2008-09-11 02:10:33

Good stuff, I've recently utilized this. My Visual Studio compiler has an odd whinge until I use a whitespace to separate the two ">" characters before the tokens(text, sep) bit: (error C2947: expecting '>' to terminate template-argument-list, found '>>')

AndyUK 2010-10-01 15:57:54

Answer 10

+9 A:

Boost has a strong split function: boost::algorithm::split.

Raz 2008-09-12 17:20:23

Answer 11

A:

For simple stuff I just use the following:

unsigned TokenizeString(const std::string& i_source,
                        const std::string& i_seperators,
                        bool i_discard_empty_tokens,
                        std::vector<std::string>& o_tokens)
{
    // Positions use std::string::size_type: storing the result of
    // find_first_of in an unsigned can truncate npos where size_t is wider,
    // so the loop condition below would never see npos.
    std::string::size_type prev_pos = 0;
    std::string::size_type pos = 0;
    unsigned number_of_tokens = 0;
    o_tokens.clear();
    pos = i_source.find_first_of(i_seperators, pos);
    while (pos != std::string::npos)
    {
        std::string token = i_source.substr(prev_pos, pos - prev_pos);
        if (!i_discard_empty_tokens || token != "")
        {
            o_tokens.push_back(token);
            number_of_tokens++;
        }

        pos++;
        prev_pos = pos;
        pos = i_source.find_first_of(i_seperators, pos);
    }

    // Anything left after the last separator is the final token.
    if (prev_pos < i_source.length())
    {
        o_tokens.push_back(i_source.substr(prev_pos));
        number_of_tokens++;
    }

    return number_of_tokens;
}

Cowardly disclaimer: I write real-time data processing software where the data comes in through binary files, sockets, or some API call (I/O cards, cameras). I never use this function for something more complicated or time-critical than reading external configuration files on startup.

jilles de wit 2008-09-15 15:28:39

Answer 12

+1 A:

I was originally writing a response to Doug's question: C++ Strings Modifying and Extracting based on Separators (closed)

But since Martin York closed that question with a pointer over here... I'll just generalize my code.

No offense folks, but for such a simple problem, you are making things way too complicated. There are a lot of reasons to use BOOST. But for something this simple, it's like hitting a fly with a 20# sledge.

void
split( vector<string> & theStringVector,  /* Altered/returned value */
       const  string  & theString,
       const  string  & theDelimiter )
{
  UASSERT( theDelimiter.size(), >, 0 ); // My own ASSERT macro.

  size_t  start = 0, end = 0;

  while ( end != string::npos )
  {
    end = theString.find( theDelimiter, start );

      // If at end, use length=maxLength.  Else use length=end-start.
    theStringVector.push_back( theString.substr( start,
                   (end == string::npos) ? string::npos : end - start ) );

      // If at end, use start=maxSize.  Else use start=end+delimiter.
    start = (   ( end > (string::npos - theDelimiter.size()) )
              ?  string::npos  :  end + theDelimiter.size()    );
  }
}

E.g.: (For Doug's case.)

int
main()
{
  vector<string> v;

  split( v, "A:PEP:909:Inventory Item", ":" );

#define SHOW(I,X)   cout << "[" << (I) << "]\t " # X " = \"" << (X) << "\"" << endl

  for( unsigned int i = 0;  i < v.size();   i++ )
    SHOW( i, v[i] );
}

And yes, we could have split() return a new vector rather than passing one in. It's trivial to wrap & overload. But depending on what I'm doing, I often find it better to re-use pre-existing objects rather than always creating new ones. (Just as long as I don't forget to empty the vector in between!)

Reference: http://www.cplusplus.com/reference/string/string/

Mr.Ree 2008-11-28 02:55:51

+1: simplicity is a beautiful thing :)

rubenvb 2010-10-29 21:25:10

Thanks...........

Mr.Ree 2010-10-30 00:14:08

Answer 13

+8 A:

Another quick way is to use getline. Something like:

stringstream ss("bla bla");
string s;

while (getline(ss, s, ' ')) {
    cout << s << endl;
}

If you want, you can make a simple split() method returning a vector<string>, which is really useful.
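Such a split() might look roughly like this. It is a sketch rather than the answerer's own code, and the name split_with_getline is chosen here only to avoid clashing with the other split functions in this thread.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits `text` on a single delimiter character
// by reading delimiter-terminated pieces with std::getline.
std::vector<std::string> split_with_getline(const std::string& text, char delimiter)
{
    std::vector<std::string> tokens;
    std::stringstream ss(text);
    std::string item;
    while (std::getline(ss, item, delimiter))
    {
        tokens.push_back(item);   // empty items between adjacent delimiters are kept
    }
    return tokens;
}
```

Note that adjacent delimiters produce empty tokens, while a single trailing delimiter does not add an empty token at the end.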

2008-11-28 04:17:39

Answer 14

+1 A:

MFC/ATL has a very nice tokenizer. From MSDN:

CAtlString str( "%First Second#Third" );
CAtlString resToken;
int curPos= 0;

resToken= str.Tokenize("% #",curPos);
while (resToken != "")
{
   printf("Resulting token: %s\n", resToken);
   resToken= str.Tokenize("% #",curPos);
};

Output

Resulting token: First
Resulting token: Second
Resulting token: Third

Jim In Texas 2009-03-22 02:28:16

Answer 15

+3 A:

I know you asked for a C++ solution, but you might consider this helpful:

Qt

#include <QString>

...

QString str = "The quick brown fox"; 
QStringList results = str.split(" ");

The advantage over Boost in this example is that it's a direct one to one mapping to your post's code.

See more at Qt documentation

ShaChris23 2010-08-04 17:34:03
