I have a string that I would like to tokenize. But the strtok() function requires my string to be a char*.

How can I do this quickly?

token = strtok(str.c_str(), " "); fails because c_str() returns a const char*, not a char*.

A: 

To start with, you may want to mention the language involved.

But given your syntax, whatever language you are talking about is very likely to already have a tokenizer built in for its standard String class. Use that.

Edit: I mentioned Split, but of course you'll have to be using managed C++ in Visual Studio for that to work. You can look around for standard library tokenizers as well if you need a more cross-platform solution.

Some posters do not seem to feel you need to state the language used in a question, but I say this: how is a user with a similar issue supposed to find the answer later if they are searching by language? You should all remember that Stack Overflow is not here to answer one user's question, but those of many users hereafter.

At least use the C++ tag (which I just realized I had the power to add thanks to the community wiki model here, and have done so).

Kendall Helmstetter Gelner
This is really clearly C/C++, for anyone who actually knows them. Perhaps, to start with, you should avoid pointlessly answering questions you know nothing about.
tgamblin
Regardless of the language, the point stands: you are better off using a language-native method of tokenizing strings than reverting to underlying C functions. The web has many C++ standard library tokenizers; use one of those. You'll appreciate this after using a few more languages.
Kendall Helmstetter Gelner
STL does not have a tokenizer. Boost does, but it is not the "C++ standard library."
Sherm Pendley
+3  A: 

Duplicate the string, tokenize it, then free it.

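char *dup = strdup(str.c_str());  // writable copy of the string's contents
token = strtok(dup, " ");
// ... use token here; it points into dup ...
free(dup);  // free only once token is no longer needed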
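
A fuller sketch of this approach, assuming a POSIX system where strdup() is available (it is not part of standard C++). The helper name tokenize_copy is hypothetical. Each token is copied into a std::string before the duplicate is freed, because the pointers strtok() returns point into that duplicate.

```cpp
#include <cstdlib>   // std::free
#include <cstring>   // std::strtok; strdup on POSIX systems
#include <string>
#include <vector>

// Hypothetical helper: tokenizes a private copy of `str` with strtok().
// Assumes strdup() is available (POSIX, not standard C++).
std::vector<std::string> tokenize_copy(const std::string& str, const char* delims) {
    std::vector<std::string> tokens;
    char* dup = strdup(str.c_str());   // writable copy; str itself is untouched
    if (dup == nullptr) {
        return tokens;                 // allocation failed
    }
    for (char* token = std::strtok(dup, delims); token != nullptr;
         token = std::strtok(nullptr, delims)) {
        tokens.push_back(token);       // copy the token while dup is still alive
    }
    std::free(dup);                    // only after every token has been copied
    return tokens;
}
```

Note that strtok() keeps internal state between calls, so this sketch should not be called from several threads at once.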
DocMax
Isn't the better question, why use strtok when the language in question has better native options?
Kendall Helmstetter Gelner
Not necessarily. If the context of the question surrounds maintaining a fragile codebase, then stepping away from the existing approach (notionally strtok in my example) is riskier than changing the approach. Without more context in the question, I prefer to answer what is asked.
DocMax
If the asker is a newbie, you should warn against doing free() before using token... :-)
PhiLho
I am dubious that using a more robust native tokenizer is ever less safe than inserting new code that calls a library that inserts nulls into the block of memory passed to it... that's why I did not think it a good idea to answer the question as asked.
Kendall Helmstetter Gelner
+1  A: 

I suppose the language is C, or C++...

strtok, IIRC, replaces separators with \0. That's why it cannot take a const string. To work around that "quickly", if the string isn't huge, you can just strdup() it, which is wise if you need to keep the string unaltered (as the const suggests).

On the other hand, you might want to use another tokenizer, perhaps hand rolled, less violent on the given argument.
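
A hand-rolled tokenizer of that kind might look like the following sketch. The name tokenize_const is hypothetical, and treating a run of delimiters as a single separator is an assumption chosen to mirror strtok(). The input string is only read, never modified.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: splits `str` on any character in `delims` without
// modifying `str`. Runs of delimiters count as one separator.
std::vector<std::string> tokenize_const(const std::string& str, const std::string& delims) {
    std::vector<std::string> tokens;
    std::string::size_type start = str.find_first_not_of(delims);
    while (start != std::string::npos) {
        std::string::size_type end = str.find_first_of(delims, start);
        // If end is npos, substr() takes everything up to the end of str.
        tokens.push_back(str.substr(start, end - start));
        start = str.find_first_not_of(delims, end);
    }
    return tokens;
}
```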

PhiLho
+5  A: 
  1. If boost is available on your system (I think it's standard on most Linux distros these days), it has a Tokenizer class you can use.

  2. If not, then a quick Google turns up a hand-rolled tokenizer for std::string that you can probably just copy and paste. It's very short.

  3. And, if you don't like either of those, then here's a split() function I wrote to make my life easier. It'll break a string into pieces using any of the chars in "delim" as separators. Pieces are appended to the "parts" vector:

    void split(const string& str, const string& delim, vector<string>& parts) {
      size_t start, end = 0;
      while (end < str.size()) {
        start = end;
        while (start < str.size() && (delim.find(str[start]) != string::npos)) {
          start++;  // skip leading delimiters
        }
        end = start;
        while (end < str.size() && (delim.find(str[end]) == string::npos)) {
          end++; // skip to end of word
        }
        if (end-start != 0) {  // just ignore zero-length strings.
          parts.push_back(string(str, start, end-start));
        }
      }
    }
    
tgamblin
I find it hilarious that you vote me down for pointing out that a native solution is superior, and then proceed not to answer the original question but to offer a native solution instead. Since your answer is the most comprehensive, I voted you up despite your acerbic personality.
Kendall Helmstetter Gelner
I actually didn't vote you down for suggesting that he use a library. I voted you down for saying "you should mention the language" and not providing a real solution when it's clear what the language is. But I'm just in a bad mood and you seem like a nice guy so I removed my downvote :-). Cheers.
tgamblin
Thank you. Believe me, I do not like to stray into answering questions about things I don't have much recent experience with, but I felt the need to point out that answering the question as asked was much less useful to the original poster than pointing him toward what he should be looking for.
Kendall Helmstetter Gelner
A: 

Assuming that by "string" you're talking about std::string in C++, you might have a look at the Tokenizer package in Boost.

Sherm Pendley
+4  A: 
#include <iostream>
#include <string>
#include <sstream>

int main()
{
    std::string myText("some-text-to-tokenize");
    std::istringstream iss(myText);
    std::string token;
    // Read up to each '-' delimiter; the loop ends when the stream is exhausted.
    while(getline(iss, token, '-'))
    {
        std::cout << token << std::endl;
    }
    return 0;
}

Or, as mentioned, use boost for more flexibility.
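
The same getline() loop can be wrapped in a small function that collects the tokens. This is only a sketch, and the name split_on is hypothetical. One difference from strtok() worth knowing: two adjacent delimiters produce an empty token here rather than being skipped.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits `text` on a single delimiter character using
// std::getline. Adjacent delimiters yield empty tokens.
std::vector<std::string> split_on(const std::string& text, char delim) {
    std::vector<std::string> tokens;
    std::istringstream iss(text);
    std::string token;
    while (std::getline(iss, token, delim)) {
        tokens.push_back(token);
    }
    return tokens;
}
```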

Chris Blackwell
Just in case someone decides to add a namespace prefix to the getline() call (or finds it in some docs): this is std::getline(), not istream::getline(). (The latter will not actually compile here.)
Linulin
A: 

EDIT: the const_cast is used only to demonstrate the effect of strtok() when applied to a pointer returned by string::c_str().

You should not use strtok() since it modifies the tokenized string which may lead to undesired, if not undefined, behaviour as the C string "belongs" to the string instance.

#include <cstring>
#include <string>
#include <iostream>

int main(int ac, char **av)
{
    std::string theString("hello world");
    std::cout << theString << " - " << theString.size() << std::endl;

    //--- this cast *only* to illustrate the effect of strtok() on std::string 
    char *token = strtok(const_cast<char  *>(theString.c_str()), " ");

    std::cout << theString << " - " << theString.size() << std::endl;

    return 0;
}

After the call to strtok(), the space has been "removed" from the string (overwritten with a non-printable NUL character), but the length remains unchanged.

>./a.out
hello world - 11
helloworld - 11

Therefore you have to resort to a native mechanism, duplication of the string, or a third-party library, as previously mentioned.
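
The same effect can be observed without the const_cast by running strtok() on a separate writable copy. The sketch below does that and returns the raw bytes of the copy after a single call; the helper name bytes_after_first_strtok is hypothetical.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: runs one strtok() call on a private copy of `text`
// and returns the bytes of that copy afterwards, so the overwritten
// separator is visible. `text` itself is never modified.
std::string bytes_after_first_strtok(const std::string& text, const char* delims) {
    std::vector<char> buffer(text.begin(), text.end());
    buffer.push_back('\0');                 // strtok() needs a terminated C string
    std::strtok(buffer.data(), delims);     // overwrites the first separator with '\0'
    return std::string(buffer.data(), text.size());  // keep embedded '\0' bytes
}
```

For the input "hello world" with " " as the delimiter, the returned string still has 11 characters, but the one at index 5 is '\0' instead of a space.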

philippe
Casting away the const does not help. It is const for a reason.
Martin York
@Martin York: Agreed. It's const for a reason - down voted.
Sherm Pendley
@Martin York, @Sherm Pendley: did you read the conclusion or only the code snippet? I edited my answer to clarify what I wanted to show here. Rgds.
philippe
@Philippe - Yes, I only read the code. A lot of people will do that, and go straight to the code and skip the explanation. Perhaps putting the explanation in the code, as a comment, would be a good idea? Anyhow, I removed my down vote.
Sherm Pendley
Right, I'll add a comment, thx
philippe
A: 

First off, I would say use the Boost tokenizer.
Alternatively, if your data is space-separated, the string stream library is very useful.

But both of the above have already been covered.
So as a third, C-like alternative, I propose copying the std::string into a buffer for modification.

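#include <cstring>
#include <string>
#include <vector>

int main()
{
    std::string   data("The data I want to tokenize");

    // Create a buffer of the correct length (+1 for the terminating '\0'):
    std::vector<char>  buffer(data.size()+1);

    // Copy the string into the buffer
    std::strcpy(&buffer[0], data.c_str());

    // Tokenize: each call returns the next token, or a null pointer at the end
    for (char* token = std::strtok(&buffer[0], " "); token; token = std::strtok(nullptr, " ")) {
        // use token here; it points into buffer
    }
}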
Martin York
A: 
Martin Dimitrov