I'm tokenizing with the following, but I'm unsure how to include the delimiters in the result.

void Tokenize(const string& str, vector<string>& tokens, const string& delimiters)
{
    // find_first_of/find_first_not_of return string::size_type;
    // an int cannot be compared reliably against string::npos.
    string::size_type startpos = 0;
    string::size_type pos = str.find_first_of(delimiters, startpos);

    while (string::npos != pos || string::npos != startpos)
    {
        // Token between startpos and the next delimiter (delimiter not included)
        tokens.push_back(str.substr(startpos, pos - startpos));

        startpos = str.find_first_not_of(delimiters, pos);
        pos = str.find_first_of(delimiters, startpos);
    }
}
A: 

I can't really follow your code. Could you post a working program?

Anyway, this is a simple tokenizer, without testing edge cases:

#include <iostream>
#include <string>
#include <vector>

using namespace std;

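// Splits text on the multi-character delimiter del; each token keeps the
// delimiter that follows it. Assumes del is not empty.
void tokenize(vector<string>& tokens, const string& text, const string& del)
{
    string::size_type startpos = 0,
                      currentpos = text.find(del, startpos);

    // A plain while loop also handles text that contains no delimiter at all.
    while (currentpos != string::npos)
    {
        tokens.push_back(text.substr(startpos, currentpos - startpos + del.size()));

        startpos = currentpos + del.size();
        currentpos = text.find(del, startpos);
    }

    // Whatever follows the last delimiter (skipped if nothing is left)
    if (startpos < text.size())
        tokens.push_back(text.substr(startpos));
}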

Example input, delimiter = $$:

Hello$$Stack$$Over$$$Flow$$$$!

Tokens:

Hello$$
Stack$$
Over$$
$Flow$$
$$
!

Note: I would never use a tokenizer I wrote without testing it! Please use boost::tokenizer!

AraK
+1 for the Boost.Tokenizer mention
Éric Malenfant
I edited my post to include all of the function. I see what you did, but the delimiters will be a string and each char in the string will be a delimiter, passed like so: " ,.!\n". So a comma, period, exclamation mark, and new line will be pushed into the vector as well, but not the space. This way I can join the vector back with a space between the items and rebuild the string.
Jeremiah
Comma, period, exclamation mark, and new line, plus the space, will be the delimiters. Sorry, I wanted to make that clear.
Jeremiah
Aha :) I think I misunderstood the question. I thought you wanted to include the delimiters in the tokens. Why don't you use boost::tokenizer? It does exactly what you want.
AraK
Can I get the tokenizer without the entire library?
Jeremiah
You could use boost::bcp to extract the required headers. It is not that simple but you could try.
AraK
A: 

It depends on whether you want the preceding delimiters, the following delimiters, or both, and what you want to do with strings at the beginning and end of the string that may not have delimiters before/after them.

I'm going to assume you want each word, with its preceding and following delimiters, but NOT any strings of delimiters by themselves (e.g. if there's a delimiter following the last string).

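#include <string>

template <class iter>
void tokenize(std::string const &str, std::string const &delims, iter out) {
    // size_type rather than int, so comparisons with npos are well defined
    std::string::size_type pos = 0;
    do {
        std::string::size_type beg_word = str.find_first_not_of(delims, pos);
        if (beg_word == std::string::npos)
            break;  // only delimiters remain
        std::string::size_type end_word = str.find_first_of(delims, beg_word);
        std::string::size_type beg_next_word = str.find_first_not_of(delims, end_word);
        // Preceding delimiters + word + following delimiters;
        // an over-long count is clamped to the end of str.
        *out++ = std::string(str, pos, beg_next_word - pos);
        pos = end_word;
    } while (pos != std::string::npos);
}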

For the moment, I've written it more like an STL algorithm, taking an iterator for its output instead of assuming it's always pushing onto a collection. Since it depends (for the moment) on the input being a string, it doesn't use iterators for the input.
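A usage sketch for the output-iterator design described above. It repeats the algorithm with std::string::size_type positions so that the block compiles on its own; the wrapper name tokenize_to_vector is made up for illustration, and std::back_inserter is what adapts a vector to the iterator parameter.

```cpp
#include <iterator>
#include <string>
#include <vector>

// Same algorithm as in the answer, with size_type positions.
template <class iter>
void tokenize(std::string const &str, std::string const &delims, iter out) {
    std::string::size_type pos = 0;
    do {
        std::string::size_type beg_word = str.find_first_not_of(delims, pos);
        if (beg_word == std::string::npos)
            break;  // nothing but delimiters left
        std::string::size_type end_word = str.find_first_of(delims, beg_word);
        std::string::size_type beg_next_word = str.find_first_not_of(delims, end_word);
        // An over-long count is clamped to the end of str.
        *out++ = std::string(str, pos, beg_next_word - pos);
        pos = end_word;
    } while (pos != std::string::npos);
}

// Hypothetical convenience wrapper: collect the tokens in a vector.
std::vector<std::string> tokenize_to_vector(std::string const &str,
                                            std::string const &delims) {
    std::vector<std::string> tokens;
    tokenize(str, delims, std::back_inserter(tokens));
    return tokens;
}
```

Called as tokenize_to_vector("abc,123 xyz", ";., "), this should give "abc,", ",123 " and " xyz": each word carries both its preceding and its following delimiters, so a delimiter between two words shows up in both neighbouring tokens.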

Jerry Coffin
I want the string "Test string, on the web.\nTest line one." to be tokenized like so. I want a space, a comma, a period, and \n to be delimiters. Teststring,ontheweb.\nTestlineone.
Jeremiah
Sorry, it didn't post correctly. After the word "delimiters" it was supposed to have each token on a new line.
Jeremiah
A: 

If the delimiters are characters and not strings, then you can use strtok.
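A minimal sketch of what that could look like from C++, assuming a hypothetical wrapper named strtok_split. std::strtok writes '\0' into the buffer it scans, so the sketch works on a private copy of the input. Note that strtok discards the delimiters, so on its own it does not keep them in the result, and it uses hidden static state, so it is not safe to interleave two tokenizations.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Hypothetical wrapper around std::strtok; every char in delims is a delimiter.
std::vector<std::string> strtok_split(const std::string& text,
                                      const std::string& delims)
{
    // strtok modifies its input, so tokenize a null-terminated copy.
    std::vector<char> buffer(text.begin(), text.end());
    buffer.push_back('\0');

    std::vector<std::string> tokens;
    for (char* tok = std::strtok(buffer.data(), delims.c_str());
         tok != nullptr;
         tok = std::strtok(nullptr, delims.c_str()))
    {
        tokens.push_back(tok);  // delimiters themselves are dropped
    }
    return tokens;
}
```

For "abc,123 xyz" with delimiters ";., " this should return "abc", "123" and "xyz".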

sean riley
huh? what's wrong with strtok?
sean riley
A: 

I know this is a little sloppy, but this is what I ended up with. I did not want to use boost since this is a school assignment and my instructor wanted me to use find_first_of to accomplish this.

Thanks for everyone's help.

vector<string> Tokenize(const string& strInput, const string& strDelims)
{
    vector<string> vS;
    string::size_type startpos = 0;

    while (startpos < strInput.size())
    {
        string::size_type pos = strInput.find_first_of(strDelims, startpos);

        // Text between the previous delimiter and this one (or the end of input)
        if (pos != startpos)
            vS.push_back(strInput.substr(startpos, pos - startpos));

        // No delimiter left: the tail was pushed above, so stop here
        // instead of calling substr(npos, 1), which would throw.
        if (pos == string::npos)
            break;

        // If the delimiter is a new line (\n), add it as the two characters \n
        if (strInput[pos] == '\n')
            vS.push_back("\\n");
        // else if the delimiter is not a space, add it as its own token
        else if (strInput[pos] != ' ')
            vS.push_back(string(1, strInput[pos]));

        startpos = pos + 1;
    }

    return vS;
}
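The comments earlier in the thread mention joining the vector back together with a space between the items. A minimal sketch of that step, with a made-up helper name JoinWithSpaces:

```cpp
#include <string>
#include <vector>

// Hypothetical helper: concatenate tokens with a single space between them.
std::string JoinWithSpaces(const std::vector<std::string>& tokens)
{
    std::string result;
    for (std::vector<std::string>::size_type i = 0; i < tokens.size(); ++i)
    {
        if (i != 0)
            result += ' ';  // separator goes between items, not after the last
        result += tokens[i];
    }
    return result;
}
```

Joining the tokens "Test", "string", ",", "on" this way gives "Test string , on", so the rebuilt text has a space before each punctuation token and is not byte-for-byte identical to the input.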
Jeremiah
+4  A: 

The String Toolkit Library (Strtk) has the following solution:

std::string str = "abc,123 xyz";
std::vector<std::string> token_list;
strtk::split(";., ",
             str,
             strtk::range_to_type_back_inserter(token_list),
             strtk::include_delimiters);

It should result in token_list having the following elements:

Token0 = "abc,"
Token1 = "123 "
Token2 = "xyz"

More examples can be found here.
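For comparison, a standard-library-only sketch that produces the same three tokens for this input. This is not Strtk's implementation; the function name split_keep_delims is invented here, and how it treats runs of consecutive delimiters (each one ends a token of its own) is a choice made for this sketch.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: every char in delims is a delimiter, and each token
// keeps the single delimiter character that terminates it.
std::vector<std::string> split_keep_delims(const std::string& text,
                                           const std::string& delims)
{
    std::vector<std::string> tokens;
    std::string::size_type start = 0;

    while (start < text.size())
    {
        std::string::size_type pos = text.find_first_of(delims, start);
        if (pos == std::string::npos)
        {
            tokens.push_back(text.substr(start));  // tail without a delimiter
            break;
        }
        // +1 so the delimiter at pos is part of the token
        tokens.push_back(text.substr(start, pos - start + 1));
        start = pos + 1;
    }
    return tokens;
}
```

With "abc,123 xyz" and delimiters ";., " it returns "abc,", "123 " and "xyz".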

Beh Tou Cheh