views:

261

answers:

6

This should be an ideal case of not re-inventing the wheel, but so far my search has been in vain.

Instead of writing one myself, I would like to use an existing C++ tokenizer. The tokens are to be used in an index for full text searching. Performance is very important, I will parse many gigabytes of text.

Edit: Please note that the tokens are to be used in a search index. Creating such tokens is not an exact science (afaik) and requires some heuristics. This has been done a thousand time before, and probably in a thousand different ways, but I can't even find one of them :)

Any good pointers?

Thanks!

A: 

If performance is a main issue you should probably stick to good old strtok which is sure to be fast:

/* strtok example */
#include <stdio.h>
#include <string.h>

int main ()
{
  char str[] ="- This, a sample string.";
  char * pch;
  printf ("Splitting string \"%s\" into tokens:\n",str);
  pch = strtok (str," ,.-");
  while (pch != NULL)
  {
    printf ("%s\n",pch);
    pch = strtok (NULL, " ,.-");
  }
  return 0;
}
clyfe
strtok is **not** a tokenizer. You still have to figure out the difference between a `class` or a `const` or an identifier that happens to be named something like `calculate`.
T.E.D.
A tokenizer *identifies* the tokens and afterwords a *lexical anlizer* categorizes them into tokens ( ie. phrase "joe eats" -> tokenizer -> {joe, eats} -> lexical analizer -> { (joe, noun), (eats, verb) } ). Tokenization is the process of *demarcating* and **possibly** classifying sections of a string of input characters. In the classifying sence neither the boost tokenizer does the classification.
clyfe
http://stackoverflow.com/questions/380455/looking-for-a-clear-definition-of-what-a-tokenizer-parser-and-lexers-are-a
clyfe
Using `C` for "performance" while in `C++` is usually a case of premature optimization... I don't say streams are not a bit slower though ;)
Matthieu M.
@clyfe: Hmmm. Back in my day they used to be synonomous. I see from wikipedia that is no longer the case. Damn kids, killing my grass and changing my language...
T.E.D.
A: 

I might look into std::stringstream from <sstream>. C-style strtok has a number of usability problems, and C-style strings are just troublesome.

Here's an ultra-trivial example of it tokenizing a sentence into words:

#include <sstream>
#include <iostream>
#include <string>

int main(void) 
{
   std::stringstream sentence("This is a sentence with a bunch of words"); 
   while (sentence)
   {
      std::string word;  
      sentence >> word;  
      std::cout << "Got token: " << word << std::endl;
   }
}

janks@phoenix:/tmp$ g++ tokenize.cc && ./a.out
Got token: This
Got token: is
Got token: a
Got token: sentence
Got token: with
Got token: a
Got token: bunch
Got token: of
Got token: words
Got token:

The std::stringstream class is "bi-directional", in that it supports input and output. You'd probably want to do just one or the other, so you'd use std::istringstream or std::ostringstream.

The beauty of them is that they are also std::istream and std::ostreams respectively, so you can use them as you'd use std::cin or std::cout, which are hopefully familiar to you.

Some might argue these classes are expensive to use; std::strstream from <strstream> is mostly the same thing, but built on top of cheaper C-style 0-terminated strings. It might be faster for you. But anyway, I wouldn't worry about performance right away. Get something working, and then benchmark it. Chances are you can get enough speed by simply writing well-written C++ that minimizes unnecessary object creation and destruction. If it's still not fast enough, then you can look elsewhere. These classes are probably fast enough, though. Your CPU can waste thousands of cycles in the amount of time it takes to read a block of data from a hard disk or network.

janks
This a aproach does a bad job on punctuation: "This is: a sentence, with a bunch of words" -> ("This" "is:" "a" "sentence," "with" "a" "bunch" "of" "words"), although i believe it can be overcomed ... also tokenizes only on whitespace: http://codepad.org/m69UhzKN
clyfe
Obviously, hence the "ultra-trivial" comment. Of course there are myriad member functions of `std::istream` that will allow you to customize the tokenization to, for example, use different delimiters, etc. I'm not suggesting that the tokenization should literally be built on top of operator>>, that was just for the trivial example.
janks
A: 

You can use the Ragel State Machine Compiler to create a tokenizer (or a lexical analyzer).

The generated code has no external dependencies.

I suggest you look at the clang.rl sample for a relevant example of the syntax and usage.

Hasturkun
raegel is a full blown parser generator (albeit a fast one), i think it's to much for tokenization (index creation) needs (or even more, completely useless)
clyfe
@clyfe: I don't think so, really... Ragel's own tokenizer is in fact written in ragel, and the output code is very lightweight.
Hasturkun
A: 

Well, I would begin by searching Boost and... hop: Boost.Tokenizer

The nice thing ? By default it breaks on white spaces and punctuation because it's meant for text, so you won't forget a symbol.

From the introduction:

#include<iostream>
#include<boost/tokenizer.hpp>
#include<string>

int main(){
   using namespace std;
   using namespace boost;
   string s = "This is,  a test";
   tokenizer<> tok(s);
   for(tokenizer<>::iterator beg=tok.begin(); beg!=tok.end();++beg){
       cout << *beg << "\n";
   }
}

// prints
This
is
a
test

// notes how the ',' and ' ' were nicely removed

And there are additional features:

  • it can escape characters
  • it is compatible with Iterators so you can use it with an istream directly... and thus with an ifstream

and a few options (like keeping empty tokens etc...)

Check it out!

Matthieu M.
A: 

A regular expression library might work well if your tokens aren't too difficult to parse.

Jay
+2  A: 

I wrote my own tokenizer as part of the open-source SWISH++ indexing and search engine.

There's also the the ICU tokenizer that handles Unicode.

Paul J. Lucas