What is a good open source C word tokenizer library?

I am looking for something like

Tokenize("there are three apples. One is orange, the other is blue,"
         " and, finally, the last is yellow!")

with the output not containing any punctuation.

A: 

I'd recommend strtok, which is available in string.h.
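
A minimal sketch of what that could look like, assuming the words are separated by spaces and a handful of common punctuation marks. The name `tokenize_words` and the delimiter set are made up for illustration. Note that `strtok()` writes NUL bytes into the buffer it scans, so the text has to live in writable memory (a `char[]` array, not a string literal).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: splits `text` in place and stores pointers to at
 * most `max_tokens` words in `tokens`. Returns the number of words stored.
 * The delimiter set is an assumption; extend it as needed. */
size_t tokenize_words(char *text, char *tokens[], size_t max_tokens)
{
    static const char delims[] = " \t\n.,;:!?";
    size_t count = 0;

    assert(text != NULL && tokens != NULL);

    /* strtok() overwrites each delimiter it finds with '\0', so every
     * returned pointer is a NUL-terminated word inside `text`. */
    for (char *tok = strtok(text, delims);
         tok != NULL && count < max_tokens;
         tok = strtok(NULL, delims)) {
        tokens[count++] = tok;
    }
    return count;
}
```

To use it, declare the sentence as `char text[] = "...";` and a `char *tokens[32];`, call `tokenize_words(text, tokens, 32)`, then print each `tokens[i]`. Keep in mind that `strtok()` holds hidden static state, so it is not reentrant or thread-safe.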

Peter
On the given string - a string literal that is entitled to be placed in read-only memory - using `strtok()` is likely to give you a core dump or its moral equivalent.
Jonathan Leffler
@Jonathan, if you have a string constant, strdup() it first. That's not a big task.
gnud
A: 

lex/flex is the classic tool, but it may be somewhat heavyweight for what you're doing.

Paul Nathan
+1  A: 

If the only need is to strip the punctuation, I'd use a for loop that outputs (whatever that means in your context) the source string character by character, skipping the characters for which ispunct() returns true.
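
A rough sketch of that loop, assuming "output" means copying into a caller-supplied buffer. The name `strip_punct` and its signature are invented for this example. Spaces are kept here, so the words stay separated.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: copies `src` into `dst`, dropping every character
 * for which ispunct() is true. Writes at most dst_size - 1 characters
 * plus a terminating '\0' and returns the number of characters written. */
size_t strip_punct(const char *src, char *dst, size_t dst_size)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL && dst_size > 0);

    for (; *src != '\0' && n + 1 < dst_size; src++) {
        /* The cast avoids undefined behavior when plain char is signed
         * and the byte value is negative. */
        if (!ispunct((unsigned char)*src))
            dst[n++] = *src;
    }
    dst[n] = '\0';
    return n;
}
```

With this, `strip_punct("blue, and, finally!", out, sizeof out)` leaves `blue and finally` in `out`. Since `src` is only read, it is safe to pass a string literal here.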

ntd
Also skipping `isspace()` ones...
Jonathan Leffler