views:

58

answers:

4

I've written a small and simple tokenizer, but without using regular expressions.

It starts at the first index, iterates through every character until the end, and creates the required tokens.

I showed it to a colleague, who said it would have been much simpler to do /that/ with regex, without going into any depth.

So should I rewrite it and expect it to be "better"?
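
For reference, the approach is roughly the following (a simplified Python sketch, not my actual code; the token kinds are just examples):

    def tokenize(text):
        tokens = []
        i = 0
        while i < len(text):
            ch = text[i]
            if ch.isspace():                 # skip whitespace
                i += 1
            elif ch.isdigit():               # collect a run of digits as a NUMBER
                start = i
                while i < len(text) and text[i].isdigit():
                    i += 1
                tokens.append(("NUMBER", text[start:i]))
            elif ch.isalpha():               # collect a run of letters as an IDENT
                start = i
                while i < len(text) and text[i].isalpha():
                    i += 1
                tokens.append(("IDENT", text[start:i]))
            else:                            # anything else is a single-character SYMBOL
                tokens.append(("SYMBOL", ch))
                i += 1
        return tokens

    print(tokenize("foo + 42"))
    # [('IDENT', 'foo'), ('SYMBOL', '+'), ('NUMBER', '42')]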

+1  A: 

I don't think so. A regex engine has to be very feature-rich, and because of that your program may run slower.

BlaXpirit
A: 

Depends very much on the language being parsed and on your definition of "better".

soulmerge
+2  A: 

Tokenization can usually be done with a finite state machine, which is equivalent in power to regular expressions. Sane regexps will be much easier to read and maintain than a homebrewed FSA. Use tools like flex or JFlex; they compile the regexps into minimal FSAs, giving very good performance. Building the FSA by hand should only be done as an exercise.

Lexer generators exist in several implementations, quite possibly for your favourite language.
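
As a rough illustration (a Python sketch with made-up token names, not tied to any particular grammar), the same scanning logic can be written as one regex of named alternatives; the regex engine then does the state-machine work that a hand-written loop does explicitly:

    import re

    TOKEN_RE = re.compile(r"""
        (?P<NUMBER>\d+)
      | (?P<IDENT>[A-Za-z_]\w*)
      | (?P<SYMBOL>[^\s\w])
      | (?P<SKIP>\s+)
    """, re.VERBOSE)

    def tokenize(text):
        tokens = []
        for match in TOKEN_RE.finditer(text):
            kind = match.lastgroup          # name of the alternative that matched
            if kind != "SKIP":              # drop whitespace, keep everything else
                tokens.append((kind, match.group()))
        return tokens

    print(tokenize("foo + 42"))
    # [('IDENT', 'foo'), ('SYMBOL', '+'), ('NUMBER', '42')]

Tools like flex and JFlex take this further by compiling the whole rule set ahead of time into a minimal DFA.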

johanbev
+1  A: 

Two questions you should ask:

a) If something should change, which one would be the easiest to maintain?

b) If it is working and you don't expect any change, do you really want to spend more time on it?

I'm sure the performance difference is small enough to ignore. Ease of development and minimizing potential bugs are the most important issues.

Peet Brits
Personally I would go for the regex, simply because it is cooler, but regexes can get hard to read and understand if the definition is too broad.
Peet Brits