ansaurus

Question

Is there a simple way I can tokenize a string without a full-blown lexer?

Answer 1

+5 A:

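How about regular expressions? You could easily write a regex to split the string the way you want, and the JS String.prototype.split method accepts a regex as its separator. If the regex contains a capturing group, the matched tokens are kept in the resulting array.

For example (extend the character class to cover every character you need):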

/([0-9]+|[-*+\/()])/
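A minimal sketch of how the split approach might look in practice. The helper name `tokenizeWithSplit` is hypothetical, and the pattern is a variant with the hyphen listed first in the character class so that it is a literal rather than a range. Because split also returns whatever sits between tokens, the sketch trims and drops empty pieces afterwards.

```javascript
// Sketch only: tokenizeWithSplit is a hypothetical helper, not part of the answer.
function tokenizeWithSplit(input) {
  // The capturing group makes split() keep the matched tokens in the result.
  // The hyphen comes first in the class so it is a literal, not a range.
  var tokenPattern = /([0-9]+|[-+*\/()])/;
  return input
    .split(tokenPattern)
    // Pieces between tokens are whitespace or empty strings; trim them.
    .map(function (piece) { return piece.trim(); })
    // Drop the empty pieces left over after trimming.
    .filter(function (piece) { return piece.length > 0; });
}

console.log(tokenizeWithSplit("((42 + 7) * 4)"));
// [ '(', '(', '42', '+', '7', ')', '*', '4', ')' ]
```

Note that characters the pattern does not recognise are not discarded by this approach; they come back as their own pieces, which can be handy for reporting unexpected input.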

Jani Hartikainen 2009-10-19 18:57:00

+1 It breaks for nested parentheses like `'((42 + 7) * 4)'` but that can be fixed by adding the parenthesis characters to the second half of the expression: `/([0-9]+|[-*+\/()])/`

brianpeiris 2009-10-19 19:32:42

He is still using the algorithm specified on the wiki page. The pseudo-code says "Read Token".

Simucal 2009-10-19 19:40:50

@Simucal, @KingNestor I'm confused now, isn't this the correct answer?

brianpeiris 2009-10-19 19:45:30

*Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.* http://www.codinghorror.com/blog/archives/001016.html

OscarRyz 2009-10-19 19:46:21

@brianpeiris, Oh, I'm not saying this isn't the answer. I was just commenting on the first line of Jani's answer, "If you don't want to write the algorithm specified on the wiki." The algorithm doesn't specify how to read in a token, it simply says "read token". So, in that way KingNestor *is* following the algorithm using this answer.

Simucal 2009-10-19 19:48:37

Excuse me if I'm missing something, but would you not use match, as opposed to split, on that regex? e.g. result = subject.match(/([0-9]+|[-*+\/()])/g);

Andre Artus 2010-06-08 09:47:59

Answer 2

+2 A:

You can use a global match as described at http://mikesamuel.blogspot.com/2009/05/efficient-parsing-in-javascript.html

Basically, you create one regex that describes a token

/[0-9]+|false|true|\(|\)/g

and put the 'g' on the end so it matches globally, and then you pass it to the input string's match method

var tokens = inputString.match(myRegex);

and get back an array.
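As a rough sketch of the above, assuming the token pattern from this answer. The helper name `tokenizeWithMatch` is made up for illustration. One thing to watch: match returns null rather than an empty array when nothing matches, so the sketch falls back to [].

```javascript
// Sketch only: tokenizeWithMatch is a hypothetical helper name.
function tokenizeWithMatch(input) {
  // One alternation per token kind; the 'g' flag returns every match.
  var tokenPattern = /[0-9]+|false|true|\(|\)/g;
  // String.prototype.match returns null when there is no match at all.
  return input.match(tokenPattern) || [];
}

console.log(tokenizeWithMatch("(true (42) false)"));
// [ '(', 'true', '(', '42', ')', 'false', ')' ]
```

Unlike the split approach, a global match silently skips any characters that no alternative recognises, so invalid input will not show up in the token array.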

Mike Samuel 2009-10-20 05:21:21

I think this is the best method. I use result = subject.match(/(-?[0-9]+|[-*+\/()])/g); You get the tokens you need, and the tokens you want :).

Andre Artus 2010-06-08 10:01:26
