views:

180

answers:

2

I'm looking to implement the Shunting-yard Algorithm, but I need some help figuring out what the best way to split up a string into its tokens is.

If you notice, the first step of the algorithm is "read a token." This isn't exactly a non-trivial thing to do. Tokens can consist of numbers, operators and parens.

If you are doing something like:

(5+1)

A simple string.split() will give me an array of the tokens { "(", "5", "+", "1", ")" }.

However, it becomes more complicated if you have numbers with multiple digits such as:

((2048*124) + 42)

Now a naive string.split() won't do the trick. The multi-digit numbers are a problem.

I know I could write a lexer, but is there a way to do this without writing a full-blown lexer?

I'm implementing this in JavaScript and I'd like to avoid having to go down the lexer-path if possible. I'll be using the "*", "+", "-" and "/" operators, along with integers.

+5  A: 

How about regular expressions? You could easily write regex to split it the way you want, and the JS string.split method accepts regex as the parameter too.

For example... (modify to include all chars you need etc)

/([0-9]+|[*+-\/()])/
Jani Hartikainen
+1 It breaks for nested parentheses like `'((42 + 7) * 4)'` but that can be fixed by adding parentheses to the second half of the expression: `/([0-9]+|[*+-\/()])/`
brianpeiris
He is still using the algorithm specified on the wiki page. The pseudo-code says "Read Token".
Simucal
@Simucal, @KingNestor I'm confused now, isn't this the correct answer?
brianpeiris
*Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.* http://www.codinghorror.com/blog/archives/001016.html
OscarRyz
@brianpeiris, Oh, I'm not saying this isn't the answer. I was just commenting on the first line of Jani's answer, "If you don't want to write the algorithm specified on the wiki." The algorithm doesn't specify how to read in a token, it simply says "read token". So, in that way KingNestor *is* following the algorithm using this answer.
Simucal
Excuse me if I'm missing something, but would you not use match, as opposed to split, on that regex? e.g. result = subject.match(/([0-9]+|[*+-\/()])/img);
Andre Artus
+2  A: 

You can use a global match as described at http://mikesamuel.blogspot.com/2009/05/efficient-parsing-in-javascript.html

Basically, you create one regex that describes a token

/[0-9]+|false|true|\(|\)/g

and put the 'g' on the end so it matches globally, and then you call its match method

var tokens = myRegex.match(inputString);

and get back an array.

Mike Samuel
I think this is the best method. I use result = subject.match(/(-?[0-9]+|[*+-\/()])/g); You get the tokens you need, and the tokens you want :).
Andre Artus