views:

891

answers:

7

I'm writing a scanner as part of a compiler.

I'm having a major headache trying to write this one portion:

I need to be able to parse a stream of tokens and push them one by one into a vector, ignoring whitespace and tokenizing special symbols (simple case, lets just consider parentheses and braces)

Example: int main(){ }

should parse into 6 different tokens:

  1. int
  2. main
  3. (
  4. )
  5. {
  6. }

How would you go about solving this? I'm writing this in C++, but a java/C# solution would be appreciated as well.

Some points:

  1. and no, I can't use Boost, I can't guarantee that the libraries will be available to me. (don't ask...)

  2. I don't want to use lex, or any other special tools. I've never done this before and just want to try this once to say I've done it.

+2  A: 

Buy a copy of Compilers: Principles, Techniques, and Tools (the Dragon Book). What you're attempting to write is a lexer, not a "scanner".

Roger Lipscombe
That's an awefully in-depth book for just learning how to lex a string.
Eclipse
Stop suggesting that book already...
artificialidiot
To be fair, it IS a great book. albeit, more in-depth than is required for this task...
Brian Postow
He says he's writing it as part of a compiler. I figured that he might want to know how the rest of a compiler usually works.
Roger Lipscombe
+5  A: 

Stroustrup's book, The C++ Programming Language, has a great example in it about building a lexer/parser for a simple calculator program. It should serve as a good starting point to learn how to do what you want.

Kristo
+3  A: 

Why write your own - look at Lex.

If youmust have your own, you just read the input character by character and maintain some minimum state to accumulate identifiers.

The problem itself is not hard. If you can't solve it, you must be burned out, you just need a rest. Look at it again in the morning.

Arkadiy
"The problem itself is not hard." ::shrugs:: It is not uncommon for people to do dumb things the first time they see this problem. 'Course, I found it instructive to do it wrong a few times, and it is a small enough problem for that to happen quickly...
dmckee
dmckee, if I could upvote you I would. I'm getting a little sick of these "oh, thats a trivial problem" type responses...
eviljack
@eviljack: Don't be too hard on Arkadiy, lexing really _isn't_ hard. There are efficient solutions that can be discovered fairly easily. But most folks I've talk to had to try a couple of wrong ways first.
dmckee
d'oh! Guess I really am that sleep deprived.
eviljack
If @eviljack is writing a compiler, lexing really should not be hard for him. I assume he's just having a brain freeze.
Arkadiy
no, I think I really am that dumb...
eviljack
A: 

Is strtok a valid answer?

Brian Postow
how many brian postow's could there possibly be... I wonder...
Dan
I'm fairly certain that I'm the only one on the planet. Why? Do I know you?
Brian Postow
+1  A: 

umm.. I'd just do a while loop with iterators testing each character for type, and only an alpha to non alpha change, dump the string if it's non empty. if it's a non alpha non white space character, I'd just push it onto the token stack, this is really a trivial parsing task. Shoot, I've been meaning to learn lexx/yacc, but the level of parsing you want is really easy. I wrote a html tokenizer once which is more complicated that this.. I mean you are just looking for names, white space and single non alphanumeric characters.. just do it.

Dan
+2  A: 

If you really want to learn something from this exercise, just start coding. It doesn't demand a lot of code, so you can fail repeatedly without blowing more than an afternoon.

At this point you'll have a good feel for the problem.

Then look in any random compilers book to see what the "usual" methods are, and you'll grok then immediately.

dmckee
+1  A: 

If you want to write this from scratch, you could look into writing a finite state machine (states in an enum, a big switch/case block for state switching). You'd have to push the state to a stack since everything can be nested.

I know that this is not the ideal method; I'm just trying to directly address the question.

Ates Goral