ansaurus

Question

Need some ideas on how to acomplish this in Java (parsing strings)

Answer 1

+3 A:

There are several solutions that you can use, other than hacking your-own parser:

Use an existing parser, such as this one.
Use BCEL to read bytecode, which includes all fields and variables.
Hack into the compiler or run-time, using annotation processing or mirrors - I'm not sure you can find all identifiers this way, but fields and parameters for sure.

Little Bobby Tables 2010-05-02 06:14:43

I think you can also use, http://asm.ow2.org which also comes with an eclipse plugin.

phoenix24 2010-05-02 07:02:04

Answer 2

A:

Wow, ok. Parsing is hard -- really hard -- to do right. Rolling your own java parser is going to be incredibly difficult to do right. You'll find there are a lot of edge cases you're just not prepared for. To really do it right, and handle all the edge cases, you'll need to write a real parser. A real parser is composed of a number of things:

A lexical analyzer to break the input up into logical chunks
A grammar to determine how to interpret the aforementioned chunks
The actual "parser" which is generated from the grammar using a tool like ANTLR
A symbol table to store identifiers in
An abstract syntax tree to represent the code you've parsed

Once you have all that, you can have a real parser. Of course you could skip the abstract syntax tree, but you need pretty much everything else. That leaves you with writing about 1/3 of a compiler. If you truly want to complete this project yourself, you should see if you can find an example for ANTLR which contains a preexisting java grammar definition. That'll get you most of the way there, and then you'll need to use ANTLR to fill in your symbol table.

Alternately, you could go with the clever solutions suggested by Little Bobby Tables (awesome name, btw Bobby).

Benson 2010-05-02 06:24:48

"Wow, ok. Parsing is hard -- really hard -- to do right" - So you they didn't offer "Compilers 302" at your university? :-)

Stephen C 2010-05-02 07:01:06

In compilation terms, what he wants actually belongs to the lexing phase, not parsing. And lexing is not that hard.

Oak 2010-05-02 07:19:09

I suppose you're right. In a real compiler it would be the parser that actually puts things in the symbol table, but in this case you can find identifiers in the lexer, and that's good enough.

Benson 2010-05-02 20:48:28

@StephenC Yeah, the offered compilers, and it was a 400 level class not everyone took. I did. Out of curiosity, how many compilers have you written? Parsing is certainly nontrivial and if you're starting from String.split(), I think it's fair to say real parsing is really hard.

Benson 2010-05-02 20:51:12

@Benson - "Out of curiosity, how many compilers have you written?" I've lost count. But the first two were hand-written parsers for Pascal ... in 1977. But, my point is that parsing is not hard at all if you use the right tools for the task; e.g. a lexer generator in this case.

Stephen C 2010-05-02 23:24:38

Perhaps you've lost touch with what it's like to write your first lexer / parser.

Benson 2010-05-03 03:04:45

Answer 3

+1 A:

I wouldn't separate the entire file at once according to whitespace. Instead, I would scan the file letter-by-letter, saving every character in a buffer until I'm sure an identifier has been reached.

In pseudo-code:

clean buffer
for each letter l in file:
    if l is '
        toggle "character mode"
    if l is "
        toggle "string mode"
    if l is a letter AND "character mode" is off AND "string mode" is off
        add l to end of buffer
    else
        if buffer is NOT a keyword or a literal
            add buffer to list of identifiers
        clean buffer

Notice some lines here hide further complexity - for example, to check if the buffer is a literal you need to check for both true, false, and null.

In addition, there are more bugs in the pseudo-code - it will find identify things like the e and L parts of literals (e in floating-point literals, L in long literals) as well. I suggest adding additional "modes" to take care of them, but it's a bit tricky.

Also there are a few more things if you want to make sure it's accurate - for example you have to make sure you work with unicode. I would strongly recommend investigating the lexical structure of the language, so you won't miss anything.

EDIT:

This solution can easily be extended to deal with identifiers with numbers, as well as with comments.
Small bug above - you need to handle \" differently than ", same with \' and '.

Oak 2010-05-02 07:10:38

ansaurus

tags:

views:

answers:

Need some ideas on how to acomplish this in Java (parsing strings)

related questions