tags:

views:

108

answers:

3

Sorry I couldn't think of a better title, but thanks for reading!

My ultimate goal is to read a .java file, parse it, and pull out every identifier. Then store them all in a list. Two preconditions are there are no comments in the file, and all identifiers are composed of letters only.

Right now I can read the file, parse it by spaces, and store everything in a list. If anything in the list is a java reserved word, it is removed. Also, I remove any loose symbols that are not attached to anything (brackets and arithmetic symbols).

Now I am left with a bunch of weird strings, but at least they have no spaces in them. I know I am going to have to re-parse everything with a . delimiter in order to pull out identifiers like System.out.print, but what about strings like this example:

Logger.getLogger(MyHash.class.getName()).log(Level.SEVERE,

After re-parsing by . I will be left with more crazy strings like:

getLogger(MyHash

getName())

log(Level

SEVERE,

How am I going to be able to pull out all the identifiers while leaving out all the trash? Just keep re-parsing by every symbol that could exist in java code? That seems rather lame and time consuming. I am not even sure if it would work completely. So, can you suggest a better way of doing this?

+3  A: 

There are several solutions that you can use, other than hacking your-own parser:

  • Use an existing parser, such as this one.
  • Use BCEL to read bytecode, which includes all fields and variables.
  • Hack into the compiler or run-time, using annotation processing or mirrors - I'm not sure you can find all identifiers this way, but fields and parameters for sure.
Little Bobby Tables
I think you can also use, http://asm.ow2.org which also comes with an eclipse plugin.
phoenix24
A: 

Wow, ok. Parsing is hard -- really hard -- to do right. Rolling your own java parser is going to be incredibly difficult to do right. You'll find there are a lot of edge cases you're just not prepared for. To really do it right, and handle all the edge cases, you'll need to write a real parser. A real parser is composed of a number of things:

  1. A lexical analyzer to break the input up into logical chunks
  2. A grammar to determine how to interpret the aforementioned chunks
  3. The actual "parser" which is generated from the grammar using a tool like ANTLR
  4. A symbol table to store identifiers in
  5. An abstract syntax tree to represent the code you've parsed

Once you have all that, you can have a real parser. Of course you could skip the abstract syntax tree, but you need pretty much everything else. That leaves you with writing about 1/3 of a compiler. If you truly want to complete this project yourself, you should see if you can find an example for ANTLR which contains a preexisting java grammar definition. That'll get you most of the way there, and then you'll need to use ANTLR to fill in your symbol table.

Alternately, you could go with the clever solutions suggested by Little Bobby Tables (awesome name, btw Bobby).

Benson
"Wow, ok. Parsing is hard -- really hard -- to do right" - So you they didn't offer "Compilers 302" at your university? :-)
Stephen C
In compilation terms, what he wants actually belongs to the lexing phase, not parsing. And lexing is not that hard.
Oak
I suppose you're right. In a real compiler it would be the parser that actually puts things in the symbol table, but in this case you can find identifiers in the lexer, and that's good enough.
Benson
@StephenC Yeah, the offered compilers, and it was a 400 level class not everyone took. I did. Out of curiosity, how many compilers have you written? Parsing is certainly nontrivial and if you're starting from String.split(), I think it's fair to say real parsing is really hard.
Benson
@Benson - "Out of curiosity, how many compilers have you written?" I've lost count. But the first two were hand-written parsers for Pascal ... in 1977. But, my point is that parsing is not hard at all if you use the right tools for the task; e.g. a lexer generator in this case.
Stephen C
Perhaps you've lost touch with what it's like to write your first lexer / parser.
Benson
+1  A: 

I wouldn't separate the entire file at once according to whitespace. Instead, I would scan the file letter-by-letter, saving every character in a buffer until I'm sure an identifier has been reached.

In pseudo-code:

clean buffer
for each letter l in file:
    if l is '
        toggle "character mode"
    if l is "
        toggle "string mode"
    if l is a letter AND "character mode" is off AND "string mode" is off
        add l to end of buffer
    else
        if buffer is NOT a keyword or a literal
            add buffer to list of identifiers
        clean buffer

Notice some lines here hide further complexity - for example, to check if the buffer is a literal you need to check for both true, false, and null.

In addition, there are more bugs in the pseudo-code - it will find identify things like the e and L parts of literals (e in floating-point literals, L in long literals) as well. I suggest adding additional "modes" to take care of them, but it's a bit tricky.

Also there are a few more things if you want to make sure it's accurate - for example you have to make sure you work with unicode. I would strongly recommend investigating the lexical structure of the language, so you won't miss anything.

EDIT:

  • This solution can easily be extended to deal with identifiers with numbers, as well as with comments.
  • Small bug above - you need to handle \" differently than ", same with \' and '.
Oak