views:

283

answers:

2

I'm building a compiler/assembler/linker in Java for the x86-32 (IA32) processor targeting Windows.

High-level concepts (I do not have any "source code": there is no syntax nor lexical translation, and all languages are regular) are translated into opcodes, which then are wrapped and outputted to a file. The translation process has several phases, one is the translation between regular languages: the highest-level code is translated into the medium-level code which is then translated into the lowest-level code (probably more than 3 levels).

My problem is the following; if I have higher-level code (X and Y) translated to lower-level code (x, y, U and V), then an example of such a translation is, in pseudo-code:

x + U(f) // generated by X
+
V(f) + y // generated by Y

(An easy example) where V is the opposite of U (compare with a stack push as U and a pop as V). This needs to be 'optimized' into:

x + y

(essentially removing the "useless" code)

My idea was to use regular expressions. For the above case, it'll be a regular expression looking like this: x:(U(x)+V(x)):null, meaning for all x find U(x) followed by V(x) and replace by null. Imagine more complex regular expressions, for more complex optimizations. This should work on all levels.

What do you suggest? What would be a good approach to optimize and produce fast x86 assembly?

+6  A: 

What you should actually do is build an Abstract Syntax Tree (AST).

It is a representation of the source code in the form of a tree, that is much easier to work with, especially to make transformations and optimizations.

That code, represented as a tree, would be something like:

(+
    (+
        x
        (U f))
    (+
        (V f)
        y))

You could then try to make some transformations: a sum of sums is a sum of all the terms:

(+
    x
    (U f)
    (V f)
    y)

Then you could scan the tree and you could have the following rules:

  • (+ (U x) (V x)) = 0, for all x
  • (+ 0 x1 x2 ...) = x, for all x1, x2, ...

Then you would obtain what you are looking for:

(+ x y)

Any good book on compiler-writing will discuss a lot on ASTs. Functional programming languages are specially suited for this task, since in general it is easy to represent trees and to do pattern matching to parse and transform the tree.

Usually, for this task, you should avoid using regular expressions. Regular expressions define what mathematicians call regular languages. Any regular language can be parsed by a set of regular expressions. However, I think your language is not regular, so it cannot be properly parsed by regexps.

People try, and try, and try to parse languages such as HTML using regular expressions. This has been extensively discussed here in SO, and you cannot parse HTML with regular expressions. There will always be an exceptional case in which your regular expressions would fail, and you would have to adapt it.

It might be the same with your language: if it is not regular, you should avoid lots of headaches and not try to parse it (and especially "transform" it) using regular expressions.

Bruno Reis
Yes, I vaguely use the concepts of Abstract Syntax Tree (but I don't call it that way, because I do not have any syntax). You say "then you could scan the tree and you could have the following rules": so you think it's good to use regular expressions, as my idea was?
Pindatjuh
No, regular expressions might not be a good idea. I will comment on that on my answer.
Bruno Reis
Thank you. All level's languages are regular.
Pindatjuh
If they are all regular, you *could* try the hard, err, regular-expression way. However, I'm pretty sure that it would be a lot easier to construct a lexer and parser, that would transform the source into a AST, then optimize it. I pretty sure some can suggest you a port of Lex and Yacc for Java, if you have the possibility to go that way! Anyway, good luck with your compiler.
Bruno Reis
It's not quite Java, but there's a book called "Beginning F#", by Robert Pickering. In his book, chapter 12 (11? don't remember) is dedicated to build a compiler. Only one chapter that teaches you how to do that, exploiting the pattern-matching power of functional languages (in this case, F#).
Bruno Reis
I've updated the question. I presume that you've misinterpreted my question. Thanks for providing the, very useful in a traditional compiler, information.
Pindatjuh
+3  A: 

I'm having a lot of trouble understanding this question, but I think you will find it useful to learn something about term-rewriting systems, which seems to be what you are proposing. Whether the mechanism is tree rewriting (always works) or regular expressions (will work for some languages some of the time and other languages all of the time) is of secondary importance.

It is definitely possible to optimize object code by term rewriting. You probably also will benefit from learning something about peephole optimization; a good place to start, because it is very strong on the fundamentals, is a paper by Davidson and Fraser on a retargetable peephole optimizer. There's also excellent later work by Benitez and Davidson.

Norman Ramsey
Thank you! I'll go for peephole optimization, and I guess I can implement it using regular expressions. "It's definately possible to optimize object code by term rewriting"; that was the answer I was looking for.
Pindatjuh