views:

1171

answers:

5

I am setting out to do a side project that has the goal of translating code from one programming language to another. The languages I am starting with are PHP and Python (Python to PHP should be easier to start with), but ideally I would be able to add other languages with (relative) ease. The plan is:

  • This is geared towards web development. The original and target code will be be sitting on top of frameworks (which I will also have to write). These frameworks will embrace an MVC design pattern and follow strict coding conventions. This should make translation somewhat easier.

  • I am also looking at IOC and dependency injection, as they might make the translation process easier and less error prone.

  • I'll make use of Python's parser module, which lets me fiddle with the Abstract Syntax Tree. Apparently the closest I can get with PHP is token_get_all(), which is a start.

  • From then on I can build the AST, symbol tables and control flow.

Then I believe I can start outputting code. I don't need a perfect translation. I'll still have to review the generated code and fix problems. Ideally the translator should flag problematic translations.

Before you ask "What the hell is the point of this?" The answer is... It'll be an interesting learning experience. If you have any insights on how to make this less daunting, please let me know.


EDIT:

I am more interested in knowing what kinds of patterns I could enforce on the code to make it easier to translate (ie: IoC, SOA ?) the code than how to do the translation.

+47  A: 

I've been building tools (DMS Software Reengineering Toolkit) to do this kind of thing since 1995, supported by a strong team of computer scientists. It provides generic parsing, AST building, symbol tables, control and data flow analysis, application of translation rules, regeneration of source text with comments, etc., all parameterized by explicit definitions of computer languages.

The amount of machinery you need to do this well is vast, and then you need reliable parsers for langauges with unreliable definitions (PHP is perfect example of this).

There's nothing wrong with you thinking about it or attempting it, but I think you'll find this a much bigger task than you expect. We have some 100 man-years invested in just DMS, and another 6-12 months in each "reliable" language definition (including the one we painfully built for PHP), much more for nasty languages such as C++. It will be a "hell of a learning experience"; it has been for us. (You might find the technical Papers section at the above website interesting to jump start that learning).

People often attempt to build some kind of generalized machinery by starting with some piece of technology with which they are familiar, that does a part of the job. (Python ASTs are great example). The good news, is that part of the job is done. The bad news is that machinery has a zillion assumptions built into it, most of which you won't discover until you try to wrestle it into doing something else. At that point you find out the machinery is wired to do what it originally does, and will really, really resist your attempt to make it do something else. (I suspect trying to get the Python AST to model PHP is going to be a lot of fun).

The reason I started to build DMS originally was to build foundations that had very few such assumptions built in. It has some that give us headaches. So far, no black holes. (The hardest part of my job over the last 15 years is to try to prevent such assumptions from creeping in).

Lots of folks also make the mistake of assuming that if they can parse (and perhaps get an AST), they are well on the way to doing something complicated. One of the hard lessons is that you need symbol tables and flow analysis to do good program analysis or transformation. ASTs are necessary but not sufficient. This is the reason that Aho&Ullman's compiler book doesn't stop at chapter 2. (The OP has this right in that he is planning to build additional machinery beyond the AST).

The remark about "I don't need a perfect translation" is troublesome. What weak translators do is convert the "easy" 80% of the code, leaving the hard 20% to do by hand. If the applications you intend to convert are pretty small, well, then that 20% is OK. If you attempt to convert 100K SLOC then 20% is 20,000 original lines of code that are hard to translate to understand and modify in the context of another 80,000 lines of program you don't understand. That takes a huge amount of effort. At the million line level, this is simply impossible in practice.

What you have to shoot for to translate large-scale systems is high nineties percentage conversion rates, or it is likely that you can't complete the manual part of the translation activity.

I consider our tools to be extremely good (but then, I'm pretty biased). And it is still very hard to build a good translator. The difference is that with this much machinery, we succeed considerably more often than we fail.

Ira Baxter
Thank you, that's a very insightful answer. I am aware that it'll be difficult. I might be underestimating it, but I am planning on having some functional translation *from* Python to PHP in a couple of months or so. I know even parsing PHP is notoriously hard, but I'll cross that bridge when I get to it.
NullUserException
Have you ever considered contributing your "painfully built" PHP definition back to the PHP community at large, or is to too closely associated with your own revenue stream to make that feasible?
TML
I've been asked to make everything we do "open source" by lots of folks that didn't want to contribute to a revenue stream, and didn't have the energy to do the work and open source it themselves. If you only contribute a small part to a very big project, and/or you have another source of income, "open source" seems fine. If you've done all the work yourself and its your only source of income, this is a lot less attractive. [I don't want to get into a discussion about the relative merits of "free software" philosophy, so I won't participate in any futher comments along this line]
Ira Baxter
I agree with what you said here, which is why I phrased the question as I did. I guess we are to intuit from that response that you feel it's too closely tied to your revenue, and there's absolutely nothing wrong with that - I just thought it worth asking.
TML
+5  A: 

My answer will address the specific task of parsing Python in order to translate it to another language, and not the higher-level aspects which Ira addressed well in his answer.

In short: do not use the parser module, there's an easier way.

The ast module, available since Python 2.6 is much more suitable for your needs, since it gives you a ready-made AST to work with. I've written an article on this last year, but in short, use the parse method of ast to parse Python source code into an AST. The parser module will give you a parse tree, not an AST. Be wary of the difference.

Now, since Python's ASTs are quite detailed, given an AST the front-end job isn't terribly hard. I suppose you can have a simple prototype for some parts of the functionality ready quite quickly. However, getting to a complete solution will take more time, mainly because the semantics of the languages are different. A simple subset of the language (functions, basic types and so on) can be readily translated, but once you get into the more complex layers, you'll need heavy machinery to emulate one language's core in another. For example consider Python's generators and list comprehensions which don't exist in PHP (to my best knowledge, which is admittedly poor when PHP is involved).

To give you one final tip, consider the 2to3 tool created by the Python devs to translate Python 2 code to Python 3 code. Front-end-wise, it has most of the elements you need to translate Python to something. However, since the cores of Python 2 and 3 are similar, no emulation machinery is required there.

Eli Bendersky
Weeeell. `2to3` is just AST to AST. It doesn't support doing anything that goes beyond the capabilities of the `ast` module. Notice that all of the translations go *from* syntax supported by the host python process *to* syntax supported by the host python process. There's no translator that does adds, say, function annotations, because 2.6 doesn't support it.
Aaron Gallagher
... and the OP's question might be framed, short term, how to get from Python 2.6 AST to ... something in PHP. The ast module likely won't want to represent the PHP syntax well, so its not even ast to ast.
Ira Baxter
@Aaron: `2to3` can be seen as an example of using the AST generated from `ast`.
Eli Bendersky
@Eli: AFAIK, 2to3 arguably is an easier translation than Python to PHP (after all, its Python to Python, right)? And even it doesn't work particularly well. Notice the large amount of Python 2.6 that hasn't been shoved through 2to3 yet... because there's apparantly a bunch of post translation hand patching that still has to be done. If were 100% automated, Python 2.6 would be dead.
Ira Baxter
+1  A: 

There are a couple answers telling you not to bother. Well, how helpful is that? You want to learn? You can learn. This is compilation. It just so happens that your target language isn't machine code, but another high-level language. This is done all the time.

There's a relatively easy way to get started. First, go get http://sourceforge.net/projects/lime-php/ (if you want to work in PHP) or some such and go through the example code. Next, you can write a lexical analyzer using a sequence of regular expressions and feed tokens to the parser you generate. Your semantic actions can either output code directly in another language or build up some data structure (think objects, man) that you can massage and traverse to generate output code.

You're lucky with PHP and Python because in many respects they are the same language as each other, but with different syntax. The hard part is getting over the semantic differences between the grammar forms and data structures. For example, Python has lists and dictionaries, while PHP only has assoc arrays.

The "learner" approach is to build something that works OK for a restricted subset of the language (such as only print statements, simple math, and variable assignment), and then progressively remove limitations. That's basically what the "big" guys in the field all did.

Oh, and since you don't have static types in Python, it might be best to write and rely on PHP functions like "python_add" which adds numbers, strings, or objects according to the way Python does it.

Obviously, this can get much bigger if you let it.

Ian
Actually, I didn't say "don't bother". What I said was, "translating languages in a general ways is very hard".If the OP proceeds down his original path of using Python trees to try and generate PHP, he will learn a lot and I'm all in favor of the learning experience; I started there, too. He won't be able to add new languages easily.
Ira Baxter
A: 

You could take a look at the Vala compiler, which translates Vala (a C#-like language) into C.

ptomato
+2  A: 

Writing a translator isn't impossible, especially considering that Joel's Intern did it over a summer.

If you want to do one language, it's easy. If you want to do more, it's a little more difficult, but not too much. The hardest part is that, while any turing complete language can do what another turing complete language does, built-in data types can change what a language does phenomenally.

For instance:

word = 'This is not a word'
print word[::-2]

takes a lot of C++ code to duplicate (ok, well you can do it fairly short with some looping constructs, but still).

That's a bit of an aside, I guess.

Have you ever written a tokenizer/parser based on a language grammar? You'll probably want to learn how to do that if you haven't, because that's the main part of this project. What I would do is come up with a basic Turing complete syntax - something fairly similar to Python bytecode. Then you create a lexer/parser that takes a language grammar (perhaps using BNF), and based on the grammar, compiles the language into your intermediate language. Then what you'll want to do is do the reverse - create a parser from your language into target languages based on the grammar.

The most obvious problem I see is that at first you'll probably create horribly inefficient code, especially in more powerful* languages like Python.

But if you do it this way then you'll probably be able to figure out ways to optimize the output as you go along. To summarize:

  • read provided grammar
  • compile program into intermediate (but also Turing complete) syntax
  • compile intermediate program into final language (based on provided grammar)
  • ...?
  • Profit!(?)

*by powerful I mean that this takes 4 lines:

myinput = raw_input("Enter something: ")
print myinput.replace('a', 'A')
print sum(ord(c) for c in myinput)
print myinput[::-1]

Show me another language that can do something like that in 4 lines, and I'll show you a language that's as powerful as Python.

Wayne Werner
"Have you ever written a tokenizer/parser based on a language grammar? " I have done it using JavaCC.
NullUserException
Joel's intern did a partial job over a summer. His source language was a subset of an existing language, and presumably this subset could be adjusted somewhat. That makes the job a lot easier. Similarly, NullPointerException might want to start with the easier parts of Python, perhaps passing through the harder stuff for manual conversion (as noted in the questions).
David Thornley
@NullUserException: You'll have some exposure, but you'll basically be doing a re-implementation of JavaCC, only instead of Java as the output language, you'll be doing <insert langauge here>. @David, quite so. Even Thistle needs some help on some of the language constructs. If I were the OP, I'd go for functional first, then optimize, otherwise I'd be stuck forever trying to get C++ to do string slicing (with steps) :p
Wayne Werner