So is a decompiler really a thing that gives the source of a compiled/interpreted piece of code? Because to me that sounds impossible. How would you get the names of the functions, variables, classes, etc. if it is compiled? Or am I misinterpreting the definition? How does it work? And what is the general principle behind making one?

+2  A: 

You're right about your definition of a decompiler: it takes a compiled application and produces source code to match. However, it does not in most cases know the names of variables/functions/classes--it just guesses. It analyzes the flow of the program and tries to find a way to represent that flow through a certain programming language, typically C. However, because the programming language of choice (C, in this example) is often at a higher level than the state of the underlying program (a binary executable), some parts of the program might be impossible to represent accurately; in this case, the decompiler would fail and you would need to use a disassembler.


Per your followup: making a decompiler is not a simple task. Basically, you have to take the application that you are decompiling (be it an executable or some other form of compiled application) and parse it into some kind of tree you can work with in memory. You would then analyze the flow of the program and try to find patterns that suggest a variable was used at a certain location in the code, then do the same for functions, classes, etc. It's all really just a guessing game: you have to know the patterns that the compiler produces in compiled code, then search for those patterns and replace them with equivalent human-readable source code.
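
To make the "search for patterns and replace them with source" idea a bit more concrete, here is a minimal sketch in Python. It assumes a made-up stack machine with only `PUSH`, `LOAD`, `ADD`, `MUL` and `STORE` instructions (none of this comes from a real instruction set), and the `var_N` names are invented by the decompiler because the "binary" carries no names at all.

```python
def decompile(instructions):
    """Rebuild assignment statements from a toy stack-machine program."""
    stack = []   # expression strings that are currently "on the stack"
    lines = []   # recovered source lines
    for op, *args in instructions:
        if op == "PUSH":
            # A constant: it appears in the source as a literal.
            stack.append(str(args[0]))
        elif op == "LOAD":
            # A variable slot: the original name is gone, so make one up.
            stack.append(f"var_{args[0]}")
        elif op in ("ADD", "MUL"):
            # Pattern: two operands followed by an operator is a binary expression.
            right = stack.pop()
            left = stack.pop()
            symbol = "+" if op == "ADD" else "*"
            stack.append(f"({left} {symbol} {right})")
        elif op == "STORE":
            # Pattern: an expression followed by a store is an assignment.
            lines.append(f"var_{args[0]} = {stack.pop()}")
        else:
            raise ValueError(f"unknown instruction: {op}")
    return lines


# Hypothetical compiled program: slot 2 = slot 0 * 2 + slot 1
program = [
    ("LOAD", 0),
    ("PUSH", 2),
    ("MUL",),
    ("LOAD", 1),
    ("ADD",),
    ("STORE", 2),
]

print("\n".join(decompile(program)))
# prints: var_2 = ((var_0 * 2) + var_1)
```

A real decompiler has to deal with registers, jumps, calling conventions and optimizer output, so treat this only as an illustration of the principle.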

This is all much simpler for higher-level platforms like Java or .NET, where you don't have to deal with assembly instructions, and things like variables are mostly taken care of for you. There, you don't have to guess as much as just directly translate. However, on a low level (with a C decompiler, for example), there really is a lot of guesswork involved, and if the decompiler cannot properly guess the way something was done in the source, it fails and cannot continue. This is why many people like to obfuscate their code: it makes it much harder for decompilers to make sense of it.

Disclaimer: I have never written a decompiler and thus don't know every detail of what I'm talking about. If you are really interested in writing a decompiler, you should get a book on the topic--it will help you more than anyone on SO can.

musicfreak
What is the general principle behind making one though?
thyrgle
Actually, you can often get the variable names for Java and unstripped gcc debug-ready executables.
paxdiablo
@paxdiablo: Yep, that's why I made sure to include "in most cases" in there, because there are a few cases where you *can* actually get source code that looks almost identical to the original. :)
musicfreak
I'm wondering "how do you make one". Sorry, I'm kinda slow.
thyrgle
@thyrgle: I edited my answer and gave you a more detailed explanation. Please note, though, that I've never written a decompiler and I don't know enough to actually get you started in writing one, but I know enough to give you an overall idea of how they work.
musicfreak
@musicfreak: Ok, thanks!!!
thyrgle
Yep, no problem! :)
musicfreak
A: 

A decompiler basically takes the machine code and converts it back to the language it was written in. If I'm not mistaken, the decompiler needs to know what language the program was compiled from, otherwise it won't work.

The basic purpose of the decompiler is to get back to your source code; for example, one time my Java file got corrupted and the only thing I could do to bring it back was to use a decompiler (since the class file wasn't corrupted).

DDP
A: 

It works by deducing a "reasonable" (based on some heuristics) representation of what's in the object code. The degree of resemblance between what it produces and what was originally there tends to depend heavily upon how much information is contained in the binary it starts from. If you start with basically a "pure" binary, the decompiler is generally stuck with just making up "reasonable" names for the variables, such as using things like i, j and k for loop indexes, and longer names for most others.
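
As a rough sketch of what "making up reasonable names" could look like, here is a small Python function. The `(slot_id, role)` input format, the `synthesize_names` name and the `local_N` / `idx_N` naming scheme are all assumptions for illustration; an actual decompiler would derive the roles from its own flow analysis.

```python
def synthesize_names(slots):
    """Invent readable names for variable slots that have no names left.

    `slots` is a list of (slot_id, role) pairs, where role is either
    "loop_index" or "other" (hypothetical output of an earlier analysis pass).
    """
    loop_names = iter("ijk")   # conventional short names for loop counters
    names = {}
    for slot_id, role in slots:
        if role == "loop_index":
            # Fall back to a generic name once i, j and k are used up.
            names[slot_id] = next(loop_names, f"idx_{slot_id}")
        else:
            names[slot_id] = f"local_{slot_id}"
    return names


print(synthesize_names([(0, "other"), (1, "loop_index"), (2, "loop_index")]))
# prints: {0: 'local_0', 1: 'i', 2: 'j'}
```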

On the other hand, a language that supports introspection needs to embed a great deal more information about variable names, types, etc., into the executable. In a case like this, the decompiler can often produce something quite similar to the original: it typically retains the original names for functions, variables, etc., possibly losing little more than formatting and comments.
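
Python happens to be one such language, so it makes for an easy demonstration. The sketch below compiles a small function (the `area` function and the `<demo>` filename are made up for the example) and then reads the names back out of the compiled code object using the standard `co_name` and `co_varnames` attributes.

```python
source = """
def area(width, height):
    total = width * height
    return total
"""

# Compile to bytecode; from here on we only look at the compiled form.
module_code = compile(source, "<demo>", "exec")

# The function body is stored as a nested code object among the module's constants.
function_code = next(
    const for const in module_code.co_consts if hasattr(const, "co_varnames")
)

print(function_code.co_name)      # prints: area
print(function_code.co_varnames)  # prints: ('width', 'height', 'total')
```

Because the names survive compilation, a decompiler for this kind of format can reuse them directly instead of guessing.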

Jerry Coffin