views:

65

answers:

3

I'd like to get a solid understanding of the low level process for representing and running a program. I've decided to do this by writing a program to parse and display object file information (headers, sections, etc.). I've nearly finished this part. A natural extension is to decompile the remaining relevant data into assembly instructions. Initially, I'll focus on x86.

Where can I find resources related to this decompilation (binary -> ASM)? I've read that x86 has a one to one correspondence to ASM, although I do not know the best reference from which to pull the translation tables.

Also, while I'm at it, I'd be interested in tracking any supplied debugging information. Are there any references on the format used for this information (lets assume ELF and GCC with -g option)?

Do any of you have any general advice? The goal here is a hands-on project to increase my understanding.

+1  A: 

Check out my virtual machine tutorial: http://www.icemanind.com

It teaches you step-by-step how to build your own virtual machine and assembler. By reading and completing the program in the tutorial, you will have a more firm grasp on how programs, assembly language and binary works

icemanind
+1  A: 

You can find a meticulously documented python disassembler for the 8086 at google-code:http://code.google.com/p/dasm3/

transistorski
+1  A: 

x86 is variable instruction length, which means very difficult to disassemble. Not advisable if this is your first disassembler.

Saying that...the approach I take is that you have to identify in the binary the bytes that are the first byte of an opcode and separate those from bytes that are second or other bytes in the opcode or data. Once you know that you can start at the beginning of the binary and disassemble the opcodes.

How do yo figure out opcodes from other bytes? You need to walk all possible execution paths, sounds like a recursion problem, and could be but doesnt have to be. Look at the interrupt vector table and/or all hardware entry points in to the code. That gives you a short list of opcode bytes. A non-recursion approach is to make many passes over the binary looking at each byte that is marked an opcode, decode it just enough to know how many bytes it consumes. You also need to know if it is an unconditional branch, conditional branch, return, call, etc. If it is not an unconditional branch or return you can assume the byte after this instruction is the first byte of the next instruction. Any time you encounter a branch or call of some sort, compute the destination address, add that byte to the list. Keep making passes until you have made a pass that adds no new bytes to the list. You also need to make sure that if say you find a byte that is a 3 byte instruction, but the byte after it is marked as an instruction, then you have a problem. Things like conditional branches that are preceeded by something that insures they will never branch. You dont see this much if at all with high level code compiled to a binary, but the good old days of hand written assembler, or folks that want to protect their code will do things like this.

Unfortunately if all you have is the binary, for a variable length instruciton set, you wont get a perfect disassembly. Some branch destinations are computed at runtime, sometimes hand coded assembly will modify the stack before doing a return to change what code executes next, if that is the only path to that code then you likely wont figure it out programmatically unless you go so far as to simulate the code. And even with simulation you wont cover all execution paths.

With a fixed length instruction set like an ARM for example (so long as it is arm and not a mixture of arm and thumb) you can simply start at the beginning of the binary and disassemble until you run out of words. You might disassemble a data word into a valid or invalid or unlikely to be used instruction, but that is fine.

I wouldnt be surprised if somewhere in the elf there is something that indicates what parts of the binary are executable and what parts are data. maybe even so much that you dont have to walk the data paths, I doubt objdump performs a task like that it probably uses something in the elf file.

The elf file format is documented in many places. There is the basic structure and vendors may add specific block types which would be documented by the vendor.

dwelch
great answer! thanks very much for taking the time to contribute
Willi Ballenthin