views:

1895

answers:

23
+17  Q: 

Learning assembly

I decided to learn assembly language. The main reason is to be able to understand disassembled code, and maybe to write more efficient parts of code (for example, from C++), do things like code caves, etc. I saw there are a zillion different flavors of assembly, so, for the purposes I mention, how should I start? What kind of assembly should I learn? I want to learn by first writing some easy programs (e.g. a calculator), but the goal itself is to get accustomed to it so I can understand the code shown, for example, by IDA Pro.

I'm using Windows (if that makes any difference).

Thanks

edit: So, it seems everyone is pointing towards MASM. Although I get the point that it has high-level capabilities, all good for the assembly programmer, that's not what I'm looking for. It seems to have if, invoke, etc. directives that don't show up in popular disassemblers (like IDA). So what I'd like to hear, if possible, is the opinion of anyone who uses ASM for the purposes I'm asking about (reading a disassembled exe's code in IDA), not just "general" assembly programmers.

edit: OK. I am already learning assembly. I am learning MASM, without the high-level stuff that doesn't matter to me. Right now I'm trying out my code in __asm blocks in C++, so I can try things out much faster than if I had to do everything from scratch with MASM.

+9  A: 

Start with MASM32 and from there look at FASM. But you'll have fun with MASM.

Noon Silk
I've heard of MASM. If I'm not mistaken, it has a lot of "high level" features that I don't see when I look at disassembled code. I'd like to program in something that is exactly like the code most disassemblers output, if that makes sense.
devoured elysium
That would basically be like writing op codes, which doesn't really make sense. Learning MASM32 will help you understand how code looks in a debugger. You may also like to check out OllyDbg: http://www.ollydbg.de/
Noon Silk
A lot of those "high-level" features are there for a reason. It's not easy writing in assembly language, so every little bit helps. You could also learn to write in machine code, emitting 32-bit ints into an EXE file instead of using assembly language and an assembler/linker, but there's really no reason to. MASM (or TASM, if you can find a copy of it) is a good place to start.
Michael Todd
My point is to learn to understand the code output by debuggers, which from what I understood is in the form of opcodes, right? Then I'd like to learn to program with them. As I said, I do not intend to write full programs in assembly, I just want to write a couple to get a feel for the language.
devoured elysium
You don't understand assembly. You need to understand it. An opcode is a number. Debuggers will attempt to resolve opcodes to their instructions (sometimes it's hard). You need to understand the basic instructions. Learning MASM will help you do this. No more needs to be said.
Noon Silk
You don't have to use all of the MASM features just because they're there; you can make things as hard to read as you want, if you think you'll learn more that way.
JasonTrue
+11  A: 

The assembly you would write by hand and the assembly generated by a compiler are often very different when viewed from a high level. Of course, the innards of the program will be very similar (there are only so many different ways to encode a = b + c, after all), but they're not the trouble when you're trying to reverse engineer something. The compiler will add a ton of boilerplate code to even simple executables: last time I compared, "Hello World" compiled by GCC was about 4kB, while if written by hand in assembly it's around 100 bytes. It's worse on Windows: last time I compared (admittedly, this was last century) the smallest "Hello World" I could get my Windows compiler of then-choice to generate was 52kB! Usually this boilerplate is only executed once, if at all, so it doesn't much affect program speed -- like I said above, the core of the program, the part where most execution time is spent, is usually pretty similar whether compiled or written by hand.

At the end of the day, this means that an expert assembly programmer and an expert disassembler are two different specialties. Commonly they're found in the same person, but they're really separate, and learning how to be an excellent assembly coder won't help you that much to learn reverse engineering.

What you want to do is grab the IA-32 and AMD64 (both are covered together) architecture manuals from Intel and AMD, and look through the early sections on instructions and opcodes. Maybe read a tutorial or two on assembly language, just to get the basics of assembly language down. Then grab a small sample program that you're interested in and disassemble it: step through its control flow and try to understand what it's doing. See if you can patch it to do something else. Then try again with another program, and repeat until you're comfortable enough to try to achieve a more useful goal. You might be interested in things like "crackmes", produced by the reverse engineering community, which are challenges for people interested in reverse engineering to try their hand at, and hopefully learn something along the way. They range in difficulty from basic (start here!) to impossible.

Above all, you just need to practice. As in many other disciplines, with reverse engineering, practice makes perfect... or at least better.

kquinn
I know that when you compile anything with a high level language, you will get a lot of "garbage" code that wouldn't be needed if it were coded directly in assembly. I also understand that there's a difference between an expert assembly programmer and an expert disassembler. But the same could be said about almost everything else.
devoured elysium
My concern is that while in theory I could read the papers and get a grasp of what they mean, until I start writing things myself I don't believe I'll truly understand it. You say I can start by changing small parts of code, but to do that I first must know what kind of assembly "flavour" IDA Pro, for example, uses.
devoured elysium
Also, what does MSVC++ use for the inline assembly code? MASM?
devoured elysium
+3  A: 

I found Hacking: The Art of Exploitation to be an interesting and useful way into this topic... can't say that I have ever used the knowledge directly, but that's really not why I read it. It gives you a much richer appreciation of the instructions that your code compiles to, which has occasionally been useful in understanding subtler bugs.

Don't be put off by the title. Most of the first part of the book is "Hacking" in the Eric Raymond sense of the word: creative, surprising, almost sneaky ways to solve tough problems. I (and maybe you) was a lot less interested in the security aspects.

mblackwell8
+2  A: 

I think you want to learn the ASCII-ized opcode mnemonics (and their parameters), which are output by a disassembler and which are understood by (can be used as input to) an assembler.

Any assembler (e.g. MASM) would do.

And/or it might be better for you to read a book about it (there have been books recommended on SO, I don't remember which).

ChrisW
+3  A: 

I started out learning MIPS, which is a very compact 32-bit architecture. It is a reduced instruction set, but that's what makes it easy to grasp for beginners. You will still be able to understand how assembly works without getting overwhelmed by complexity. You can even download a nice little IDE which will let you assemble and run your MIPS code. Once you get the hang of it, I think it will be much easier to move on to more complex architectures. At least that's what I thought :) At that point you will have the essential knowledge of memory allocation and management, logic flow, debugging, testing, etc.

Sergey
+4  A: 

I wouldn't focus on trying to write programs in assembly, at least not at first. If you're on x86 (which I assume you are, since you're using Windows), there are tons of weird special cases that it's kind of pointless to learn. For example, many instructions assume you're operating on a register that you don't explicitly name, and other instructions work on some registers but not others.

I would learn just enough about your intended architecture to understand the basics, then arm yourself with the Intel manuals and dive right into your compiler's output. Isolate the code of interest into a small function, so you can be sure to understand the entire thing.

I would consider the basics to be:

  • registers: how many are there, what are their names, and what are their sizes?
  • operand order: add eax, ebx means "Add ebx to eax and store the result in eax".
  • FPU: learn the basics of the floating-point stack and how you convert to/from fp.
  • addressing modes: [base + index * multiplier + displacement], where the multiplier (scale) can only be 1, 2, 4, or 8
  • calling conventions: how are parameters passed to a function?
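
To make the addressing-mode bullet concrete, here is a minimal sketch of how a 32-bit effective address is formed, assuming the usual base + index * multiplier + displacement form (8 is a legal multiplier too). The function name effective_address and the register values are made up for illustration; only the arithmetic is the point.

```python
def effective_address(regs, base=None, index=None, scale=1, disp=0):
    """Compute a 32-bit x86-style effective address: base + index*scale + disp."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4, or 8")
    addr = disp
    if base is not None:
        addr += regs[base]
    if index is not None:
        addr += regs[index] * scale
    return addr & 0xFFFFFFFF  # addresses wrap around at 32 bits


# Hypothetical register contents, chosen only for this example.
regs = {"eax": 0x1000, "ebx": 0x20, "esp": 0x0007FB20}

# mov eax, dword ptr [esp+4]      -> base=esp, disp=4
print(hex(effective_address(regs, base="esp", disp=4)))

# mov edx, dword ptr [eax+ebx*4+8] -> base=eax, index=ebx, scale=4, disp=8
print(hex(effective_address(regs, base="eax", index="ebx", scale=4, disp=8)))
```

The second form is the one compilers typically emit for array indexing: ebx is the element index, 4 is the element size, and 8 is a fixed offset into the structure.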

A lot of the time it will be surprising what the compiler emits. Make it a puzzle of figuring out why the heck the compiler thought this would be a good idea. It will teach you a lot.

It will probably also help to arm yourself with Agner Fog's manuals, especially the instruction listing one. It will tell you roughly how expensive each instruction is, though this is harder to directly quantify on modern processors. But it will help explain why, for example, the compiler goes so far out of its way to avoid issuing an idiv instruction.

My only other piece of advice is to always use Intel syntax instead of AT&T when you have a choice. I used to be pretty neutral on this point, until the day I realized that some instructions are totally different between the two (for example, movslq in AT&T syntax is movsxd in Intel syntax). Since the manuals are all written using Intel syntax, just stick with that.

Good luck!

Josh Haberman
+7  A: 

I have done this many times and continue to do this. In this case where your primary goal is reading and not writing assembler I feel this applies.

Write your own disassembler. Not for the purpose of making the next greatest disassembler; this one is strictly for you. The goal is to learn the instruction set, whether you are learning assembler on a new platform or remembering assembler for a platform you once knew. Start with only a few lines of code, adding registers for example, and by ping-ponging between disassembling the binary output and adding more and more complicated instructions on the input side you:

1) learn the instruction set for the specific processor

2) learn the nuances of how to write code in assembler for said processor such that you can wiggle every opcode bit in every instruction

3) learn the instruction set better than most engineers who use that instruction set to make their living

In your case there are a couple of problems. I normally recommend the ARM instruction set to start with; there are more ARM-based products shipped today than any other (x86 computers included). But it is unlikely that you are using ARM now and need just enough assembler to write startup code or other routines for it, so knowing ARM may or may not help with what you are trying to do. The second and more important reason for ARM first is that the instructions are fixed size and aligned. Disassembling variable-length instructions like x86 can be a nightmare as your first project, and the goal here is to learn the instruction set, not to create a research project. Third, ARM is a well-done instruction set: registers are created equal and don't have individual special nuances.

So you will have to figure out what processor you want to start with. I suggest the msp430 or ARM first, ARM first or second, and then the chaos of x86. No matter what platform, any platform worth using has data sheets or programmer's reference manuals free from the vendor that include the instruction set as well as the encoding of the opcodes (the bits and bytes of the machine language). For the purpose of learning what the compiler does, and how to write code that the compiler doesn't have to struggle with, it is good to know a few instruction sets and see how the same high-level code is implemented on each instruction set with each compiler at each optimization setting. You don't want to get into optimizing your code only to find that you have made it better for one compiler/platform but much worse for every other.

Oh, about disassembling variable-length instruction sets: you cannot simply start at the beginning and disassemble every four-byte word linearly through memory as you would with the ARM, or every two bytes like the msp430 (the msp430 has variable-length instructions, but you can still get by going linearly through memory if you start at the entry points from the interrupt vector table). For variable length you want to find an entry point based on a vector table or knowledge about how the processor boots, and follow the code in execution order. You have to decode each instruction completely to know how many bytes are used; then, if the instruction is not an unconditional branch, assume the next byte after that instruction is another instruction. You have to store all possible branch addresses as well and assume those are the starting byte addresses for more instructions.

The one time I was successful I made several passes through the binary. Starting at the entry point I marked that byte as the start of an instruction, then decoded linearly through memory until hitting an unconditional branch. All branch targets were tagged as starting addresses of an instruction. I made multiple passes through the binary until I found no new branch targets. If at any time you find, say, a 3-byte instruction but for some reason you have tagged the second byte as the beginning of an instruction, you have a problem. If the code was generated by a high-level compiler this shouldn't happen unless the compiler is doing something evil; if the code has hand-written assembler (like, say, an old arcade game) it is quite possible that there will be conditional branches that can never happen, like r0=0 followed by a jump if not zero. You may have to hand-edit those out of the binary to continue. For your immediate goals, which I assume will be on x86, I don't think you will have a problem.
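
The "follow the code in execution order" approach described above can be sketched in a few lines of Python. Everything about the instruction set here is hypothetical: the five opcodes, their lengths and the absolute 8-bit branch targets are invented so the example stays short, and a real x86 decoder is far more involved. A worklist stands in for the repeated passes, but the idea is the same: decode from each known start until an unconditional branch, queue every branch target, and complain if a start lands inside an already decoded instruction.

```python
# Hypothetical toy instruction set, invented for this sketch:
# opcode -> (mnemonic, total length in bytes, kind)
OPCODES = {
    0x01: ("nop", 1, "plain"),
    0x02: ("set", 3, "plain"),    # set reg, imm8
    0x10: ("jmp", 2, "jump"),     # unconditional, absolute 8-bit target
    0x11: ("jnz", 2, "branch"),   # conditional, absolute 8-bit target
    0xFF: ("halt", 1, "stop"),
}


def disassemble(code, entry=0):
    """Follow the code in execution order; return {address: text}."""
    starts = {}        # address of each decoded instruction -> its text
    owner = {}         # every byte inside an instruction -> that instruction's start
    pending = [entry]  # addresses known to begin an instruction
    while pending:
        pc = pending.pop()
        while pc not in starts:
            if pc in owner:
                raise ValueError(
                    f"{pc:#04x} falls inside the instruction at {owner[pc]:#04x}")
            if pc >= len(code) or code[pc] not in OPCODES:
                raise ValueError(f"cannot decode byte at {pc:#04x}")
            name, length, kind = OPCODES[code[pc]]
            operands = code[pc + 1:pc + length]
            if len(operands) != length - 1:
                raise ValueError(f"truncated instruction at {pc:#04x}")
            for addr in range(pc, pc + length):
                if addr in owner:
                    raise ValueError(
                        f"instruction at {pc:#04x} overlaps the one at {owner[addr]:#04x}")
                owner[addr] = pc
            args = ", ".join(f"{b:#04x}" for b in operands)
            starts[pc] = f"{name} {args}".strip()
            if kind in ("jump", "branch"):
                pending.append(operands[0])   # branch target starts an instruction
            if kind in ("jump", "stop"):
                break                         # nothing is known to follow
            pc += length                      # fall through to the next instruction
    return dict(sorted(starts.items()))


# 0x00 set, 0x03 jnz 0x08, 0x05 jmp 0x0a, 0x07 data byte, 0x08 nop, 0x09 halt, 0x0a halt
program = bytes.fromhex("02 00 05 11 08 10 0a 2a 01 ff ff")
for address, text in disassemble(program).items():
    print(f"{address:#04x}: {text}")
```

The byte at 0x07 is data that no path reaches, so it never shows up in the listing; a purely linear sweep would have tried to decode it as an instruction.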

I recommend the gcc tools; mingw32 is an easy way to use gcc tools on Windows if x86 is your target. If not, mingw32 plus msys is an excellent platform for generating a cross compiler from binutils and gcc sources (generally pretty easy). mingw32 has some advantages over cygwin, like significantly faster programs, and you avoid the cygwin dll hell. gcc and binutils will allow you to write in C or assembler and disassemble your code, and there are more web pages than you can read showing you how to do any one or all of the three. If you are going to be doing this with a variable-length instruction set I highly recommend you use a tool set that includes a disassembler. A third-party disassembler for x86, for example, is going to be a challenge to use, as you never really know if it has disassembled correctly. Some of this is operating system dependent too; the goal is to compile the modules to a binary format that contains information separating instructions from data so the disassembler can do a more accurate job. Your other choice for this primary goal is to have a tool that can compile directly to assembler for your inspection, then hope that when it compiles to a binary format it creates the same instructions.

The short (okay, slightly shortER) answer to your question: write a disassembler to learn an instruction set. I would start with something RISCy and easy to learn like ARM. Once you know one instruction set, others become much easier to pick up, often in a few hours; by the third instruction set you can start writing code almost immediately using the datasheet/reference manual for the syntax. All processors worth using have a datasheet or reference manual that describes the instructions down to the bits and bytes of the opcodes. Learn a RISC processor like ARM and a CISC like x86 enough to get a feel for the differences: things like having to go through registers for everything versus being able to perform operations directly on memory with fewer or no registers, three-operand instructions versus two, etc. As you tune your high-level code, compile for more than one processor and compare the output. The most important thing you will learn is that no matter how well the high-level code is written, the quality of the compiler and the optimization choices made make a huge difference in the actual instructions. I recommend llvm and gcc (with binutils); neither produces great code, but they are multi-platform and multi-target and both have optimizers. Both are free, and you can easily build cross compilers from sources for various target processors.

dwelch
Thanks for the reply. But I don't even know how to write a disassembler.
devoured elysium
"Write your own disassembler" - I agree, it's how I learned it best. (What's up with "But I don't even know how to write a disassembler"?) LOL.
slashmais
+5  A: 

I'll go against the grain of most answers and recommend Knuth's MMIX variant of the MIPS RISC architecture. It won't be as practically useful as x86 or ARM assembly languages (not that they're all that crucial themselves in most real-life jobs these days...;-), but it WILL unlock for you the magic of Knuth's latest version of the greatest-ever masterpiece on deep low-level understanding of algorithms and data structures -- TAOCP, "The Art of Computer Programming". The links from the two URLs I've quoted are a great way to start exploring this possibility!

Alex Martelli
+3  A: 
Nick D
I had to hit ctrl-c, before I could enter "g."
ericp
Nick D
+2  A: 

Art of Assembly Language - have fun ;)

oh, and here

Phil
+1  A: 

Are you doing other dev work on Windows? In which IDE? If it's VS, then there's no need for an additional IDE just to read disassembled code: debug your app (or attach to an external app), then open the disassembly window (in the default settings, that's Alt+8). Step and watch memory/registers as you would through normal code. You might also want to keep a registers window open (Alt+5 by default).

Intel provides free manuals that give both a survey of the basic architecture (registers, processor units etc.) and a full instruction reference. As the architecture matures and gets more complex, the 'basic architecture' manuals grow less and less readable. If you can get your hands on an older version, you'd probably have a better place to start (even the P3 manuals; they explain the same basic execution environment better).

If you care to invest in a book, here is a nice introductory text. Search Amazon for 'x86' and you'd get many others. You can get several other directions from another question here.

Finally, you can benefit quite a bit from reading some low-level blogs. These byte-size info bits work best for me, personally.

Ofek Shilon
+2  A: 

This will not necessarily help you write efficient code!

x86 opcodes are more or less a "legacy" format that persists because of the sheer volume of code and executable binaries for Windows and Linux out there.

It's a bit like the old scholars writing in Latin: an Italian speaker like Galileo would write in Latin and his paper could be understood by a Polish speaker like Copernicus. This was still the most effective way to communicate even though neither was particularly good at Latin, and Latin is a rubbish language for expressing mathematical ideas.

So compilers generate x86 code by default, and modern chips read the ancient opcodes and translate what they see into parallel RISC instructions, with reordered execution, speculative execution, pipelining etc., plus they make full use of the 32 or 64 registers the processor actually has (as opposed to the pathetic 8 you see in x86 instructions).

Now all optimising compilers know this is what really happens, so they code up sequences of opcodes which they know the chip can optimise efficiently -- even though some of these sequences would look inefficient to a circa-1990 .asm programmer.

At some point you need to accept that the tens of thousands of man-years of effort compiler writers have put in have paid off, and trust them.

The simplest and easiest way to get a more efficient runtime is to buy the Intel C/C++ compiler. They have a niche market for efficient compilers, and they have the advantage of being able to ask the chip designers about what goes on inside.

James Anderson
+2  A: 

To do what you're wanting to do, I just took the Intel Instruction Set Reference (might not be the exact one I used, but it looks sufficient) and some simple programs I wrote in Visual Studio and started throwing them into IDA Pro/WinDbg. When I outgrew my own programs, the software at crackmes was helpful.

I'm assuming that you have some basic understanding of how programs execute on Windows. But really, for reading assembly, there are only a few instructions to learn and a few flavors of those instructions (e.g., there's a jump instruction, and jump has a few flavors like jump-if-equal, jump-if-ecx-is-zero, etc). Once you learn the basic instructions it's pretty simple to get the gist of the program execution. IDA's graph view helps, and if you're tracing the program with WinDbg, it's pretty simple to figure out what the instructions are doing if you're not sure.

After a bit of playing like that, I bought Hacker Disassembly Uncovered. Generally, I stay away from books with the word "Hacker" in the title, but I really liked how this one went really in-depth about how compiled code looked disassembled. He also goes into compiler optimizations and some efficiency stuff that was interesting.

It all really depends on how deeply you want to be able to understand the program, too. If you're reverse engineering a target looking for vulnerabilities, if you're writing exploit code, or analyzing packed malware for capabilities, you'll need more of a ramp-up time to really get things going (especially for the more advanced malware). On the other hand, if you just want to be able to change your character's level on your favorite video game, you should be doing fine in a relatively short amount of time.

mrduclaw
A: 

Lots of good answers here. Low-level programming, assembly etc are popular in the security community, so it is worthwhile looking for hints and tips there once you get going. They even have some good tutorials like this one on x86 assembly.

BrianLy
A: 

To actually reach your goal, you might consider starting with the IDE you are in. There generally is a disassembly window, so you can single-step through code. There is usually a view of some sort to let you see the registers and look into memory areas.

Examining unoptimized C/C++ code will help you build a link to the kind of code that the compiler generates for your sources. Some compilers have some sort of ASM reserved word which lets you insert machine instructions in your code.

My advice would be to play around with those sorts of tools for a while and get your feet wet, then step up? down? to straight assembler code on whatever platform you are running on.

There are a lot of great tools out there, but you might find it more fun, to avoid the steep learning curve at first.

EvilTeach
A: 

We learned assembly with a microcontroller development kit (Motorola HC12) and a thick datasheet.

espais
+2  A: 
Frank V
+3  A: 

The suggestion to use debug is a fun one; many neat tricks can be done with that. However, for a modern operating system, learning 16-bit assembly may be slightly less useful. Consider, instead, using ntsd.exe. It's built into Windows XP (it was yanked in Server 2003 and above, unfortunately), which makes it a convenient tool to learn since it's so widely available.

That said, the original version in XP suffers from a number of bugs. If you really want to use it (or cdb, or windbg, which are essentially different interfaces with the same command syntax and debugging back-end), you should install the free Debugging Tools for Windows package.

The debugger.chm file included in that package is especially useful when trying to figure out the unusual syntax.

The great thing about ntsd is that you can pop it up on any XP machine you're near and use it to assemble or disassemble. It makes a /great/ x86 assembly learning tool. For example (using cdb since it's inline in the DOS prompt; it's otherwise identical):

(symbol errors skipped since they're irrelevant -- also, I hope this formatting works, this is my first post)

C:\Documents and Settings\User>cdb calc

Microsoft (R) Windows Debugger Version 6.10.0003.233 X86
Copyright (c) Microsoft Corporation. All rights reserved.

CommandLine: calc
Symbol search path is: *** Invalid ***
Executable search path is:
ModLoad: 01000000 0101f000   calc.exe
ModLoad: 7c900000 7c9b2000   ntdll.dll
ModLoad: 7c800000 7c8f6000   C:\WINDOWS\system32\kernel32.dll
ModLoad: 7c9c0000 7d1d7000   C:\WINDOWS\system32\SHELL32.dll
ModLoad: 77dd0000 77e6b000   C:\WINDOWS\system32\ADVAPI32.dll
ModLoad: 77e70000 77f02000   C:\WINDOWS\system32\RPCRT4.dll
ModLoad: 77fe0000 77ff1000   C:\WINDOWS\system32\Secur32.dll
ModLoad: 77f10000 77f59000   C:\WINDOWS\system32\GDI32.dll
ModLoad: 7e410000 7e4a1000   C:\WINDOWS\system32\USER32.dll
ModLoad: 77c10000 77c68000   C:\WINDOWS\system32\msvcrt.dll
ModLoad: 77f60000 77fd6000   C:\WINDOWS\system32\SHLWAPI.dll
(f2c.208): Break instruction exception - code 80000003 (first chance)
eax=001a1eb4 ebx=7ffd6000 ecx=00000007 edx=00000080 esi=001a1f48 edi=001a1eb4
eip=7c90120e esp=0007fb20 ebp=0007fc94 iopl=0         nv up ei pl nz na po nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000202
ntdll!DbgBreakPoint:
7c90120e cc              int     3
0:000> r eax
eax=001a1eb4
0:000> r eax=0
0:000> a eip
7c90120e add eax,0x100
7c901213
0:000> u eip
ntdll!DbgBreakPoint:
7c90120e 0500010000      add     eax,100h
7c901213 c3              ret
7c901214 8bff            mov     edi,edi
7c901216 8b442404        mov     eax,dword ptr [esp+4]
7c90121a cc              int     3
7c90121b c20400          ret     4
ntdll!NtCurrentTeb:
7c90121e 64a118000000    mov     eax,dword ptr fs:[00000018h]
7c901224 c3              ret
0:000> t
eax=00000100 ebx=7ffd6000 ecx=00000007 edx=00000080 esi=001a1f48 edi=001a1eb4
eip=7c901213 esp=0007fb20 ebp=0007fc94 iopl=0         nv up ei pl nz na pe nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000206
ntdll!DbgUserBreakPoint+0x1:
7c901213 c3              ret
0:000>
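
The bytes and flag values in that session can be checked by hand, and a short Python sketch makes the check explicit. It assumes what the Intel manuals document: opcode 05 is "add eax, imm32" with the immediate stored little-endian, and the EFLAGS bit positions are as listed in the table. The function names are made up for this example, and the decoder handles only this one instruction form.

```python
import struct

# EFLAGS bit positions as documented in the Intel manuals (status/control subset).
FLAG_BITS = {0: "CF", 2: "PF", 4: "AF", 6: "ZF", 7: "SF",
             8: "TF", 9: "IF", 10: "DF", 11: "OF"}


def decode_add_eax_imm32(raw):
    """Decode the 5-byte form 05 id: add eax, imm32 (immediate is little-endian)."""
    if len(raw) != 5 or raw[0] != 0x05:
        raise ValueError("not an 'add eax, imm32' encoding")
    (imm,) = struct.unpack("<I", raw[1:])
    return f"add eax,{imm:X}h"


def flags_set(efl):
    """Return the names of the flags that are set in an EFLAGS value."""
    return [name for bit, name in sorted(FLAG_BITS.items()) if (efl >> bit) & 1]


def parity_even(value):
    """PF is set when the low byte of the result has an even number of 1 bits."""
    return bin(value & 0xFF).count("1") % 2 == 0


print(decode_add_eax_imm32(bytes.fromhex("0500010000")))  # bytes shown by 'u eip'
print(flags_set(0x00000202))   # efl before the 't' step
print(flags_set(0x00000206))   # efl after the 't' step
print(parity_even(0x00000100)) # eax after the add
```

The only flag that changes across the step is PF, which matches the "po" to "pe" change in the register dump: the low byte of 0x100 is zero, so it contains an even number of set bits.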

Also -- while you're playing with IDA, make sure to check out the IDA Pro Book by Chris Eagle (unlinked since StackOverflow doesn't want to let me post more than two links for my first post). It's hands-down the best reference out there.

Jordan
+1 for Chris Eagle's book. Gotta put some love in there for the Sk3wl of r00t ;)
mrduclaw
+1  A: 

One of the standard pedagogic assembly languages out there is MIPS. You can get MIPS simulators (SPIM) and various teaching materials for it.

Personally, I'm not a fan. I rather like IA32.

Paul Nathan
MIPS is nice. 68000 is, too, and if you learn 68000 you can write binaries that run in MAME. :-)
Nosredna
A: 

Off topic I know, but since you are a Windows programmer I can't help but think that it may be a more appropriate and/or better use of your time to learn MSIL. No, it's not assembly, but it's probably more relevant in this .NET era.

slf
A: 

Knowing assembly can be useful for debugging, but I wouldn't get too excited about using it for optimizing your code. Modern compilers are usually much better at optimizing than a human these days.

Adam Pierce
Hmm. You can still wring out quite a bit extra coding assembly yourself, but it takes more work to beat the compiler than it used to.
Nosredna
A: 

My personal favorite is NASM, mostly because it's multi-platform, and it compiles MMX, SSE, 64-bit...

I started by compiling some simple C source files with gcc and "trans-coding" the assembler instructions from gcc format into NASM format. Then you can change small portions of the code and verify the performance improvement it brings.

The NASM documentation is really complete, I never needed to search for information from books, or other sources.

G B
+1  A: 

Some links you might find useful for learning the mapping between assembly and source code:

Assembly And The Art Of Debugging

Debugging – Modifying Code At Runtime

Hope you find these useful.

tc