views:

278

answers:

5

Every day we read through large amounts of poorly documented code.

Understanding the code by jumping into it might not be easy for everyone until they have an idea about the code structure.

If you have to reverse engineer code and want to bring it into some diagrammatic form, what steps do you follow? I know there are tools like Doxygen, but such a tool would only cover one step of the reverse engineering process.

Is there a nice approach, like x, then y, and then z, tada?

Can you tell me a process to zoom out of the code maze to a level where we get a high-level view of the system and/or subsystem?

A: 

I use a "poke from the top, poke from the bottom till I reach the middle" approach.

I usually try to get some information about what the code or system is intended to do. More often than not, the original developer is not available, but talking with users and asking them what the system is all about usually provides a good starting point.

Next I try to get an overview about the technology stack used:

  1. Target Platform, application type (library, web, rich-client etc.)
  2. Referenced Libraries (OSS/Third-Party, what do they do)
  3. Infrastructure Code (usually reveals one of the dominant concepts of the domain)

This is what I do to get an overview of the application foundation. Next I try to look at the public API (or GUI if relevant), looking for classes often referenced/used, revealing the points of connection in the system.

NDepend is particularly helpful for finding out about coupling between classes, but as far as I know it is only available for .NET.
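Where no such tool is available, a rough version of the "often referenced" check can be scripted. The following is a minimal sketch, assuming a Python code base where each module's source text is already in hand; `import_fan_in` and the `example` modules are hypothetical names made up for illustration, and counting imports is only a crude stand-in for real coupling analysis.

```python
import ast
from collections import Counter


def import_fan_in(sources):
    """Count how many modules import each module name.

    `sources` maps a module name to its source text (an assumed input shape).
    """
    counts = Counter()
    for module, text in sources.items():
        imported = set()
        for node in ast.walk(ast.parse(text)):
            if isinstance(node, ast.Import):
                imported.update(alias.name for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                imported.add(node.module)
        imported.discard(module)  # ignore self-references
        counts.update(imported)   # each importer counts once per target
    return counts


# Made-up modules, for illustration only.
example = {
    "ui": "import core\nimport storage\n",
    "reports": "from core import totals\n",
    "core": "import storage\n",
    "storage": "import json\n",
}
fan_in = import_fan_in(example)
for name, count in fan_in.most_common():
    print(name, count)
```

Modules with the highest counts are the ones many others depend on, which makes them good candidates to read first.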

If real reverse engineering (without access to the source code) is what you need to do, there are three things to learn:

  1. Learn assembly language
  2. Learn how to use WinDbg and a disassembler
  3. Learn to read assembly :-) (learn to tell code from data)

For .NET and Java, learning the intermediate language (IL or bytecode) is better than going down to machine language. Tools like Reflector can help with reverse engineering here, but they are not very helpful if the code is obfuscated.

Johannes Rudolph
Is someone watering this post? It is sprouting roots! ;) +1
Aiden Bell
+2  A: 

One approach you can try is to use one of the UML tools that support your particular programming language and can do "reverse engineering": basically, they produce basic UML elements such as classes and their relationships, which might be easier to visualize.

One tool I might recommend is http://www.magicdraw.com/ - it has different editions for different languages.

Aleksei Potov
+1  A: 

First of all, take small steps.

I do .NET development, so this will be specific to Visual Studio. It also assumes you know next to nothing about the code and that it is a form-based app.

  1. Figure out what section you want to learn.
  2. Start the app with the debugger.
  3. Run to the section you want to learn.
  4. Pause the debugger.
  5. Step Into the app (F11).
  6. Press the button / menu item to start your section (and hopefully no one overrode Paint or MouseOver or something like them).
  7. Start documenting...
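The steps above are specific to the Visual Studio debugger. As a loose analogue outside that environment, the same "trigger one action and watch what runs" idea can be sketched in Python with `sys.setprofile`. This is only an illustration under that assumption; `record_calls`, `on_click`, `load`, and `total` are hypothetical names, not part of any real application.

```python
import sys


def record_calls(func, *args, **kwargs):
    """Run func and return the names of Python functions entered, in order."""
    seen = []

    def profiler(frame, event, arg):
        # "call" fires for Python-level functions; C functions report "c_call".
        if event == "call":
            seen.append(frame.f_code.co_name)

    sys.setprofile(profiler)
    try:
        func(*args, **kwargs)
    finally:
        sys.setprofile(None)
    return seen


# A made-up "button handler" and the helpers it reaches.
def load():
    return [1, 2, 3]


def total(items):
    return sum(items)


def on_click():
    return total(load())


trace = record_calls(on_click)
print(trace)
```

The recorded list is a starting point for the "start documenting" step: it shows which functions a single user action actually reaches.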
Austin Salonen
+2  A: 

Firstly, I hope you at least have symbols and a debugger, or the source code.

You will need the following:

  1. A large whiteboard
  2. A lot of coke/redbull/coffee
  3. Sleeping Bag
  4. Friends

How much of each you need depends on the size of the source.

I look at the code as if it were a library: which portions of the API are 'top level' (they may be tied to the GUI or be the first entered on the call stack)?

Then I map those calls and the calls they have in common, just on a whiteboard using lines. Then I guess at what each function does.

Keep revising and restructuring your whiteboard as you learn what does what and why. Eventually you will have something to put on paper and get really down and dirty with.
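If the source is available, a first draft of that whiteboard can be generated. Here is a minimal sketch, assuming Python source and only direct calls by plain name (method calls and dynamic dispatch are ignored); `call_graph`, `top_level`, and the `sample` program are hypothetical and exist only for illustration.

```python
import ast


def call_graph(source):
    """Map each function defined in `source` to the defined functions it calls."""
    tree = ast.parse(source)
    functions = [n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]
    graph = {func.name: set() for func in functions}
    for func in functions:
        for node in ast.walk(func):
            # Only plain-name calls such as load(); obj.method() is skipped.
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                if node.func.id in graph:
                    graph[func.name].add(node.func.id)
    return graph


def top_level(graph):
    """Functions that nothing else in the graph calls: likely entry points."""
    called = set().union(*graph.values())
    return sorted(set(graph) - called)


sample = '''
def main():
    data = load()
    render(summarize(data))

def load():
    return parse(open_file())

def open_file():
    return "1,2,3"

def parse(text):
    return text.split(",")

def summarize(items):
    return len(items)

def render(value):
    print(value)
'''

graph = call_graph(sample)
entry_points = top_level(graph)
for name in sorted(graph):
    print(name, "->", sorted(graph[name]))
print("top level:", entry_points)
```

The output is just the lines you would otherwise draw by hand; guessing what each function does is still up to you.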

Aiden Bell
+1  A: 

In situations where I've had to do this, I like to look at the entry/exit points and label those first. Then, from there, I look for common underlying functions/methods supporting the public items.

If there is sparse or no documentation, it can be useful to think of this as breaking into a building.

  1. Is it a large building with lots of entrances/exits?
    Is this an application or component that spans multiple processes or subsumes an entire server?

  2. What powers the building?
    What frameworks are visible that were used to construct it?

  3. Is there public access, say on a ground floor?
    What public interfaces or interactions are possible?

From here, I would label all the public methods, structures, and classes available. This might entail labeling everything, though, if you are in a language or environment where everything is public-access. Particularly interesting items are those with paired prefixes such as start/stop, begin/end, or pause/resume. These usually hint at high-powered control of specific items. Conventions like Pascal casing or m_ markers for member variables can hint at internal operations here too.
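Spotting those paired names is easy to automate once you have a list of public identifiers. A minimal sketch follows, assuming the names have already been extracted somehow; `paired_controls` and the sample names are hypothetical, and only the three prefix pairs mentioned above are checked.

```python
# Prefix pairs taken from the text above; extend as needed.
PAIRS = [("start", "stop"), ("begin", "end"), ("pause", "resume")]


def paired_controls(names):
    """Return (first, second) pairs such as ('StartCapture', 'StopCapture')."""
    by_lower = {name.lower(): name for name in names}
    found = []
    for first, second in PAIRS:
        for lower, original in by_lower.items():
            if lower.startswith(first):
                # Swap the prefix and look for the matching partner name.
                partner = second + lower[len(first):]
                if partner in by_lower:
                    found.append((original, by_lower[partner]))
    return sorted(found)


# Made-up identifiers, for illustration only.
names = ["StartCapture", "StopCapture", "BeginUpdate", "EndUpdate",
         "PauseTimer", "Render", "m_buffer"]
pairs = paired_controls(names)
print(pairs)
```

Here `PauseTimer` is not reported because no `ResumeTimer` exists, which is itself a detail worth noting on the whiteboard.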

Continuing, it may be important to know about its file formats, communication activity (network? pipes? etc?), and security. Each of these can be broken down with similar analogies if you'd like. Rather than tire everyone with metaphors, here are some outlines.

File Formats

  1. Is text visible in files generated by the component? Does it resemble any popular format for text or text-transport? (XML, JSON, HTML, SGML, CSV, Tab-Delimited, etc.) This may reveal the intended recipient or source of communications with the app as well.
  2. Is there an offset or size stored in the first 4 bytes of the file? First 8 bytes? First 4 bytes after the very first 4? This can hint at binary layout or raw packets.
  3. Is there a well-known four-character marker present in the file? Media types frequently have FOURCC codes to label content. Container formats such as AVIs will frequently contain multiple instances and embedded items. Also note if there are consistently named items, if you have the source.
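These file-format questions can be asked of a file mechanically. The sketch below is a rough heuristic, not a format detector: `sniff` is a hypothetical helper, and the sample bytes are hand-built to resemble a RIFF container (a four-character tag followed by a little-endian size), which is the layout AVI files use.

```python
import struct


def sniff(data):
    """Return rough hints about the first bytes of a file."""
    sample = data[:64]
    # Printable ASCII plus tab, LF and CR suggests a text format.
    text = bool(sample) and all(32 <= b < 127 or b in (9, 10, 13) for b in sample)

    tag = data[:4]
    is_tag = len(tag) == 4 and all(
        b < 128 and (chr(b).isalnum() or b == 0x20) for b in tag
    )
    fourcc = tag.decode("ascii") if is_tag else None

    # Does a 4-byte integer at offset 0 or 4 equal the number of bytes after it?
    size_fields = []
    for offset in (0, 4):
        if len(data) < offset + 4:
            continue
        remaining = len(data) - (offset + 4)
        for label, fmt in (("little", "<I"), ("big", ">I")):
            (value,) = struct.unpack_from(fmt, data, offset)
            if value == remaining:
                size_fields.append((offset, label))
    return {"text": text, "fourcc": fourcc, "size_fields": size_fields}


# Hand-built sample: tag, size of what follows, form type, 8 bytes of payload.
blob = b"RIFF" + struct.pack("<I", 12) + b"AVI " + bytes(8)
info = sniff(blob)
print(info)
print(sniff(b'{"a": 1}\n'))
```

A hit in `size_fields` is only a hint; small files can match by coincidence, so check several files before trusting it.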

Communications

  1. Do new socket connections show up (netstat -a on Windows) when the component is active?
  2. Are there sockets that are now listening as a result? This hints at a server/recipient component.
  3. Are there outbound connections? Where do the outbound connections go? What server ports do they attempt to contact? HTTP tends to be the most interesting item you may find here.
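One way to answer these questions is to capture the netstat output before and after starting the component and compare the two. The following is a minimal sketch, assuming the four-column TCP layout that Windows prints for `netstat -an`; other platforms use different columns and state names, `parse_netstat` and `new_activity` are hypothetical helpers, and the captured text is invented sample data.

```python
def parse_netstat(text):
    """Collect (proto, local, remote, state) rows for TCP lines."""
    rows = set()
    for line in text.splitlines():
        parts = line.split()
        # Headers and UDP lines do not have exactly four fields starting with TCP.
        if len(parts) == 4 and parts[0] == "TCP":
            rows.add(tuple(parts))
    return rows


def new_activity(before, after):
    """Return (new listening endpoints, remote ends of new connections)."""
    added = parse_netstat(after) - parse_netstat(before)
    listening = sorted(local for _, local, _, state in added if state == "LISTENING")
    # ESTABLISHED rows may be inbound or outbound; the remote port is the clue.
    established = sorted(remote for _, _, remote, state in added if state == "ESTABLISHED")
    return listening, established


# Invented sample captures, for illustration only.
before = """
Active Connections

  Proto  Local Address          Foreign Address        State
  TCP    0.0.0.0:135            0.0.0.0:0              LISTENING
"""
after = before + """\
  TCP    0.0.0.0:8080           0.0.0.0:0              LISTENING
  TCP    192.0.2.10:50123       203.0.113.7:80         ESTABLISHED
"""

listening, established = new_activity(before, after)
print("now listening:", listening)
print("connected to:", established)
```

In this sample the component both opens a listening port and contacts a remote port 80, which would suggest looking for HTTP traffic next.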

Security & Operation Access

  1. Does the component fail when run as the lowest-privileged user available? Does it require elevated permissions to run?
  2. Does the component read or write to the registry, user profile, or any temp directories? Does it attempt to access locations for shared users?
  3. Does the component generate exceptions? What kind of exceptions (.NET managed? Native? POSIX signals?) It can be very useful to locate specific exception declarations if the source code is written in a language/framework that allows them.

Finally, if none of this is able to increase your working knowledge of the code/library in question, consider the motive behind the creation or use of the code/library.

  1. What problems does it attempt to solve?
  2. What usage benefits might be gained by using it?
  3. Is this component actually intended to operate as part of a larger component or system?

These questions can lead to clues on what entry/exit points are the most interesting.

meklarian