I've been a programmer for years now and I feel very comfortable in a handful of languages, especially C.

What I find baffling is that while I have no problem reading code from textbooks and example sites, no matter what the code is doing, I find that when I check out an open source project and look at the code, it almost looks like a foreign language to me.

I get flustered quickly. I have no idea what data structures are important, or even where the main() function is for a lot of these projects (or ones that I'd expect to have one). It's almost embarrassing that I can't figure out anything that's going on in popular open source projects. I've looked at the code from projects such as Bash, Wget, Linux Kernel (this proved futile quickly), and Python, just to figure out the flow of the code, to no avail.

My question to you (more experienced) programmers is:

How do you learn other people's code? What practices are good in understanding what's going on in a large project? Are there specific conventions that most projects follow?

+1  A: 

Reading static code is helpful.
Reading how the project evolves is the best though. You can do that by looking at SVN diffs and the issue tracker to see how the issues get solved.

cherouvim
+2  A: 

It's usually hard to understand other people's code on a "first reading". I usually find the specific part of the code that I'm interested in and study only that one small part, along with everything it depends on.

You can't really read, for example, OpenOffice's sources and instantly understand everything.

Darth
+6  A: 

Sometimes compiling it and then running it in a debugger, stepping through as you go, can help you figure out the general flow of the program.

Running it in a profiler or code coverage tool would also give you some interesting background information, such as which functions are used most.
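
To make the profiler idea concrete, here is a minimal sketch using Python's standard cProfile and pstats modules. The fib function and the profile_report helper are made up for illustration; for a C project you would use whatever profiler your toolchain provides instead.

```python
import cProfile
import io
import pstats


def fib(n):
    """Deliberately naive recursion so one function dominates the call counts."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)


def profile_report(func, *args, limit=5):
    """Run func under the profiler; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    # Sort by call count so the most frequently called functions come first.
    pstats.Stats(profiler, stream=buffer).sort_stats("calls").print_stats(limit)
    return result, buffer.getvalue()


if __name__ == "__main__":
    value, report = profile_report(fib, 10)
    print(value)
    print(report)
```

The functions at the top of such a report are usually a good place to start reading.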

Eric Petroelje
+1  A: 

Refactoring it as you read it may also prove useful.

Manrico Corazzi
+28  A: 

Being able to grasp existing code is, in my view, one of the greatest challenges of the software engineering research field. There is a lot of interesting work in reverse engineering, but no real practical way to help your average programmer understand what is going on. I am also not familiar with any commercial tools that are as successful and popular as tools we have for other tasks like source-control or custom builds.

When most people come out of college, they only know how to write programs from scratch. They find it hard to start contributing to existing systems, and most of them have an aversion to it (which often causes the "not invented here syndrome").

I don't think that there is an organized way to learn to do it beyond experience. I can only give you one tip: don't start with the code itself. Try to understand the components and their interfaces before you read any code.

For example, if you are looking at a system built in Java, try to see what the packages are. Then try to see which packages depend on one another. Try to build yourself mentally (or physically) a graph of the dependencies. Being able to have this topological order of examining the system helps. If you are doing this in C, you can do the same for headers, and hopefully someone arranged things by directories. Then try to do the same with specific classes and functions within these packages. Try to think of the whole system as interconnected APIs.
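
As a rough sketch of that dependency graph: once you have written down which package depends on which, the Python standard library (graphlib, Python 3.9+) can give you a reading order in which every package comes after the ones it depends on. The package names below are invented for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical package layout: each key depends on the packages in its set.
DEPENDENCIES = {
    "app.ui": {"app.core", "app.util"},
    "app.net": {"app.core"},
    "app.core": {"app.util"},
    "app.util": set(),
}


def reading_order(dependencies):
    """Return the packages so that each one appears after everything it depends on."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for package in reading_order(DEPENDENCIES):
        print(package)
```

Reading in this order means that by the time you open a package, you have already seen the interfaces it relies on.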

Try to ignore any code that is not a call to some public "API" functionality. Looking at the actual code too early is distracting, since you can't understand the grand scheme of things while looking at this specific level.

The important thing to remember is that there is nothing about real production code in any specific function that makes it magical or different than any other code (except that it might have been fixed or cleaned up more). It is in how the big picture is organized and architected that the true knowledge, skills, and quality of the project lie.

Uri
+1 Uri, you enlightened me. Thanks a lot.
Joset
You make some very valid points. +1
Yuval A
I know a lot of people who came out of college knowing more about how to modify other people's code than how to write their own from scratch... after all... they have been practicing a lot while copying and modifying homework all those years :P
luvieere
+1  A: 

Here is the thing. When reading a book, you have a step-by-step idea of what the author is trying to do. When you look at open source projects, you have an idea of what the project is, but not a clear step-by-step picture of what the developer(s) are trying to do. If you try to understand the entire app via the source code, then you are definitely going to confuse yourself.

I usually download the source (if it's an executable program or web app), compile it and then run it. Then I look for interesting functionality and dissect the code for that functionality.

Oh yes...the debugger is your friend.

Saif Khan
+1  A: 

Pick simpler projects to start with. Bash and Linux (I never checked wget) are big projects with a history, and that inevitably raises the bar for getting into their code.

Another factor is the tools you're using. It's easier with Java than with C, but an IDE (or other tool) that can generate class diagrams and show call graphs, module dependencies and outlines of a class helps a lot.

Breakpoints in stuff you don't understand usually help figure out the last tricky points. Getting them to trigger might be a challenge on its own!

ptyx
+1 for the breakpoint caveat. Sometimes it's hard enough to figure out how to trigger a breakpoint in code I've been working with for years.
Joe White
+2  A: 

By the way, when you do read code, it is very important to be wary of the fact that just because the code looks intuitive or understandable (e.g., a series of API function calls) doesn't mean that understanding it leads to an understanding of how to apply the API.

One of the problems with modern APIs is that many of them are written to look intuitive, but if you went and investigated the documentation for each of the API functions, you would find a lot of important notes and caveats. The programmer who wrote the code may have been aware of them, but if you simply learn those mini-patterns and try to apply them without understanding the restrictions of the invoked functions, you could eventually end up writing bad code.

Uri
+15  A: 

I go for a top-down approach. Grep around until I find some kind of entry point (like main()) and understand what that function does, then examine the functions it calls. Pretty soon you get an idea for the overall control flow of the code - is there a message loop? is there a processing pipeline?
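
If plain grep is not at hand, the same hunt for an entry point can be sketched in a few lines of Python. This is only a heuristic, and the function names are made up: it looks for a line that starts with main( or with int/void main(, which also covers the GNU style of putting the return type on its own line. It will be fooled by comments and by calls to main at the start of a line.

```python
import re
from pathlib import Path

# A line beginning with "main (" or "int main(" (optionally indented).
MAIN_PATTERN = re.compile(r"^[ \t]*(?:(?:int|void)[ \t]+)?main[ \t]*\(", re.MULTILINE)


def find_main_lines(text):
    """Return the 1-based line numbers where a definition of main() seems to start."""
    return [text.count("\n", 0, match.start()) + 1
            for match in MAIN_PATTERN.finditer(text)]


def find_entry_points(root, pattern="*.c"):
    """Scan every matching file under root and return (path, line_number) pairs."""
    hits = []
    for path in sorted(Path(root).rglob(pattern)):
        if path.is_file():
            text = path.read_text(errors="replace")
            hits.extend((str(path), line) for line in find_main_lines(text))
    return hits


if __name__ == "__main__":
    for path, line in find_entry_points("."):
        print(f"{path}:{line}")
```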

And of course, the more code you read, the more you get used to the idiosyncrasies of that codebase (naming conventions, unit organisation and so on).

Mark Pim
voted up because this is exactly what I do
Oliver N.
If you're doing this, you probably want to use a tool that keeps track of the locations you've explored and lets you backtrack.
Uri
+1. Don't forget to take notes.
Duke Navarre
A tool that lets me backtrack. How about the shell command history?
ardsrk
My "tool" is a whiteboard. I like to create call graphs of code I'm trying to figure out. That way, when I get lost, I can just consult the graph to see how deep in the stack this routine typically is. (Note: a bad sign is when your call graph isn't even close to readable without resorting to three dimensions.)
T.E.D.
+2  A: 

There is a big leap, especially in C, between a code example to show one specific trick, and a program that performs many operations on complex structures.

One of the reasons C++ and Java came to life was to simplify the very common structure-based programming found in complex C programs.

For instance, an operating system has seemingly few lines of code, but it's got a structure of function pointers that the code iterates through in order to run the real meat of the schedulers and other needed code - mostly in the name of modularity so it's easy to replace one algorithm with another, or more importantly support many processors with one codebase. It's not straightforward because the system must be flexible enough to add a new task or module without editing the code itself, instead adding a few items to a few different structures.
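
A toy analogue of that pattern, written in Python rather than C, may make it easier to recognize when you meet it. All names here are invented: HOOKS stands in for the structure of function pointers, and boot() is the generic code that walks it without naming any concrete module.

```python
# The table the core loop walks; modules add themselves to it.
HOOKS = []


def register(func):
    """Append a function to the table (the analogue of filling in a function pointer)."""
    HOOKS.append(func)
    return func


@register
def start_clock(state):
    state["clock"] = 0


@register
def start_scheduler(state):
    state["scheduler"] = "round_robin"


def boot():
    """Generic driver: it only iterates over the table and calls each entry."""
    state = {}
    for hook in HOOKS:
        hook(state)
    return state


if __name__ == "__main__":
    print(boot())
```

Adding a new module means registering one more function; boot() itself never changes, which is exactly why the control flow is hard to follow by reading boot() alone.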

There is a continuum from really complex programs like operating systems, which require many levels of abstraction, to simple text editors. Along the way you'll find different techniques that make the developer's job easier, once you understand how the system actually works.

This will come through experience and code reading. Once you learn a few of the major tricks then new pieces of software will be easier to parse and understand. But the best way to do that is to dig in, play with the code, and ask questions on the developer mailing lists. There are a lot of people that will gladly help you if it means you might work on the code.

Adam Davis
+6  A: 

The biggest obstacle is usually not the way the code is written but the problem it is trying to solve. The domain is the thing you need to understand. When you read through examples, chances are you searched out an example to solve a specific problem which you were already familiar with. Unless you are familiar with the problem domains of operating systems or programming languages, I'm not surprised you find the source to Linux or Python bewildering :-)

Jim Arnold
A: 

Out of your list (Bash, Wget, Linux Kernel, and Python) I would say the reasonable one to start with would be wget. I say this because it is very focused on one sort of task (unlike the others) which will make it easier to follow.

Grep and ctags (or an IDE that can help you navigate between function calls - sort of a "take me to the implementation of this function" thing) are great tools for tracking things down.

Generally when I am trying to understand some alien code I try to narrow down the specific area that I am interested in by guessing at what the file names mean and grepping through the code base for something that looks like it is what I am after. Alternatively I'll find main (again grep is a good tool for that) and follow the flow.

If things are proving hard to understand, I'll run the code in a debugger and step through it. Stepping through code is a great way to learn what it does; I wish more people learned this early on in school.

TofuBeer
+4  A: 

Picking up a many-multi-developer, many-multi-year, many-multi-kloc code base like the Linux kernel or bash and expecting to get the big picture is probably not going to work. Like you, I sometimes have trouble even finding main. (grep, or better yet ack, is your friend.)

However, by picking a particular point of interest - "What code outputs this USB error in the Linux kernel?" or "What part of Samba implements this LDAP option?" - I've had much better luck. Because, honestly, I suspect that very few developers (open or closed source) spend a very large percentage of time working on big picture issues of large code bases; most time is spent working on particular points of interest.

A complicating factor of these codebases is that they have evolved all sorts of special cases to build and compile on a variety of platforms, with a variety of libraries, and a variety of compile-time options. That's not something you'll find in a textbook or in some example code.

Josh Kelley
A: 

Don't try to just blindly go around looking through the entire codebase, trying to understand everything. For larger projects, almost nobody understands the entire source tree. Usually people have components that they know very well, and they tend to develop mainly what they already understand. Trying to take on too much will make you easily flustered.

Try to understand just one thing you are interested in. For example, say you wanted to know how glibc implements POSIX semaphores. You would check out the glibc source code and grep for some of the functions, like sem_post. This would find you several architecture-specific implementations. You might decide to look at the generic version: ./nptl/sysdeps/unix/sysv/linux/sem_post.c. You see two versions. The old one just does a simple atomic increment and then wakes one thing waiting on the semaphore with lll_futex_wake. You might look at the implementation of this function, or you might look at the newer sem_post and see that the algorithm is slightly different: it uses a CAS loop to do the increment, and only attempts the futex wake if the number of waiters is positive. A couple of things you might try to understand about this are: Why does it use a CAS loop instead of just an atomic increment? (For that you'd probably need to look at the revision history. It's probably some architecture that doesn't implement atomic increment in hardware, but does implement CAS.) Why does the read of nwaiters not have to be atomic? (You can probably bet it has something to do with the semantics of a semaphore and the memory barrier before it.) How much performance is gained by only doing the futex wake if there is something waiting?

The idea is to understand one small tiny piece of code. To understand that fully you will have to branch out to other things that it touches. After a little while you will have explored a pretty decent chunk of code without even realizing it.

Greg Rogers
+2  A: 

Experience has shown me that there are 3 major goals you have when learning a system.

* Learn what the code is supposed to do
* Learn how it does it
* (crucially) Learn why it does it the way it does

All three of those goals are very important, and there are a few tricks to help you get started.

First, resist the temptation to just ctrl-click (or whatever your IDE uses) your way around the code to understand everything. You probably won't be able to keep everything in perspective in your mind this way, especially when each line forces you to look at multiple other classes in order to understand what it is.

Read documentation where possible; it usually helps you quickly gain a mental framework upon which to build everything that follows.

Run test cases where possible. Unit tests are your friend.

Don't be afraid to ask someone who knows if you have a question. Granted, you shouldn't waste other employees' time with inane queries, but if there's something that you simply don't understand (this is especially true with more conceptual questions like, "Wouldn't it make much more sense to implement this as a ___" or something), it's probably worth finding out the answer before you mess something up and don't know why.

When you do finally get down to reading the code, start at a logical "main" place and go from there. Don't just read the code top to bottom, or in alphabetical order, or anything (this is probably obvious).

(This is an answer I posted to an older question, "Learning a Legacy Java System")

mandaleeka
and how many times you think it's necessary to repost?
SilentGhost
Doesn't it answer the question? That's good enough reason to post it, I think. It's not like I'm wasting massive server space or anything...
mandaleeka
+1  A: 

I'm surprised no one's mentioned it yet, but if you haven't explored it already, check out "Design Patterns" by Gamma, Helm, Johnson, and Vlissides, or alternatively, "Head First Design Patterns". Design patterns significantly help in abstracting parts of large projects - they'll help you understand other people's code, better abstract your own code and get things done faster, communicate better, and reinforce some concepts of OOP. For example, if you want to look at the source code for an operating system, knowledge of the concurrency patterns would be useful; the Wikipedia article on them is a reasonable place to start.

Once you have the abstraction figured out, you'll have an easier time digging in and figuring out the details.

T.R.
A: 

One of the difficulties with getting into a large, existing codebase such as any of the projects you mentioned, is developing a mental model of the problem domain combined with the design patterns used to approach it.

So, the first goal is to grok the problem domain. For that, you usually have no recourse other than documentation. Don't forget to look at the end user's documentation and even any marketing material. It may even be helpful to talk to end users about what they are trying to achieve with the tool. If available, talking to the product support team, the test team, or both may be at least as valuable, since they live at the boundary between the tool and its users.

The second goal is to find your way around the sources, and build the model of the abstractions, key algorithms, and design patterns in play. I've personally used two general approaches to achieve this, and some combination of the two usually results in progress:

  1. Have a specific goal. Identify a "small" bug or new feature, and set out to fix it or implement it. There's nothing like a specific goal to focus attention on how some part of the system works. Don't be afraid to try different approaches, and unit tests are going to be your friend as you explore.

  2. Just read the code. Use a tool like Doxygen, JavaDoc, or whatever the team is already using to build internals documentation, to get call graphs and to build an easy-to-use index to the code itself. Even when faced with a huge, undocumented pile of plain C, Doxygen will at least get you a comprehensive list of all functions and their obvious relationships. If possible, when faced with underdocumented code, take the opportunity of learning the code base to improve its documentation, too.

RBerteig
A: 

A very hands-on answer, and (unfortunately) VS2005-specific: I like to use the bookmark feature to keep track of where I am, adding my own notes in the bookmark titles and nesting things to show call trees. This is better than simply running Doxygen or other tools on a code base, because it tracks the stuff YOU'RE interested in.

On a related note, I wish IDE developers thought more about code-reading features. Visual Studio seems to have the best bookmarks around (compared to Eclipse, Netbeans, Komodo), but they're far from good. For instance, being able to create multiple bookmarks for a single line of code and nest things to an arbitrary depth (like regular browser bookmarks) would go a long way to tracking a journey through other people's code. I've actually looked into creating a VS addin to do this, but it's way too much work. Microserfs, are you reading?

Failing this, I sometimes put together sequence diagrams in Enterprise Architect to get an idea of what's going on. This, however, is usually pretty time-consuming.

fakeleft
+1  A: 

What I've found helpful is to try to trace the main execution thread through the program. While doing this, I create a whiteboard graph of what routines call what other routines.

In extreme cases (probably not most published source code), it can be easier to just go through every routine in the sources, note what other routines call it (search for the routine's name in the source file(s)), and then build yourself a call graph from the resulting data.
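
The brute-force version can be sketched in Python. This assumes you have already split the sources into a dictionary of routine name to body text (that splitting is the hard part and is not shown); the sample bodies are invented. For each routine it searches every body for the routine's name followed by an opening parenthesis.

```python
import re


def find_callers(bodies):
    """Map each routine name to the sorted list of routines whose body mentions a call to it.

    `bodies` maps a routine name to the text of that routine's body.
    """
    callers = {name: [] for name in bodies}
    for name in bodies:
        # Whole-word match on the name, followed by "(" as a crude sign of a call.
        call = re.compile(r"\b" + re.escape(name) + r"\s*\(")
        for other, body in bodies.items():
            if call.search(body):
                callers[name].append(other)
        callers[name].sort()
    return callers


# Hypothetical routine bodies, standing in for text pulled out of real source files.
BODIES = {
    "main": "init(); run(queue);",
    "init": "",
    "run": "while (step(q)) {}",
    "step": "return 0;",
}

if __name__ == "__main__":
    for routine, called_by in find_callers(BODIES).items():
        print(routine, "<-", called_by)
```

A text search like this cannot see calls made through function pointers or macros, so treat the result as a first draft of the whiteboard graph.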

T.E.D.
+1  A: 

For small to medium size projects (or for individual modules) I like to log the execution of code.

I have written a small script that embeds a logging mechanism inside every routine, as the first block to be executed. Then I run the program, interact with it a bit, and analyze the log file. Step by step, I remove the logging from various routines to clean the log up a bit. This is a very fast way to X-ray other people's code.
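
My script isn't shown here, but the idea can be illustrated in Python without rewriting any source: the interpreter's trace hook (sys.settrace) reports every Python function as it is entered, which gives the same kind of log. Note that this is a different mechanism from embedding log statements, and the helper and sample functions below are made up for the example.

```python
import sys


def trace_calls(func, *args, **kwargs):
    """Run func and return (result, names of every Python function entered, in order)."""
    log = []

    def tracer(frame, event, arg):
        if event == "call":
            log.append(frame.f_code.co_name)
        return None  # no line-by-line tracing inside the function

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args, **kwargs)
    finally:
        sys.settrace(previous)  # always restore whatever hook was there before
    return result, log


# Hypothetical code under study.
def tokenize(text):
    return text.split()


def parse(text):
    return len(tokenize(text))


if __name__ == "__main__":
    result, log = trace_calls(parse, "a b c")
    print(result, log)
```

Only functions written in Python show up in the log; built-ins implemented in C, such as len or str.split, do not generate call events.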

Nick D
A: 

There's a book called Code Reading: The Open Source Perspective, which might help you out. It's intended to help people with your particular problem.

Richard Hein