We're building tools to mine information from the web. We have several pieces, such as

  • Crawl data from the web
  • Extract information based on templates & business rules
  • Parse results into database
  • Apply normalization & filtering rules
  • Etc, etc.

The problem is troubleshooting issues & having a good "high-level picture" of what's happening at each stage.

What techniques have helped you understand and manage complex processes?

  • Use workflow tools like Windows Workflow Foundation
  • Encapsulate separate functions into command-line tools & use scripting tools to link them together
  • Write a Domain-Specific Language (DSL) to specify, at a higher level, in what order things should happen.

Just curious how you get a handle on a system with many interacting components. We'd like to document and understand how the system works at a higher level than by tracing through the source code.
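
To make the second bullet concrete, here is a minimal sketch (not from the question itself) of driving hypothetical per-stage command-line tools from a small Python script; the tool names and flags are placeholders, not real commands:

    # Sketch: chain hypothetical per-stage command-line tools from one script.
    # The commands (crawl, extract, parse, normalize) and their flags are made up.
    import subprocess
    import sys

    STAGES = [
        ["crawl", "--seeds", "seeds.txt", "--out", "pages/"],
        ["extract", "--templates", "templates/", "--in", "pages/", "--out", "records.jsonl"],
        ["parse", "--in", "records.jsonl", "--db", "mining.sqlite"],
        ["normalize", "--db", "mining.sqlite", "--rules", "rules.yaml"],
    ]

    for stage in STAGES:
        print("== running:", " ".join(stage), file=sys.stderr)
        result = subprocess.run(stage)
        if result.returncode != 0:
            # Stop at the first failing stage so the failure point is obvious.
            sys.exit("stage failed: {} (exit code {})".format(stage[0], result.returncode))

One nice property of this shape is that each stage can also be rerun by hand while troubleshooting.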

+3  A: 

I use AT&T's famous Graphviz; it's simple and does the job nicely. It's the same tool Doxygen uses, too.

Also, with a little effort you can get very nice-looking graphs.

Forgot to mention how I use it: because Graphviz parses plain-text DOT scripts, I have the system log events in Graphviz (DOT) format, so I then just run the log file through Graphviz and get a nice graph.
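
A minimal sketch of that logging idea (the stage names, counts, and file layout are illustrative, not from the answer): each component appends one DOT edge per hand-off, so the accumulated log is almost a ready-to-render Graphviz input.

    # Sketch: log pipeline hand-offs as Graphviz (DOT) edge statements.
    # Stage names, labels, and the log path are illustrative.

    def log_handoff(src, dst, label, log_path="pipeline.dot.log"):
        """Append one edge per hand-off; each line is a valid DOT statement."""
        with open(log_path, "a") as f:
            f.write('  "{}" -> "{}" [label="{}"];\n'.format(src, dst, label))

    # Events emitted while the pipeline runs:
    log_handoff("crawler", "extractor", "1204 pages")
    log_handoff("extractor", "parser", "38911 records")
    log_handoff("parser", "normalizer", "38455 rows")

    # Wrap the logged edges in a digraph and render, e.g. from a shell:
    #   (echo 'digraph pipeline {'; cat pipeline.dot.log; echo '}') | dot -Tpng -o pipeline.png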

Robert Gould
Logging info into Graphviz format is a really cool idea -- thanks!
kurious
+2  A: 

The code says what happens at each stage. Using a DSL would be a boon, but possibly not if it comes at the cost of writing your own scripting-language and/or compiler.

Higher-level documentation should not include details of what happens at each step; it should provide an overview of the steps and how they relate to one another.

Good tips:

  • Visualize your database schema relations.
  • Use Visio or other tools (like the one you mentioned - haven't used it) for process overviews (imho this belongs in the specification of your project).
  • Make sure your code is properly structured / compartmentalized / etc.
  • Make sure you have some sort of project specification (or some other "general" documentation that explains what the system does on an abstract level).

I wouldn't recommend building command-line tools unless you actually have a use for them. There's no need to maintain tools you don't use. (That's not the same as saying they can't be useful; but most of what you describe sounds like it belongs in a library rather than in external processes.)
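
On the DSL point above: a full external language is rarely necessary; declaring the pipeline order inside the host language (an "embedded DSL", kept in a library as suggested here) can give the same high-level view. A hypothetical sketch, with placeholder stage functions:

    # Sketch of an embedded DSL: the pipeline order is declared in plain Python,
    # so no custom parser or compiler is needed. Stage bodies are placeholders.

    def crawl(ctx):      ctx["pages"] = ["<html>example</html>"]        # fetch pages
    def extract(ctx):    ctx["records"] = [{"title": "example"}]        # templates & rules
    def parse(ctx):      ctx["rows"] = list(ctx["records"])             # map records to rows
    def normalize(ctx):  ctx["rows"] = [r for r in ctx["rows"] if r]    # filter/normalize

    # The high-level description of the system is this one list:
    PIPELINE = [crawl, extract, parse, normalize]

    def run(pipeline):
        ctx = {}
        for stage in pipeline:
            print("running", stage.__name__)
            stage(ctx)
        return ctx

    if __name__ == "__main__":
        run(PIPELINE)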

Thanks for the tips -- I like Visio / specs, but inevitably they seem to get out of date. Ideally the visualization could come from the code itself (like the DB schema relations). Agree -- command-line tools for their own sake aren't useful, but sometimes scripts are easier to scan than code.
kurious
+1  A: 

My company writes functional specifications for each major component. Each spec follows a common format and uses various diagrams and pictures as appropriate. Our specs have a functional part and a technical part. The functional part describes what the component does at a high level (why, what goals it meets, what it does not do, what it interacts with, related external documents, etc.). The technical part describes the most important classes in the component and any high-level design patterns.

We prefer text because it is the most versatile and the easiest to update. This is a big deal -- not everyone is an expert (or even decent) at Visio or Dia, and that can be an obstacle to keeping the documents up to date. We write the specs on a wiki so that we can easily link between specifications (as well as track changes), which allows for a non-linear walk through the system.

For an argument from authority, Joel recommends Functional Specs here and here.

ARKBAN
+1  A: 

I find a dependency structure matrix a helpful way to analyze the structure of an application. A tool like Lattix could help.

Depending on your platform and toolchain, there are many really useful static analysis packages that could help you document the relationships between subsystems or components of your application. For the .NET platform, NDepend is a good example; there are many others for other platforms, though.

Having a good design or model before building the system is the best way to give the team an understanding of how the application should be structured, but tools like those I mentioned can help enforce architectural rules and will often give you insights into the design that just trawling through the code cannot.
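
As a rough illustration of what a dependency structure matrix captures (a toy example, not tied to Lattix or NDepend; the module names are invented): rows and columns are components, and a 1 in cell (i, j) means component i depends on component j.

    # Toy dependency structure matrix (DSM). With components ordered from
    # low-level to high-level, a clean layered design shows up as a
    # lower-triangular matrix with no entries above the diagonal.
    modules = ["db", "crawler", "extractor", "parser", "normalizer"]
    deps = {
        "db":         [],
        "crawler":    [],
        "extractor":  ["crawler"],
        "parser":     ["extractor"],
        "normalizer": ["parser", "db"],
    }

    print(" " * 12 + " ".join("{:>4}".format(m[:4]) for m in modules))
    for row in modules:
        cells = " ".join("{:>4}".format(1 if col in deps[row] else 0) for col in modules)
        print("{:<12}{}".format(row, cells))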

Hamish Smith
A: 

Top-down design helps a lot. One mistake I see is making the top-down design sacred. Your top-level design needs to be reviewed and updated just like any other section of code.

Jim C
+1  A: 

I wouldn't use any of the tools you mentioned.

You need to draw a high-level diagram (I like pencil and paper).

I would design a system that has different modules doing different things; it would be worthwhile to design this so that you can have many instances of every module running in parallel.

I would think about using multiple queues for

  • URLs to Crawl
  • Crawled pages from the web
  • Extracted information based on templates & business rules
  • Parsed results
  • Normalized & filtered results

You would have simple programs (probably command-line, with no UI) that read data from one queue and insert data into one or more other queues (the crawler would feed both the "URLs to Crawl" and "Crawled pages from the web" queues). You could use:

  • A web crawler
  • A data extractor
  • A parser
  • A normalizer and filterer

These would fit between the queues, and you could run many copies of these on separate PCs, allowing this to scale.

The last queue could be fed to another program that actually posts everything into a database for actual use.
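
A minimal sketch of one such queue-fed worker, using Python's standard-library queue purely as a stand-in for whatever real queuing system the separate machines would share (the extraction logic is a placeholder):

    # Sketch: a queue-fed worker stage. queue.Queue stands in for a real
    # message queue; the "extraction" is a placeholder.
    import queue
    import threading

    crawled_pages = queue.Queue()      # input: pages produced by the crawler
    extracted_records = queue.Queue()  # output: records for the parser

    def extractor_worker():
        while True:
            page = crawled_pages.get()
            if page is None:                                 # sentinel: shut this worker down
                break
            extracted_records.put({"length": len(page)})     # placeholder extraction

    # Many copies of the same worker can run in parallel (or on separate PCs,
    # if the in-memory queues are replaced by a real message broker).
    workers = [threading.Thread(target=extractor_worker) for _ in range(4)]
    for w in workers:
        w.start()

    for page in ["<html>a</html>", "<html>bb</html>"]:
        crawled_pages.put(page)
    for _ in workers:                  # one sentinel per worker
        crawled_pages.put(None)
    for w in workers:
        w.join()

    print(extracted_records.qsize(), "records extracted")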

Osama ALASSIRY
Thanks, I like the conceptual idea of having separate queues.
kurious
A: 

It's important to partition these components throughout your software development life cycle - design time, development time, testing, release and runtime. Just drawing a diagram isn't enough.

I have found that adopting a microkernel architecture can really help "divide and conquer" this complexity. The essence of the microkernel architecture is:

  • Processes (each component runs in an isolated memory space)
  • Threads (each component runs on a separate thread)
  • Communication (components communicate through a single, simple message passing channel)

I have written a fairly complex batch processing system, which sounds similar to your system, using:

  • Each component maps to a .NET executable
  • Executable lifetimes are managed through Autosys (all on the same machine)
  • Communication takes place through TIBCO Rendezvous

If you can use a toolkit that provides some runtime introspection, even better. For example, Autosys lets me see what processes are running and what errors have occurred, while TIBCO lets me inspect message queues at runtime.
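
To show the shape of "isolated processes talking over one simple channel" without tying it to any particular middleware (this is not Autosys or TIBCO code, just Python's multiprocessing as a stand-in):

    # Sketch of the microkernel shape: every component runs in its own process
    # and communicates only through simple message queues.
    import multiprocessing as mp

    def component(name, inbox, outbox):
        """Generic component: read a message, do its one job, pass it on."""
        while True:
            msg = inbox.get()
            if msg is None:                      # sentinel: shut down and propagate
                outbox.put(None)
                break
            msg = dict(msg)
            msg["seen_by"] = msg.get("seen_by", []) + [name]
            outbox.put(msg)

    if __name__ == "__main__":
        q1, q2, q3 = mp.Queue(), mp.Queue(), mp.Queue()
        stages = [
            mp.Process(target=component, args=("extractor", q1, q2)),
            mp.Process(target=component, args=("parser", q2, q3)),
        ]
        for p in stages:
            p.start()

        q1.put({"url": "http://example.com"})
        q1.put(None)

        msg = q3.get()
        while msg is not None:                   # print results until the sentinel arrives
            print(msg)                           # each message records which components handled it
            msg = q3.get()
        for p in stages:
            p.join()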

Mo Flanagan
A: 

I like to use NDepend to reverse engineer complex .NET code bases. The tool comes with several great visualization features, like:

  • Dependency Graph
  • Dependency Matrix
  • Code metric visualization through treemapping

Patrick Smacchia - NDepend dev