I am currently working on an application that parses huge XML files.

Each file is processed differently, but all of them are parsed into a single object model.

Currently, the objects parsed from each XML file will go into a single collection.

This collection is also consulted during parsing: if a similar object already exists, the parser updates that object's properties (for example, by incrementing a count) instead of adding a new one.

Looking at the CPU graph while the application is running, it is clear that it uses only part of the CPU (one core at a time at 100%), so I assume that running it in parallel will help shave off running time.

I am new to parallel programming, so any help is appreciated.

A: 

I suggest that you look at using threads instead of parallel programming.

Threading Tutorial

Nissan Fan
and shared memory, no doubt.
windfinder
Threads are one way of doing parallel programming.
Darin Dimitrov
A: 

Instead of trying to manage threading yourself (which can be a daunting task), I suggest using a parallel library. Look at PLINQ/TPL for what is coming in .NET. CTPs can be downloaded here.
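
As a rough illustration of what the library approach could look like, here is a minimal sketch using `Parallel.ForEach` and `ConcurrentDictionary` from the TPL as it shipped in .NET 4. The `<item name="...">` element, the use of the `name` attribute as the "similar object" key, and the `ParallelParseSketch` and `CountItems` names are assumptions made up for this example, not something from the question. It also parses each document fully in memory for brevity, which may not suit truly huge files.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Xml.Linq;

public static class ParallelParseSketch
{
    // Hypothetical model: every <item name="..."> element is one parsed object,
    // and two items are "similar" when their name attributes match.
    public static ConcurrentDictionary<string, int> CountItems(IEnumerable<string> xmlDocuments)
    {
        // Thread-safe shared collection; replaces the single unsynchronized collection.
        var counts = new ConcurrentDictionary<string, int>();

        // The TPL decides how many documents to process concurrently.
        Parallel.ForEach(xmlDocuments, xml =>
        {
            XDocument doc = XDocument.Parse(xml);
            foreach (XElement item in doc.Descendants("item"))
            {
                string key = (string)item.Attribute("name") ?? "";

                // Atomically insert a count of 1 or increment the existing count.
                counts.AddOrUpdate(key, 1, (k, current) => current + 1);
            }
        });

        return counts;
    }
}
```

The "add or modify the existing object" step is the part that needs care: with a plain collection it would be a race between threads, which is why the sketch pushes it into `AddOrUpdate`.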

Peter Lillevold
A: 

I would suggest the following technique: construct a queue of objects that wait to be processed and dequeue them from multiple threads:

  1. Create an XmlReader and start reading the file node by node while not EOF.
  2. Once you encounter a closing tag, you can deserialize the element's contents into an object.
  3. Put the deserialized object into a Queue.
  4. Check the number of objects in the Queue; if it is greater than N, start a new thread from the ThreadPool that dequeues up to N objects from the queue and processes them.

The access to the queue needs to be synchronized because you will enqueue and dequeue objects from multiple threads.

The difficulty lies in finding a value of N that keeps all the CPU cores busy at the same time.
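
The steps above can be sketched roughly as follows. Note that this is a variation rather than a literal implementation: instead of a manually locked `Queue` and batches of N handed to the `ThreadPool`, it uses a bounded `BlockingCollection<T>` (available from .NET 4) with a fixed number of worker tasks, which takes care of the synchronization and sidesteps tuning N. The `<item name="...">` element, the counting logic, the capacity of 1000 and the `QueueParseSketch` name are all assumptions for illustration.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Linq;
using System.Threading.Tasks;
using System.Xml;
using System.Xml.Linq;

public static class QueueParseSketch
{
    public static ConcurrentDictionary<string, int> Process(TextReader input, int workerCount)
    {
        var counts = new ConcurrentDictionary<string, int>();

        // Bounded queue: the reader blocks when it gets 1000 items ahead of the workers.
        using (var queue = new BlockingCollection<XElement>(1000))
        {
            // Consumers: each task drains the queue until CompleteAdding is called.
            Task[] workers = Enumerable.Range(0, workerCount)
                .Select(i => Task.Run(() =>
                {
                    foreach (XElement item in queue.GetConsumingEnumerable())
                    {
                        // Placeholder for the real per-object processing.
                        string key = (string)item.Attribute("name") ?? "";
                        counts.AddOrUpdate(key, 1, (k, current) => current + 1);
                    }
                }))
                .ToArray();

            try
            {
                // Producer: stream the document node by node on the calling thread.
                using (XmlReader reader = XmlReader.Create(input))
                {
                    reader.MoveToContent();
                    while (!reader.EOF)
                    {
                        if (reader.NodeType == XmlNodeType.Element && reader.Name == "item")
                        {
                            // ReadFrom materializes the element and moves the reader past it,
                            // so no extra Read() is needed on this branch.
                            queue.Add((XElement)XNode.ReadFrom(reader));
                        }
                        else
                        {
                            reader.Read();
                        }
                    }
                }
            }
            finally
            {
                // Signal "no more items" and wait for the workers to finish.
                queue.CompleteAdding();
                Task.WaitAll(workers);
            }
        }

        return counts;
    }
}
```

A call such as `QueueParseSketch.Process(File.OpenText(path), Environment.ProcessorCount)` would be one plausible way to use it. Whether this actually helps depends on the per-object processing being expensive relative to reading the XML, since the reading itself still happens on a single thread.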

Darin Dimitrov