I have a service running which periodically checks a folder for a file and then processes it (reads it, extracts the data, stores it in SQL).

So I ran it on a test box and it took a little longer than expected. The file had 1.6 million rows, and it was still running after 6 hours (then I went home).

The problem is the box it is running on is now absolutely crippled - remote desktop was timing out, so I can't even get on it to stop the process or attach a debugger to see how far through it is. It's solidly using 90%+ CPU, and all other running services and apps are suffering.

The code is (from memory, may not compile):

List<ItemDTO> items = new List<ItemDTO>();
using (StreamReader sr = fileInfo.OpenText())
{
    while (!sr.EndOfStream)
    {
        string line = sr.ReadLine();
        try
        {
            string s = line.Substring(0, 8);
            double y = Double.Parse(line.Substring(8, 7));

            // If the item isn't already in the collection, add it.
            if (items.Find(delegate(ItemDTO i) { return i.Item == s; }) == null)
                items.Add(new ItemDTO(s, y));
        }
        catch { /* swallow bad rows and keep going */ }
    }
    return items;
}

So I am working on improving the code (any tips appreciated).

But it could still be a slow affair, which is fine; I've no problem with it taking a long time as long as it's not killing my server.

So what I want from you fine people is: 1) Is my code hideously un-optimized? 2) Can I limit the amount of CPU my code block may use?

Cheers all

A: 

Do you really need to hold all the data in memory? You could store it in a database (if you need something simple and powerful, use SQLite) and process it using SQL.
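A minimal sketch of that approach, assuming the System.Data.SQLite provider, a hypothetical items.db file, and an Items table keyed on the item code (all names here are illustrative):

using System;
using System.Data.SQLite;
using System.IO;

static void LoadIntoSqlite(FileInfo fileInfo)
{
    using (SQLiteConnection conn = new SQLiteConnection("Data Source=items.db"))
    {
        conn.Open();
        using (SQLiteCommand create = new SQLiteCommand(
            "CREATE TABLE IF NOT EXISTS Items (Item TEXT PRIMARY KEY, Value REAL)", conn))
        {
            create.ExecuteNonQuery();
        }

        using (SQLiteTransaction tx = conn.BeginTransaction())
        using (SQLiteCommand insert = new SQLiteCommand(
            "INSERT OR IGNORE INTO Items (Item, Value) VALUES (@item, @value)", conn))
        using (StreamReader sr = fileInfo.OpenText())
        {
            insert.Transaction = tx;
            string line;
            while ((line = sr.ReadLine()) != null)
            {
                insert.Parameters.Clear();
                insert.Parameters.AddWithValue("@item", line.Substring(0, 8));
                insert.Parameters.AddWithValue("@value", Double.Parse(line.Substring(8, 7)));
                insert.ExecuteNonQuery();
            }
            tx.Commit(); // INSERT OR IGNORE rejects duplicates, so no in-memory lookup is needed
        }
    }
}

The duplicate check becomes a primary-key constraint instead of a linear search, and memory stays flat no matter how big the file is.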

Giorgi
+1  A: 

Can't you bulk load this file with the SqlBulkCopy class and then do the processing on the database server?
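Roughly like this (a sketch only - it assumes a staging table dbo.RawItems and a connectionString you already have; de-duplication then happens in SQL on the server):

using System.Data;
using System.Data.SqlClient;
using System.IO;

DataTable table = new DataTable();
table.Columns.Add("Item", typeof(string));
table.Columns.Add("Value", typeof(double));

using (StreamReader sr = fileInfo.OpenText())
{
    string line;
    while ((line = sr.ReadLine()) != null)
        table.Rows.Add(line.Substring(0, 8), Double.Parse(line.Substring(8, 7)));
}

using (SqlBulkCopy bulk = new SqlBulkCopy(connectionString))
{
    bulk.DestinationTableName = "dbo.RawItems";
    bulk.BatchSize = 10000;        // send rows in batches rather than one huge push
    bulk.WriteToServer(table);     // WriteToServer also accepts an IDataReader if you want to stream
}
// then on the server: INSERT INTO Items SELECT DISTINCT ... FROM RawItems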

SQLMenace
I have now started getting System.OutOfMemoryException thrown. I would still need to store it in memory to use SqlBulkCopy though, wouldn't I? In fact, the only way I know of doing that would be to create a DataTable object in memory and fill it up, much like I am doing already.
jb
+5  A: 
  1. Doing the Find on the list is an O(n) operation, which means that as the list gets longer it takes longer to search for items. Consider putting the items into a HashSet in .NET 3.5/4.0, or a Dictionary for earlier versions of .NET, which can act like an index. If you need the items in the list to maintain the original order, you can continue to put them in the list, but use the HashSet/Dictionary to do the checks (see the sketch after this list).

  2. You could also run this code in a BackgroundWorker thread; this will help keep the UI responsive while the process is running.
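A rough sketch of point 1 against the original loop, assuming .NET 3.5 so HashSet<T> is available:

List<ItemDTO> items = new List<ItemDTO>();
HashSet<string> seen = new HashSet<string>();

using (StreamReader sr = fileInfo.OpenText())
{
    string line;
    while ((line = sr.ReadLine()) != null)
    {
        string s = line.Substring(0, 8);
        double y = Double.Parse(line.Substring(8, 7));

        // HashSet<T>.Add returns false if the key is already present,
        // so the duplicate check and the insert are a single O(1) operation.
        if (seen.Add(s))
            items.Add(new ItemDTO(s, y));
    }
    return items;
}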

Chris Taylor
+1  A: 

In answer to 1), I'd use a sorted list (if there is a lot of redundant data) or a hash dictionary instead of a regular list to speed up searching.

Here is another post that will help you decide between the two approaches.

For question 2), I'd set the thread priority to lower than normal. See here.
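For example (a sketch; it assumes the parsing lives in a parameterless ProcessFile method that you start on its own thread):

using System.Threading;

Thread worker = new Thread(ProcessFile);       // ProcessFile is your existing parsing routine
worker.Priority = ThreadPriority.BelowNormal;  // let other work on the box run first
worker.IsBackground = true;
worker.Start();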

Laramie
+6  A: 

Rather than limit its CPU usage, you'd probably be better off setting it to idle priority, so it'll only run when there's nothing else for the box to do. Others have already mentioned optimization possibilities, so I won't try to get into that part.
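A one-line sketch of that, run at service start-up (you can also set it from Task Manager):

using System.Diagnostics;

// Drop the whole process to idle priority so it only gets CPU when nothing else wants it.
Process.GetCurrentProcess().PriorityClass = ProcessPriorityClass.Idle;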

Jerry Coffin
+2  A: 

Find on a List is O(n). If the file has 1.6 million rows (i.e. 1.6 million items), you'll be repeatedly walking a list of a million-plus items, which will waste a lot of time.

As others have suggested, if you're doing a lot of searching, you need a better data structure - one that is designed for faster searches.

If you're using .NET 3.5, you can use the HashSet collection, which gives you amortized O(1) searches, or the Dictionary collection if you're on .NET 2.0.

Next you have to ask yourself: if the file has 1.6 million rows, do you have enough memory? If you do, then parsing the file in memory will be faster than sending it to the database to check for duplicates, but if you don't have enough memory, you'll be paging. A lot. (Which is probably what is happening now.)

Alan
A: 
  • HashSet
  • Threading with lower priority
  • Some sort of SQL bulk insert
Adabada
+1  A: 

As others have said, fix the data structure.

Now, my eyes hit this phrase "periodically checks a folder for a file and then processes it." How often is "periodically" and why process a file that likely has not changed?

You may want to take a look at System.IO.FileSystemWatcher http://msdn.microsoft.com/en-us/library/system.io.filesystemwatcher.aspx
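A minimal sketch (the folder path and filter are placeholders; ProcessFile stands in for your existing parsing routine):

using System.IO;

FileSystemWatcher watcher = new FileSystemWatcher(@"C:\DropFolder");
watcher.Filter = "*.txt";
watcher.Created += delegate(object sender, FileSystemEventArgs e)
{
    ProcessFile(e.FullPath);   // fires once when a new file appears, instead of polling
};
watcher.EnableRaisingEvents = true;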

Karl Strings
A: 

I'm not a C# programmer, but looking at the logic, I think:

  1. You are creating a new string object every time through the loop. If I had to do it in Java, rather than using a String object, I would have used a StringBuffer.

  2. Your data file is big, so I think you should have logic to flush the information to the database after every 'n' records and clear it from memory. You will need additional logic to record which records have been flushed so far. Alternatively, since your logic captures only the first row of data and ignores subsequent duplicates, rather than using the Find method, can't you just try to insert the data and catch the SQL failure (see the sketch after this list)?

  3. The processing logic should be in a separate thread to keep the system responsive.
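A sketch of the insert-and-catch idea from point 2, assuming an Items table with a primary key on the item code and a connectionString you already have:

using System.Data.SqlClient;
using System.IO;

const int batchSize = 1000;   // flush every 1,000 rows so nothing piles up in memory
int inBatch = 0;

using (SqlConnection conn = new SqlConnection(connectionString))
using (StreamReader sr = fileInfo.OpenText())
{
    conn.Open();
    SqlTransaction tx = conn.BeginTransaction();
    string line;
    while ((line = sr.ReadLine()) != null)
    {
        using (SqlCommand cmd = new SqlCommand(
            "INSERT INTO Items (Item, Value) VALUES (@item, @value)", conn, tx))
        {
            cmd.Parameters.AddWithValue("@item", line.Substring(0, 8));
            cmd.Parameters.AddWithValue("@value", Double.Parse(line.Substring(8, 7)));
            try { cmd.ExecuteNonQuery(); }
            catch (SqlException) { /* primary-key violation: duplicate row, first one wins */ }
        }

        if (++inBatch == batchSize)
        {
            tx.Commit();                   // commit the batch
            tx = conn.BeginTransaction();  // start the next one
            inBatch = 0;
        }
    }
    tx.Commit();
}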

prasrob