views: 167
answers: 6

Here is the situation:

I am making a small program to parse server log files.

I tested it with a log file containing several thousand requests (somewhere between 10,000 and 20,000; I don't know exactly).

What I have to do is load the log text files into memory so that I can query them.

This is what takes the most resources.

The methods that take the most CPU time are these (worst culprits first):

string.split - splits the line values into an array of values

string.contains - checks whether the user agent contains a specific agent string (to determine the browser ID)

string.tolower - various purposes

streamreader.readline - reads the log file line by line

string.startswith - determines whether a line is a column-definition line or a line with values

There were some others that I was able to replace. For example, the dictionary getter was also taking a lot of resources, which I had not expected, since it's a dictionary and its keys should be indexed. I replaced it with a multidimensional array and saved some CPU time.

Now I am running this on a fast dual core, and the total time it takes to load the file I mentioned is about 1 second.

This is really bad.

Imagine a site that gets tens of thousands of visits a day. It's going to take minutes to load the log file.

So what are my alternatives, if any? I think this is just a .NET limitation and I can't do much about it.

EDIT:

If some of you gurus want to look at the code and find the problem, here are my code files:

The function that takes the most resources is by far LogEntry.New. The function that loads all the data is Data.Load.

Total number of LogEntry objects created: 50,000. Time taken: 0.9-1.0 seconds.

CPU: AMD Phenom II X2 545, 3 GHz.

Not multithreaded.

+1  A: 

You could try RegEx. Or change the business process so that a load at that speed happens at a more convenient point.

machine elf
I replaced some functions with regexes some time ago; it was slower. There might still be a minor improvement to be found somewhere, but the chance is small.
diamandiev
Did you compile the regex statements?
Tom Anderson
As in: Regex r = new Regex(@"(my)? +regex", RegexOptions.Compiled);
Callum Rogers
+2  A: 

Have you already looked at memory mapped files? (That's in .NET 4.0, though.)

EDIT: Also, is it possible to split those large files into smaller ones and parse the smaller files? This is something we have done with some of our large files, and it was faster than parsing the giant files.
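
For what it's worth, a minimal sketch of reading a log through a memory-mapped view (this assumes .NET 4.0's System.IO.MemoryMappedFiles; the method name and the idea of mapping the whole file are just illustrative):

    using System.IO;
    using System.IO.MemoryMappedFiles;

    static void LoadViaMemoryMap(string path)
    {
        // Map the whole file and read it line by line through a view stream.
        using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
        using (var stream = mmf.CreateViewStream())
        using (var reader = new StreamReader(stream))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Parse the line as before. Note that the view can be padded
                // with trailing NUL bytes up to a page boundary, so the last
                // "line" may need trimming.
            }
        }
    }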

ydobonmai
It seems that the problem is not the I/O but the CPU. I have an SSD, by the way.
diamandiev
+3  A: 

Without seeing your code, it's hard to know whether you've got any mistakes there which are costing you performance. Without seeing some sample data, we can't reasonably try experiments to see how we'd fare ourselves.

What was your dictionary key before? Moving to a multi-dimensional array sounds like an odd move - but we'd need more information to know what you were doing with the data before.

Note that unless you're explicitly parallelizing the work, having a dual core machine won't make any difference. If you're really CPU bound then you could parallelize - although you'd need to do so carefully; you would quite probably want to read a "chunk" of text (several lines) and ask one thread to parse it rather than handing off one line at a time. The resulting code would probably be significantly more complex though.
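
For illustration, a rough C# sketch of the "parse a chunk per task" idea (the question's code is VB.NET, but the shape is the same; this assumes .NET 4's Task Parallel Library, and ParseChunk is a hypothetical helper that turns a batch of lines into LogEntry objects):

    using System.Collections.Generic;
    using System.IO;
    using System.Threading.Tasks;

    static List<LogEntry> LoadInParallel(string path)
    {
        const int ChunkSize = 1000;
        var tasks = new List<Task<List<LogEntry>>>();

        using (var reader = new StreamReader(path))
        {
            var chunk = new List<string>(ChunkSize);
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                chunk.Add(line);
                if (chunk.Count == ChunkSize)
                {
                    var work = chunk;                        // capture the completed chunk
                    tasks.Add(Task.Factory.StartNew(() => ParseChunk(work)));
                    chunk = new List<string>(ChunkSize);     // start a fresh one
                }
            }
            if (chunk.Count > 0)
                tasks.Add(Task.Factory.StartNew(() => ParseChunk(chunk)));
        }

        // Reading stays single-threaded; parsing runs on the thread pool.
        var entries = new List<LogEntry>();
        foreach (var t in tasks)
            entries.AddRange(t.Result);   // Result blocks until that chunk is done
        return entries;
    }

Whether this actually wins depends on how expensive the per-line parsing is compared to the I/O.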

I don't know whether one second for 10,000 lines is reasonable or not, to be honest - if you could post some sample data and what you need to do with it, we could give more useful feedback.

EDIT: Okay, I've had a quick look at the code. A few thoughts...

Most importantly, this probably isn't something you should do "on demand". Instead, parse periodically as a background process (e.g. when logs roll over) and put the interesting information in a database - then query that database when you need to.

However, to optimise the parsing process:

  • I would personally not keep checking whether the StreamReader is at the end - just call ReadLine until the result is Nothing.
  • If you're expecting the "#fields" line to come first, then read that outside the loop. Then you don't need to see whether you've already got the fields on every iteration.
  • If you know a line is non-empty, it's possible that testing for the first character being '#' could be faster than calling line.StartsWith("#") - I'd have to test.
  • You're scanning through the fields every time you ask for the date, time, URI stem or user agent; instead, when you parse the "#fields" line you could create an instance of a new LineFormat class which can cope with any field names, but specifically remembers the index of fields that you know you're going to want. This also avoids copying the complete list of fields for each log entry, which is pretty wasteful.
  • When you split the string, you have more information than normal: you know how many fields to expect, and you know you're only splitting on a single character. You could probably write an optimised version of this (there's a rough sketch after this list).
  • It may be faster to parse the date and time fields separately and then combine the result, rather than concatenating them and then parsing. I'd have to test it.
  • Multi-dimensional arrays are significantly slower than single-dimensional arrays. If you do want to keep to the "copy all the field names per entry" idea, it would be worth separating into two arrays: one for the fields, one for the values.
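
As a rough illustration of the splitting point above, a single-separator splitter might look something like this (a C# sketch only; in the real code the expected field count would come from the "#fields" line):

    using System;

    // Splits on a single known separator into a pre-sized array, avoiding
    // String.Split's more general bookkeeping. Any extra separators beyond
    // the expected count end up in the last field.
    static string[] SplitFields(string line, int expectedFields)
    {
        var fields = new string[expectedFields];
        int count = 0;
        int start = 0;

        for (int i = 0; i < line.Length && count < expectedFields - 1; i++)
        {
            if (line[i] == ' ')
            {
                fields[count++] = line.Substring(start, i - start);
                start = i + 1;
            }
        }

        fields[count++] = line.Substring(start);   // last field runs to the end

        if (count < expectedFields)                // short line: trim the array
            Array.Resize(ref fields, count);

        return fields;
    }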

There are probably other things, but I'm afraid I don't have the time to go into them now :(

Jon Skeet
It's quite a lot of code; I'll try to post some later.
diamandiev
Added the code files; you can check them out if you like. Thanks :)
diamandiev
You are great, thanks. I'll try the things you suggested.
diamandiev
A: 

You could try lazy loading: for example, read the file 4096 bytes at a time, look for line endings, and save all the line-ending offsets in an array. Then, if some part of your program wants LogEntry N, look up the start position of that line, read it, and create the LogEntry object on the fly. (This is a bit easier with memory-mapped files.) As possible optimizations, if the calling code usually needs consecutive LogEntries, your code could e.g. read ahead the next 100 log entries automatically. You could also cache the last 1000 entries that were accessed.
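
A rough sketch of that line-offset index (C#; the 4096-byte buffer follows the suggestion above, and how a raw line becomes a LogEntry is left to whatever constructor the existing code already has):

    using System.Collections.Generic;
    using System.IO;

    class LazyLog
    {
        private readonly string _path;
        private readonly List<long> _lineOffsets = new List<long>();

        // Scan the file once, 4096 bytes at a time, recording where each line starts.
        public LazyLog(string path)
        {
            _path = path;
            using (var fs = File.OpenRead(path))
            {
                var buffer = new byte[4096];
                long position = 0;
                _lineOffsets.Add(0);                              // first line starts at offset 0
                int read;
                while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
                {
                    for (int i = 0; i < read; i++)
                        if (buffer[i] == (byte)'\n')
                            _lineOffsets.Add(position + i + 1);   // next line starts after '\n'
                    position += read;
                }
                // If the file ends with a newline, the last offset points at an empty line.
            }
        }

        public int Count { get { return _lineOffsets.Count; } }

        // Materialise line N only when it is actually asked for.
        public string GetLine(int n)
        {
            using (var fs = File.OpenRead(_path))
            {
                fs.Position = _lineOffsets[n];
                using (var reader = new StreamReader(fs))
                    return reader.ReadLine();
            }
        }
    }

The caller would then construct a LogEntry from GetLine(n) only for the entries it actually touches.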

nikie
Well, I need all the log entries I can get, because right after everything is loaded I have to display the counts of all requests, pageviews, visitors, etc. If they are not loaded into memory I cannot count them. For example, to count the pageviews I must load the user agent; I can't just count the lines.
diamandiev
A: 

Have you considered loading log entries into a database and querying from there? This way, you'd be able to skip parsing log entries you've already stored in the database.
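
A heavily simplified sketch of that approach (the table name, columns, connection string, the newEntries collection, and the LogEntry property names are all made up for illustration; the real schema would follow whatever fields you actually query):

    using System.Collections.Generic;
    using System.Data.SqlClient;

    static void ImportAndCount(string connectionString, IEnumerable<LogEntry> newEntries)
    {
        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();

            // Import only the entries that haven't been stored yet.
            foreach (var entry in newEntries)
            {
                using (var cmd = new SqlCommand(
                    "INSERT INTO LogEntries (LogTime, UriStem, UserAgent) VALUES (@t, @u, @a)", conn))
                {
                    cmd.Parameters.AddWithValue("@t", entry.Time);
                    cmd.Parameters.AddWithValue("@u", entry.UriStem);
                    cmd.Parameters.AddWithValue("@a", entry.UserAgent);
                    cmd.ExecuteNonQuery();
                }
            }

            // Counts then come from the database instead of re-parsing the file.
            using (var cmd = new SqlCommand("SELECT COUNT(*) FROM LogEntries", conn))
            {
                int totalRequests = (int)cmd.ExecuteScalar();
            }
        }
    }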

John Saunders
A: 

You can do several things:

A Windows service which continuously parses the log each time it changes. Your UI then queries this service.

Or you can parse it every minute or so and cache the result. Do you really need it to be in real time? Maybe it only needs to be parsed once?

Nicolas Dorier