I need to read a large space-separated text file and count the number of instances of each code in the file. Essentially, these are the results of running some experiments hundreds of thousands of times. The system spits out a text file that looks kind of like this:
A7PS A8PN A6PP23 ...
And there are literally hundreds of thousands of these entries, and I need to count the occurrences of each of the codes.
I guess I could just open a StreamReader and go through it line by line, splitting on the space character, checking whether each code has already been encountered and adding 1 to its count. However, that is probably pretty naive, given the size of the data.
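Something like this is what I had in mind (just a rough sketch; "results.txt" is a made-up file name):

using System;
using System.Collections.Generic;
using System.IO;

class CodeCounter
{
    static void Main()
    {
        // Tally of code -> number of times it has been seen.
        var counts = new Dictionary<string, int>();

        // "results.txt" is a placeholder for the real file path.
        using (var reader = new StreamReader("results.txt"))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Split each line on spaces and count every code.
                foreach (var code in line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
                {
                    int count;
                    counts.TryGetValue(code, out count);
                    counts[code] = count + 1;
                }
            }
        }

        foreach (var pair in counts)
            Console.WriteLine("{0}: {1}", pair.Key, pair.Value);
    }
}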
Anyone know of an efficient algorithm to handle this sort of processing?
UPDATE:
OK, so the consensus seems to be that my approach is along the right lines.
What I'd be interested to hear are things like: which is more efficient - StreamReader, TextReader or BinaryReader?
What is the best structure to store my dictionary of results - HashTable, SortedList or HybridDictionary?
If there are no line breaks in the file (I haven't been given a sample yet), will just splitting the whole thing on a space be inefficient?
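For what it's worth, this is the kind of chunked read I was picturing if the whole file turns out to be one long line - again just a sketch, and the 64K buffer size is an arbitrary guess:

// Requires: using System.Collections.Generic; using System.IO; using System.Text;
static Dictionary<string, int> CountCodesChunked(string path)
{
    var counts = new Dictionary<string, int>();
    var buffer = new char[64 * 1024];   // arbitrary chunk size
    var carry = new StringBuilder();    // holds a code that straddles two chunks

    using (var reader = new StreamReader(path))
    {
        int read;
        while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                char c = buffer[i];
                if (char.IsWhiteSpace(c))
                {
                    // End of a code - count it and reset the carry buffer.
                    if (carry.Length > 0)
                    {
                        Tally(counts, carry.ToString());
                        carry.Length = 0;
                    }
                }
                else
                {
                    carry.Append(c);
                }
            }
        }
    }

    // The file may not end with a space, so flush the last code.
    if (carry.Length > 0)
        Tally(counts, carry.ToString());

    return counts;
}

static void Tally(Dictionary<string, int> counts, string code)
{
    int count;
    counts.TryGetValue(code, out count);
    counts[code] = count + 1;
}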
Essentially, I am looking to make it as performant as possible.
thanks again