I need to read a large space-separated text file and count the number of instances of each code in the file. Essentially, these are the results of running some experiments hundreds of thousands of times. The system spits out a text file that looks kind of like this:

A7PS A8PN A6PP23 ...

And there are literally hundreds of thousands of these entries, and I need to count the occurrences of each of the codes.

I guess I could just open a StreamReader and go through line by line, splitting on the space character, checking whether each code has already been encountered and adding 1 to its count. However, that is probably pretty naive, given the size of the data.

Anyone know of an efficient algorithm to handle this sort of processing?

UPDATE:

OK, so the consensus seems to be that my approach is along the right lines.

What I'd be interested to hear are things like: which is more efficient - StreamReader, TextReader, or BinaryReader?

What is the best structure to store my dictionary of results? Hashtable, SortedList, HybridDictionary?

If there are no line breaks in the file (I haven't been given a sample yet), will just splitting the whole thing on a space be inefficient?

Essentially, I am looking to make this as performant as possible.

thanks again

+2  A: 

I would say that in general your approach is right, but there is scope for parallelism. I would suggest that you start multiple threads or tasks (in .NET 4), each parsing a part/chunk of the file. Also, instead of reading line by line, read in chunks of bytes - this will give better performance from a disk I/O perspective.

Edit: Here's an outline of the solution.

  1. Let's say we will process M chunks of N characters at a time (because we want to limit the amount of memory needed and the number of threads used).
  2. Allocate an N*M character buffer. We will use this buffer cyclically.
  3. Use the producer-consumer pattern. The producer will fill the buffer and will try to find a word boundary near each chunk boundary (i.e. near every Nth character), so that we end up with M chunks of approximately N characters, each with a start and end index within the buffer.
  4. Now launch M worker threads to process the chunks. Each worker will use its own dictionary to count words - this eliminates the need for thread synchronization.
  5. Aggregate the results at the end of each iteration, and repeat the process until the entire file has been read.

Of course, I am assuming really huge files to justify this approach. I would probably use an old-style character lookup in the buffer to find word boundaries, marking the lookup code as unsafe to avoid bounds checks.
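To make the outline concrete, here is a simplified sketch of the idea (my own illustration, with hypothetical names, not a definitive implementation). It drops the fixed M-task cap and the cyclic buffer for brevity, but keeps the key points: chunks snapped to word boundaries, a private dictionary per task, and a final merge.

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Threading.Tasks;

    static class ChunkedCounter
    {
        public static Dictionary<string, int> Count(string path, int chunkSize = 1 << 20)
        {
            var tasks = new List<Task<Dictionary<string, int>>>();
            using (var reader = new StreamReader(path))
            {
                var buffer = new char[chunkSize];
                string carry = "";                          // partial token left over from the previous chunk
                int read;
                while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
                {
                    string text = carry + new string(buffer, 0, read);
                    int cut = text.LastIndexOf(' ');        // word boundary near the chunk boundary
                    if (cut < 0) { carry = text; continue; }
                    carry = text.Substring(cut + 1);
                    string chunk = text.Substring(0, cut);
                    tasks.Add(Task.Factory.StartNew(() => CountChunk(chunk)));
                }
                if (carry.Length > 0)
                    tasks.Add(Task.Factory.StartNew(() => CountChunk(carry)));
            }
            var merged = new Dictionary<string, int>();     // aggregate the per-task results
            foreach (var task in tasks)
                foreach (var pair in task.Result)
                {
                    int n;
                    merged.TryGetValue(pair.Key, out n);
                    merged[pair.Key] = n + pair.Value;
                }
            return merged;
        }

        static Dictionary<string, int> CountChunk(string chunk)
        {
            var local = new Dictionary<string, int>();      // private to this task: no locking needed
            foreach (var code in chunk.Split(new[] { ' ', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries))
            {
                int n;
                local.TryGetValue(code, out n);
                local[code] = n + 1;
            }
            return local;
        }
    }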

VinayC
but make sure you don't split a token
Scoregraphic
Of course - it's a somewhat tricky solution. I'll edit my response to outline it.
VinayC
A: 

At a very basic level, I'd start with a Dictionary<string, int>, string.Split the document on spaces, and keep count via simple parsing of that data.

string.Split is a relatively robust method that (and someone will certainly correct me if I'm wrong) was built to use regular expressions, and it is immensely more complex than what you need for this scenario.

Writing your own split method will likely be a more viable solution than the one in the framework. I suggest using the off-the-shelf version first, as described above, then rewriting your own if you determine that performance is an issue.
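A minimal sketch of that off-the-shelf version (the file name is hypothetical):

    using System;
    using System.Collections.Generic;
    using System.IO;

    class CodeCounter
    {
        static void Main()
        {
            // Read the whole document and split it on whitespace in one go.
            string text = File.ReadAllText("results.txt");   // hypothetical file name
            var counts = new Dictionary<string, int>();
            foreach (var code in text.Split(new[] { ' ', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries))
            {
                int n;
                counts.TryGetValue(code, out n);             // 0 if the code is new
                counts[code] = n + 1;
            }
            foreach (var pair in counts)
                Console.WriteLine(pair.Key + " " + pair.Value);
        }
    }

Note that File.ReadAllText pulls the entire file into memory at once; the streaming variants discussed in other answers avoid that.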

Ian

Ian P
Taking a look at string.Split in Reflector, there's certainly no regex magic - it actually uses pointers to iterate over the string looking for the delimiters. However, you are right that it might be overly complex; the [MSDN](http://msdn.microsoft.com/en-us/library/b873y76a.aspx) page states that it can use a lot of memory and suggests using IndexOf to find the delimiters instead.
Sam
"string.split ... was built to use regular expressions" I would be *stunned* if it was, much more likely that it iterates through the string attempting to match the tokens passed to it. However I have no evidence to back this up.
Binary Worrier
+1  A: 

I agree with the comment by PoweRoy: why not try it out? Maybe there is no problem in practice.

If you do need something else, you could try writing some code that takes a Stream and returns an IEnumerable<string>. It would read characters from its input one at a time - if you need buffering for efficiency, you can always wrap the FileStream you are actually giving this code in a BufferedStream - and check whether each is a space (or possibly an EOL?). If it isn't, it adds the character to a string buffer (perhaps a StringBuilder?); if it is, it yield returns the current string buffer and clears it.

After that you can just foreach over the result of calling this code on the content of the file and you will get the codes from the file one by one.

You could then use some kind of data structure like a Dictionary<string,int> to count the number of occurrences of each code, keeping the code as the key and the count as the value. But this step would be the same if you read the file line by line and used string.Split to split it on spaces.
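A sketch of what that tokenizer could look like (the names are mine, not a standard API):

    using System.Collections.Generic;
    using System.IO;
    using System.Text;

    static class CodeStream
    {
        // Reads one character at a time and yields each space/EOL-delimited code.
        public static IEnumerable<string> ReadCodes(TextReader reader)
        {
            var buffer = new StringBuilder();
            int c;
            while ((c = reader.Read()) != -1)
            {
                if (c == ' ' || c == '\r' || c == '\n')
                {
                    if (buffer.Length > 0)
                    {
                        yield return buffer.ToString();
                        buffer.Length = 0;               // clear the buffer for the next code
                    }
                }
                else
                {
                    buffer.Append((char)c);
                }
            }
            if (buffer.Length > 0)                        // flush the final code
                yield return buffer.ToString();
        }
    }

As suggested above, wrapping the FileStream in a BufferedStream before handing it to the StreamReader keeps the one-character reads cheap.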

peSHIr
A: 

If there are no other restrictions, you have to read through the complete file as you described.

To save the codes and the counts you should use a data structure that allows search and insert in O(log n) time. SortedDictionary will do that in C#.

EDIT:

"What is the best structure to store my dictionary of results? Hashtable, SortedList, HybridDictionary?"

Because sorted order does not seem to be required, a HybridDictionary or a Dictionary will perform better in most cases. SortedList will probably be the slowest solution, because inserts take O(n). You should do some tests with the different implementations if performance is that important.
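For instance, a rough sketch of such a test (the synthetic codes and sizes are assumptions; HybridDictionary is non-generic, so it is omitted here):

    using System;
    using System.Collections.Generic;
    using System.Diagnostics;

    class StructureTest
    {
        // Inserts/updates the same code sequence into a candidate structure and times it.
        static void Time(string name, IDictionary<string, int> map, string[] codes)
        {
            var sw = Stopwatch.StartNew();
            foreach (var code in codes)
            {
                int n;
                map.TryGetValue(code, out n);
                map[code] = n + 1;
            }
            sw.Stop();
            Console.WriteLine(name + ": " + sw.ElapsedMilliseconds + " ms");
        }

        static void Main()
        {
            var rnd = new Random(42);
            var codes = new string[1000000];
            for (int i = 0; i < codes.Length; i++)
                codes[i] = "A" + rnd.Next(0, 10000);     // synthetic codes

            Time("Dictionary", new Dictionary<string, int>(), codes);
            Time("SortedDictionary", new SortedDictionary<string, int>(), codes);
        }
    }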

rudi-moore
I would go with a `HybridDictionary` (http://msdn.microsoft.com/en-us/library/system.collections.specialized.hybriddictionary.aspx), as we don't know (at least not yet) how many elements the collection will hold in the end.
Scoregraphic
You are right. Edited the answer.
rudi-moore
+4  A: 

Your approach looks fine.

  1. Read in line by line
  2. Split each line on spaces
  3. Add a record to the dictionary if it doesn't exist yet; if it does exist, increment its value (see the sketch below)
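A sketch of those three steps (the file name is hypothetical):

    using System;
    using System.Collections.Generic;
    using System.IO;

    class LineCounter
    {
        static void Main()
        {
            var counts = new Dictionary<string, int>();
            using (var reader = new StreamReader("results.txt"))          // hypothetical file name
            {
                string line;
                while ((line = reader.ReadLine()) != null)                // 1. read line by line
                {
                    foreach (var code in line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))  // 2. split on spaces
                    {
                        int n;
                        counts.TryGetValue(code, out n);                  // 3. add new, or increment existing
                        counts[code] = n + 1;
                    }
                }
            }
            foreach (var pair in counts)
                Console.WriteLine(pair.Key + " " + pair.Value);
        }
    }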
Carra
It depends on how long each line is. string.Split can be a bottleneck on long lines.
jgauffin
And if there are no line breaks?
chriszero
+1  A: 

If you want to try something different, you could try using a BinaryReader, reading the stream byte by byte, and increasing a counter by one every time you encounter a space.
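A minimal sketch of that idea (it assumes single-byte ASCII codes and a hypothetical file name; note that counting separators gives the total number of codes, not the count per code):

    using System;
    using System.IO;

    class TokenCounter
    {
        static void Main()
        {
            long total = 0;
            bool inCode = false;
            using (var file = File.OpenRead("results.txt"))               // hypothetical file name
            using (var reader = new BinaryReader(new BufferedStream(file)))
            {
                long length = file.Length;
                for (long i = 0; i < length; i++)
                {
                    byte b = reader.ReadByte();
                    if (b == (byte)' ' || b == (byte)'\n' || b == (byte)'\r')
                        inCode = false;                                   // whitespace ends the current code
                    else if (!inCode)
                    {
                        inCode = true;                                    // first byte of a new code
                        total++;
                    }
                }
            }
            Console.WriteLine("Total codes: " + total);
        }
    }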

horsedrowner
+1  A: 

A hundred thousand records is not that much. I would use a Dictionary<string,int> to store the keys and the counts.

But if you run into memory issues, why not use a database, even an embedded one like SQL Server Compact or SQLite? Create a table with records containing the code and its count.

Keeping the data in memory is fastest for small amounts of data, but when you reach your computer's memory limits, a database will be faster.
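As a sketch of the database route, here is what the counting could look like with SQLite (the Microsoft.Data.Sqlite package and the table layout are my assumptions, not part of the answer):

    using System.Collections.Generic;
    using Microsoft.Data.Sqlite;   // assumed package; System.Data.SQLite would look similar

    class DbCounter
    {
        static void CountToDatabase(IEnumerable<string> codes)
        {
            using (var conn = new SqliteConnection("Data Source=counts.db"))
            {
                conn.Open();
                using (var create = conn.CreateCommand())
                {
                    // One row per code, with its running count.
                    create.CommandText = "CREATE TABLE IF NOT EXISTS counts (code TEXT PRIMARY KEY, cnt INTEGER NOT NULL)";
                    create.ExecuteNonQuery();
                }
                using (var tx = conn.BeginTransaction())
                using (var upsert = conn.CreateCommand())
                {
                    upsert.Transaction = tx;
                    // Insert a new code, or bump its count if it already exists.
                    upsert.CommandText = "INSERT INTO counts (code, cnt) VALUES ($code, 1) " +
                                         "ON CONFLICT (code) DO UPDATE SET cnt = cnt + 1";
                    var p = upsert.Parameters.Add("$code", SqliteType.Text);
                    foreach (var code in codes)
                    {
                        p.Value = code;
                        upsert.ExecuteNonQuery();
                    }
                    tx.Commit();                           // a single transaction keeps the inserts fast
                }
            }
        }
    }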

GvS
A: 
    // Requires: using System; using System.Collections.Generic; using System.Diagnostics;
    static string LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    static string NUMBERS = "1234567890";
    static Random rdGen = new Random();
    static Dictionary<string, int> myDic = new Dictionary<string, int>();

    // Generates `max` random four-character codes (letter, digit, letter, letter,
    // e.g. "A7PS") and counts them in a Dictionary<string, int>, timing the run.
    static void WriteTest(int max)
    {
        myDic = new Dictionary<string, int>();
        Stopwatch sw = new Stopwatch();
        sw.Start();
        for (int i = 0; i < max; i++)
        {
            string code = LETTERS[rdGen.Next(0, 26)].ToString()
                        + NUMBERS[rdGen.Next(0, 10)]
                        + LETTERS[rdGen.Next(0, 26)]
                        + LETTERS[rdGen.Next(0, 26)];
            if (myDic.ContainsKey(code)) myDic[code]++;   // existing code: bump the count
            else myDic[code] = 1;                         // first occurrence
        }
        sw.Stop();
        Console.WriteLine(max + " iterations: " + sw.ElapsedMilliseconds + " ms");
    }

WriteTest(10000000); // Takes 7.5 seconds.

It seems pretty efficient to me.

Jean-Philippe Gire