views: 81 · answers: 3

Hello, I've written some code to count the frequency of each byte in a binary file, using LINQ. The code seems slow when evaluating the LINQ expression, and it seems hard to apply parallelism to this kind of logic. Building the frequency table over a 475MB file took approximately 1 minute.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;

class Program
{
    static void Main(string[] args)
    {
        Stopwatch sw = new Stopwatch();

        sw.Start();
        //File Size 478.668 KB
        byte[] ltext = File.ReadAllBytes(@"D:\Setup.exe");
        sw.Stop();

        Console.WriteLine("Reading File {0}", GetTime(sw));

        sw.Reset();  // clear the elapsed time, otherwise the read time is added to this measurement
        sw.Start();
        Dictionary<byte, int> result = (from i in ltext
                                        group i by i into g
                                        orderby g.Count() descending
                                        select new { Key = g.Key, Freq = g.Count() })
                                       .ToDictionary(x => x.Key, x => x.Freq);
        sw.Stop();
        Console.WriteLine("Generating Freq Table {0}", GetTime(sw));

        foreach (var i in result)
        {
            Console.WriteLine(i);
        }
        Console.WriteLine(result.Count);
        Console.ReadLine();
    }

    static string GetTime(Stopwatch sw)
    {
        TimeSpan ts = sw.Elapsed;
        return String.Format("{0} min {1} sec {2} ms", ts.Minutes, ts.Seconds, ts.Milliseconds);
    }
}

I've tried a non-LINQ solution using a few loops, and the performance is about the same. Please, any advice on optimizing this? Sorry for my bad English.
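
On the parallelism point: a byte histogram actually parallelizes reasonably well if each worker counts into its own private table and the partial tables are merged once at the end. Below is a minimal sketch of that idea, using the `ltext` array from the question and assuming .NET 4's Parallel.ForEach and Partitioner are available (System.Threading.Tasks and System.Collections.Concurrent); it is illustrative, not code from the answers below.

// Merge-at-the-end histogram: one private long[256] per worker.
long[] freq = new long[256];
object gate = new object();
Parallel.ForEach(
    Partitioner.Create(0, ltext.Length),   // split the array into index ranges
    () => new long[256],                   // worker-local histogram
    (range, state, local) =>
    {
        for (int i = range.Item1; i < range.Item2; i++)
            local[ltext[i]]++;             // no locking in the hot loop
        return local;
    },
    local =>
    {
        lock (gate)                        // runs once per worker, not per byte
            for (int i = 0; i < 256; i++)
                freq[i] += local[i];
    });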

+1  A: 

Why not just

int[] freq = new int[256];
foreach (byte b in ltext)
    freq[b]++;

?

Vlad
@Vlad: the result is not frequency-ordered like his LINQ result is, so the time isn't comparable. Also, it should be `int[] freq` not `int freq[]`.
Chris Schmich
@Chris: thank you for the syntactical hint, my C++ background is still noticeable :) However the OP didn't ask for a sorted table, just for a list of frequencies, so my code fulfils the requirement, and does it faster. Anyway, I would expect that replacing `int` with a `struct` containing byte value and its frequency, and sorting the array by frequency value would still be much faster.
Vlad
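A minimal sketch of that struct-and-sort idea, building on the `freq` array above (the type and names are illustrative, not from the answers):

struct ByteFreq
{
    public byte Value;
    public int Count;
}

var table = new ByteFreq[256];
for (int i = 0; i < 256; i++)
    table[i] = new ByteFreq { Value = (byte)i, Count = freq[i] };

// Sort descending by count; each byte value travels with its frequency.
Array.Sort(table, (a, b) => b.Count.CompareTo(a.Count));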
@Vlad: agreed that using a `struct` would probably still be faster than the LINQ version, but I would say that frequency-sorted is still a requirement since that's what the original code did. Anyway, +1 to you and Hans since that's what I was thinking anyway :)
Chris Schmich
This is the first time I've used LINQ; it seems I haven't read the Do's and Don'ts section of the book. It looks like LINQ is not suitable for computationally intensive tasks.
raziel
@raziel: LINQ is no silver bullet, it just does what you ask it to do. In the scenarios where its usage is appropriate, LINQ is not so bad.
Vlad
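One concrete example of LINQ doing exactly what it was asked: the original query evaluates g.Count() twice per group, once in the orderby and once in the select, so every group is enumerated twice. A sketch that counts each group only once (still far slower than the plain array, but a fairer version of the query):

Dictionary<byte, int> result = ltext
    .GroupBy(b => b)
    .Select(g => new { g.Key, Freq = g.Count() })  // Count() evaluated once per group
    .OrderByDescending(x => x.Freq)
    .ToDictionary(x => x.Key, x => x.Freq);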
+2  A: 

This took a bit over a second on a 442MB file on my otherwise poky Dell laptop:

        byte[] ltext = File.ReadAllBytes(@"c:\temp\bigfile.bin");
        var freq = new long[256];
        var sw = Stopwatch.StartNew();
        foreach (byte b in ltext) {
            freq[b]++;
        }
        sw.Stop();
        Console.WriteLine(sw.ElapsedMilliseconds);

Very hard to beat the raw perf of an array.

Hans Passant
Ouch, good luck picking the answer mark :)
Hans Passant
@Hans: we've even picked the same variable names :-D
Vlad
@Hans: the result is not frequency-ordered like his LINQ result is, so the time isn't comparable.
Chris Schmich
@Chris: add about 300 microseconds for Array.Sort(freq). It drowns in the noise.
Hans Passant
@Hans: in the array, the index is implicitly the key, so if you sort, you've lost your key -> value mapping. Just sorting the array destroys which byte has which count.
Chris Schmich
Sigh, yes, use a struct. It doesn't change the timing.
Hans Passant
@Chris: Well, replace long by struct { byte b; long freq; }.
Vlad
@Hans: +1, sorry, not trying to give you trouble, just wanted to point out to the OP that it's not as simple as just having an `int[]`.
Chris Schmich
I'm surprised this was marked as the answer. The question indicated a file size of 475MB. The answer shown here loads the whole file into memory! Hans used a file size of 442KB - over a thousand times smaller. What if the file is 2GB? Or 8GB? Even if there's a big enough contiguous area in your virtual memory, loading such a large file will almost certainly cause other memory to be paged out. This will adversely affect the performance of the byte count **and** other apps running on the machine.
Mike Scott
@Mike: oops, typo, fixed. The OP isn't measuring the loading, just the processing.
Hans Passant
@Hans, if the loading takes 10 times longer than the processing, what's the point in optimising the processing? ;-) And if the file is too big, it won't load at all. But I'm glad you used a similar-sized file :-))
Mike Scott
@Mike - 10 minutes is rather a lot to read half a gig. Now that he's down to 1 second processing, yes, little point in trying to find additional improvements.
Hans Passant
@Hans, agreed, 10 minutes is crazy long. However, if the file is big enough, reading it into memory will fail completely. Reading in blocks as I answered below will get him under 10 secs cold/about 1 sec on second run when the file is cached AND work with any file size.
Mike Scott
+2  A: 

The following displays the frequency of bytes in descending order for a 465MB file on my machine in under 9 seconds when built in release mode.

Note, I've made it faster by reading the file in 100,000-byte blocks (you can experiment with this - 16K blocks made no appreciable difference on my machine). The point is to keep the hot inner loop pulling bytes out of an array: calling Stream.ReadByte() is fast, but not nearly as fast as indexing into a buffer.

Also, reading the whole file into memory exerts extreme memory pressure which will hamper performance and will fail completely if the file is large enough.

using System;
using System.Diagnostics;
using System.IO;
using System.Linq;

class Program
{
    static void Main( string[] args )
    {
        Console.WriteLine( "Reading file..." );
        var sw = Stopwatch.StartNew();
        var frequency = new long[ 256 ];
        using ( var input = File.OpenRead( @"c:\Temp\TestFile.dat" ) )
        {
            var buffer = new byte[ 100000 ];
            int bytesRead;
            do
            {
                bytesRead = input.Read( buffer, 0, buffer.Length );
                for ( var i = 0; i < bytesRead; i++ )
                    frequency[ buffer[ i ] ]++;
            } while ( bytesRead == buffer.Length ); // a short (or zero) read means end of file
        }
        Console.WriteLine( "Read file in " + sw.ElapsedMilliseconds + "ms" );

        var result = frequency.Select( ( f, i ) => new ByteFrequency { Byte = i, Frequency = f } )
            .OrderByDescending( x => x.Frequency );
        foreach ( var byteCount in result )
            Console.WriteLine( byteCount.Byte + " " + byteCount.Frequency );
    }

    public class ByteFrequency
    {
        public int Byte { get; set; }
        public long Frequency { get; set; }
    }
}
Mike Scott