views: 80 | answers: 2
I have a file I'm trying to parse, and this is how I'm doing it:

var definitions = new Dictionary<int, string>();

foreach (var line in new RirStatFile("delegated-lacnic-latest.txt"))
{
    for (var i = 0; i < line.Range; i++)
    {
        definitions[line.StartIpAddress + i] = line.Iso3166CountryCode;
    }
}

new RirStatFile(...) returns an IEnumerable<RirStatFileLine> with a .Count of 4,100 RirStatFileLine objects, where each RirStatFileLine has a .Range typically between 32,768 and 1 million.

Running it as shown in the snippet above takes about 45 seconds on this pitiful netbook of mine.

EDIT: Dual-core netbook.

Great place to use the new Parallel task library, right? That's what I thought, so I changed the code to:

var definitions = new ConcurrentDictionary<int, string>();

Parallel.ForEach(new RirStatFile("delegated-lacnic-latest.txt"), line => 
{
    Parallel.For(0, line.Range, i =>
    {
        definitions[line.StartIpAddress + i] = line.Iso3166CountryCode;
    });
});

And guess what? The program takes 200 seconds!

What gives? Obviously there's something going on here that I don't understand. Just for reference, here's RirStatFileLine:

public class RirStatFileLine
{
    public readonly string Iso3166CountryCode;
    public readonly int StartIpAddress;
    public readonly int Range;

    public RirStatFileLine(string line)
    {
        var segments = line.Split('|');

        // Line:         
        //    lacnic|BR|ipv4|143.54.0.0|65536|19900828|assigned
        // Translation:
        //    rir_name|ISO_countryCode|ipVersion|ipAddress|range|dateStamp|blah

        this.Iso3166CountryCode = segments[1];
        this.StartIpAddress =
         BitConverter.ToInt32(IPAddress.Parse(segments[3]).GetAddressBytes(), 0);
        this.Range = int.Parse(segments[4]);
    }
}

And RirStatFile:

public class RirStatFile : IEnumerable<RirStatFileLine>
{
    private const int headerLineLength = 4;

    private readonly IEnumerable<RirStatFileLine> lines;

    public RirStatFile(string filepath)
    {
        this.lines = File.ReadAllLines(filepath)
                         .Skip(RirStatFile.headerLineLength)
                         .Select(line => new RirStatFileLine(line)); 
    }

    public IEnumerator<RirStatFileLine> GetEnumerator()
    {
        return this.lines.GetEnumerator();
    }

    System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
    {
        return this.lines.GetEnumerator();
    }
}
+2  A: 

No surprise here. You are taking some very cheap operation (adding an entry to a dictionary) and wrapping it up in some expensive parallelization code.

You should parallelize computationally expensive code, not trivial code.
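One way to follow that advice (a sketch, reusing the RirStatFile and field names from the question, not code from the original answer) is to keep only the outer Parallel.ForEach, so each work item is a whole line's worth of inserts rather than a single insert:

```csharp
using System.Collections.Concurrent;
using System.Threading.Tasks;

var definitions = new ConcurrentDictionary<int, string>();

// Parallelize only across lines, so each task is a coarse-grained
// chunk of work (tens of thousands of inserts) that can amortize
// the scheduling overhead. The inner per-address loop stays
// sequential: each iteration is far too cheap to be its own task.
Parallel.ForEach(new RirStatFile("delegated-lacnic-latest.txt"), line =>
{
    for (var i = 0; i < line.Range; i++)
    {
        definitions[line.StartIpAddress + i] = line.Iso3166CountryCode;
    }
});
```

Even this version still pays for contention on the shared ConcurrentDictionary, so on a dual-core machine it may not beat the plain sequential loop.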

Also, you use ReadAllLines instead of ReadLines, so there's no opportunity for any processing to overlap with reading the file.

From MSDN: "The ReadLines and ReadAllLines methods differ as follows: When you use ReadLines, you can start enumerating the collection of strings before the whole collection is returned; when you use ReadAllLines, you must wait for the whole array of strings to be returned before you can access the array. Therefore, when you are working with very large files, ReadLines can be more efficient."
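Applied to the RirStatFile constructor from the question, that suggestion might look like this (a sketch, not code from the original answer):

```csharp
public RirStatFile(string filepath)
{
    // File.ReadLines returns a lazy IEnumerable<string>, so each line
    // is parsed as it is read from disk instead of only after the
    // entire file has been loaded into an array.
    this.lines = File.ReadLines(filepath)
                     .Skip(RirStatFile.headerLineLength)
                     .Select(line => new RirStatFileLine(line));
}
```

Note one trade-off: because the sequence is now lazy, the file is re-read (and re-parsed) every time the enumerable is enumerated.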

Hightechrider
A: 

The problem here is that your netbook only has a single CPU/core/hardware thread. Parallelizing won't help at all here.

leppie
Maybe not in this specific case, but in general, even with a single core, parallelizing *can* help by, for example, allowing I/O and computation to proceed in parallel (of course using async I/O calls could also achieve that and would probably be more efficient).
Hightechrider
@Hightechrider: While I agree with your statement, the Parallel library likely doesn't cater for those situations without considerable effort on the part of the author.
leppie