I'm trying to learn a bit more about LINQ by implementing Peter Norvig's spelling corrector in C#.
The first part involves taking a large file of words (about 1 million) and putting it into a dictionary where the key is the word and the value is the number of occurrences.
I'd normally do this like so:
    foreach (var word in allWords)
    {
        if (wordCount.ContainsKey(word))
            wordCount[word]++;
        else
            wordCount.Add(word, 1);
    }
where allWords is an IEnumerable<string>.
In LINQ I'm currently doing it like this:
    var wordCountLINQ = (from word in allWordsLINQ
                         group word by word
                         into groups
                         select groups).ToDictionary(g => g.Key, g => g.Count());
I compare the two dictionaries by looking at all the <key, value> pairs, and they're identical, so both approaches produce the same results.
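A minimal sketch of such a comparison, assuming both results are Dictionary<string, int>. SameCounts is a hypothetical helper name used only for illustration; it is not part of the benchmark code:

```csharp
using System;
using System.Collections.Generic;

public static class DictionaryComparison
{
    // Hypothetical helper: returns true when both dictionaries contain
    // exactly the same words with exactly the same counts.
    public static bool SameCounts(Dictionary<string, int> a, Dictionary<string, int> b)
    {
        // Different sizes means at least one word is missing on one side.
        if (a.Count != b.Count)
            return false;

        foreach (var pair in a)
        {
            int count;
            // Every word in a must exist in b with an equal count.
            if (!b.TryGetValue(pair.Key, out count) || count != pair.Value)
                return false;
        }
        return true;
    }
}
```

Because the sizes are checked first, walking one dictionary and looking each key up in the other is enough to establish equality.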
The foreach loop takes 3.82 secs and the LINQ query takes 4.49 secs.
I'm timing it using the Stopwatch class and I'm running in RELEASE mode. I don't think the performance is bad; I was just wondering if there was a reason for the difference.
Am I doing the LINQ query in an inefficient way or am I missing something?
Update: here's the full benchmark code sample:
    using System;
    using System.Collections.Generic;
    using System.Diagnostics;
    using System.IO;
    using System.Linq;
    using System.Text.RegularExpressions;

    public static void TestCode()
    {
        // File can be downloaded from http://norvig.com/big.txt and consists of about a million words.
        const string fileName = @"path_to_file";
        var allWords = from Match m in Regex.Matches(File.ReadAllText(fileName).ToLower(), "[a-z]+", RegexOptions.Compiled)
                       select m.Value;

        var wordCount = new Dictionary<string, int>();
        var timer = new Stopwatch();
        timer.Start();
        foreach (var word in allWords)
        {
            if (wordCount.ContainsKey(word))
                wordCount[word]++;
            else
                wordCount.Add(word, 1);
        }
        timer.Stop();
        Console.WriteLine("foreach loop took {0:0.00} ms ({1:0.00} secs)\n",
            timer.ElapsedMilliseconds, timer.ElapsedMilliseconds / 1000.0);

        // Make LINQ use a different enumerable (with exactly the same values);
        // if you don't, it suddenly becomes way faster, which I assume is a caching thing??
        var allWordsLINQ = from Match m in Regex.Matches(File.ReadAllText(fileName).ToLower(), "[a-z]+", RegexOptions.Compiled)
                           select m.Value;

        timer.Reset();
        timer.Start();
        var wordCountLINQ = (from word in allWordsLINQ
                             group word by word
                             into groups
                             select groups).ToDictionary(g => g.Key, g => g.Count());
        timer.Stop();
        Console.WriteLine("LINQ took {0:0.00} ms ({1:0.00} secs)\n",
            timer.ElapsedMilliseconds, timer.ElapsedMilliseconds / 1000.0);
    }
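One thing that may affect these timings: as far as I know, Regex.Matches returns a MatchCollection that finds its matches lazily while it is being enumerated, so each timed section probably includes the regex scan as well as the counting. The sketch below is one way to separate the two, by materializing the words with ToList() before the Stopwatch starts. The class and method names (MaterializedBenchmark, CountWithLoop, CountWithLinq) and the short stand-in text are assumptions made for illustration; the real run would read big.txt as above.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text.RegularExpressions;

public static class MaterializedBenchmark
{
    // Hypothetical helper: the same counting loop as in the benchmark.
    public static Dictionary<string, int> CountWithLoop(IEnumerable<string> words)
    {
        var counts = new Dictionary<string, int>();
        foreach (var word in words)
        {
            if (counts.ContainsKey(word))
                counts[word]++;
            else
                counts.Add(word, 1);
        }
        return counts;
    }

    // Hypothetical helper: the same grouping query, written in method syntax.
    public static Dictionary<string, int> CountWithLinq(IEnumerable<string> words)
    {
        return words.GroupBy(w => w).ToDictionary(g => g.Key, g => g.Count());
    }

    public static void Main()
    {
        // Stand-in text (assumption); the real benchmark reads the file instead.
        const string text = "the quick brown fox jumps over the lazy dog the end";

        // ToList() forces all regex matching to finish before any timing starts.
        List<string> words = Regex.Matches(text.ToLower(), "[a-z]+")
                                  .Cast<Match>()
                                  .Select(m => m.Value)
                                  .ToList();

        var timer = Stopwatch.StartNew();
        var loopCounts = CountWithLoop(words);
        timer.Stop();
        Console.WriteLine("foreach loop took {0} ms", timer.ElapsedMilliseconds);

        timer.Restart();
        var linqCounts = CountWithLinq(words);
        timer.Stop();
        Console.WriteLine("LINQ took {0} ms", timer.ElapsedMilliseconds);

        Console.WriteLine("distinct words: {0} / {1}", loopCounts.Count, linqCounts.Count);
    }
}
```

With the list built up front, both approaches iterate the same in-memory data, so any remaining gap should come from the counting itself and not from the enumeration.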