I have a basic C# console application that reads a text file (CSV format) line by line and puts the data into a Hashtable. The first CSV item in each line is the key (an ID number) and the rest of the line is the value. However, I've discovered that my import file has a few duplicate keys that it shouldn't have. When I try to import the file, the application errors out because you can't have duplicate keys in a Hashtable. I want my program to be able to handle this error, though. When I run into a duplicate key, I would like to put that key into an ArrayList and continue importing the rest of the data into the Hashtable. How can I do this in C#?

Here is my code:


private static Hashtable importFile(Hashtable myHashtable, String myFileName) {

        StreamReader sr = new StreamReader(myFileName);
        CSVReader csvReader = new CSVReader();
        ArrayList tempArray = new ArrayList();
        int count = 0;

        while (!sr.EndOfStream)
        {
            String temp = sr.ReadLine();
            if (temp.StartsWith(" "))
            {
                ServMissing.Add(temp);
            }
            else
            {
                tempArray = csvReader.CSVParser(temp);
                Boolean first = true;
                String key = "";
                String value = "";

                foreach (String x in tempArray)
                {
                    if (first)
                    {
                        key = x;
                        first = false;
                    }
                    else
                    {
                        value += x + ",";
                    }
                }
                myHashtable.Add(key, value);
            }
            count++;
        }

        Console.WriteLine("Import Count: " + count);
        return myHashtable;
    }
+3  A: 

A better solution is to call ContainsKey to check whether the key exists before adding it to the hash table. Throwing an exception for this kind of error is a performance hit and doesn't improve the program flow.

Dror Helper
+10  A: 
if (myHashtable.ContainsKey(key))
    duplicates.Add(key);
else
    myHashtable.Add(key, value);
jop
+3  A: 

ContainsKey has a constant O(1) overhead for every item, while catching an Exception incurs a performance hit on JUST the duplicate items.

In most situations, I'd say check for the key, but in this case, it's better to catch the exception.
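
For comparison, the exception-based approach might look like the sketch below. It relies on Hashtable.Add throwing an ArgumentException when the key is already present. The class and method names (DuplicateImport, AddOrRecordDuplicate) are made up for illustration and are not part of the question's code.

```csharp
using System;
using System.Collections;

public static class DuplicateImport
{
    // Hypothetical helper: tries to add the pair and, if the key is already
    // present, records the key in 'duplicates' instead.
    // Returns true when the pair was added.
    public static bool AddOrRecordDuplicate(
        Hashtable table, ArrayList duplicates, string key, string value)
    {
        try
        {
            table.Add(key, value);
            return true;
        }
        catch (ArgumentException)
        {
            // Hashtable.Add throws ArgumentException for an existing key.
            // Note: a null key raises ArgumentNullException, which derives
            // from ArgumentException and would also end up here.
            duplicates.Add(key);
            return false;
        }
    }
}
```

In the import loop, the call to myHashtable.Add(key, value) would become AddOrRecordDuplicate(myHashtable, duplicates, key, value), where duplicates is an ArrayList created before the loop.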

FlySwat
I may be wrong, but I'm pretty sure checking for the presence of an item in a list is O(N), but for a hash, it's O(1).
Matt
You're right, I was thinking of a list for some reason.
FlySwat
+1  A: 

Here is a solution which avoids multiple hits in the secondary list with a small overhead to all insertions:

Dictionary<string, List<string>> dict = new Dictionary<string, List<string>>();

// Insert item
if (!dict.ContainsKey(key))
    dict[key] = new List<string>();
dict[key].Add(value);

You can wrap the dictionary in a type that hides this, or put it in a method or even an extension method on Dictionary.
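
As a rough sketch of the extension-method variant: the name AddToList is hypothetical, and TryGetValue is used here in place of ContainsKey so that each insert does a single lookup.

```csharp
using System.Collections.Generic;

public static class MultiValueDictionaryExtensions
{
    // Hypothetical helper: appends 'value' to the list stored under 'key',
    // creating the list the first time the key is seen.
    public static void AddToList<TKey, TValue>(
        this Dictionary<TKey, List<TValue>> dict, TKey key, TValue value)
    {
        List<TValue> list;
        if (!dict.TryGetValue(key, out list))
        {
            list = new List<TValue>();
            dict[key] = list;
        }
        list.Add(value);
    }
}
```

With this in place, the insert becomes dict.AddToList(key, value), and any key whose list ends up with more than one entry was duplicated in the input.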

Morten Christiansen
And yes, I am aware that multiple hits in the secondary list are _very_ unlikely, but it doesn't hurt to be sure :)
Morten Christiansen
A: 

Thank you all. I ended up using the ContainsKey() method. It takes maybe 30 secs longer, which is fine for my purposes. I'm loading about 1.7 million lines and the program takes about 7 mins total to load up two files, compare them, and write out a few files. It only takes about 2 secs to do the compare and write out the files.

MaxGeek
Try using StringBuilder.Append instead of the string + operator and see if it makes it any faster.
jop
+1  A: 

If you have more than 4 (for example) CSV values per line, it might be worth using a StringBuilder for the value variable as well, since repeated string concatenation is slow.
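
A minimal sketch of that change, assuming the parsed line is an ArrayList as in the question (ValueBuilder and BuildValue are made-up names). It keeps the trailing comma that the original loop produces.

```csharp
using System.Collections;
using System.Text;

public static class ValueBuilder
{
    // Hypothetical helper: joins every field after the first, each followed
    // by a comma, mirroring the 'value += x + ","' loop in the question.
    public static string BuildValue(ArrayList fields)
    {
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i < fields.Count; i++)
        {
            sb.Append(fields[i]).Append(',');
        }
        return sb.ToString();
    }
}
```

The key is still fields[0]. Whether this is measurably faster depends on how many fields each line has, so it is worth timing on the real file.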

woany
+1  A: 

Hmm, 1.7 Million lines? I hesitate to offer this for that kind of load.

Here's one way to do this using LINQ.

CSVReader csvReader = new CSVReader();
List<string> source = new List<string>();
using(StreamReader sr = new StreamReader(myFileName))
{
  while (!sr.EndOfStream)
  {
    source.Add(sr.ReadLine());
  }
}
List<string> ServMissing =
  source
  .Where(s => s.StartsWith(" "))
  .ToList();
//--------------------------------------------------
List<IGrouping<string, string>> groupedSource = 
(
  from s in source
  where !s.StartsWith(" ")
  let parsed = csvReader.CSVParser(s).Cast<string>()   // the question stores this result in an ArrayList, which needs Cast for LINQ
  where parsed.Any()
  let first = parsed.First()
  let rest = String.Join( "," , parsed.Skip(1).ToArray())
  select new {first, rest}
)
.GroupBy(x => x.first, x => x.rest)   //GroupBy(keySelector, elementSelector)
.ToList();
//--------------------------------------------------
List<string> myExtras = new List<string>();
foreach(IGrouping<string, string> g in groupedSource)
{
  myHashtable.Add(g.Key, g.First());
  if (g.Skip(1).Any())
  {
    myExtras.Add(g.Key);
  } 
}
David B