views:

1424

answers:

12

I am trying to compare two large datasets from a SQL query. Right now the SQL query is done externally and the results from each dataset is saved into its own csv file. My little C# console application loads up the two text/csv files and compares them for differences and saves the differences to a text file.

Its a very simple application that just loads all the data from the first file into an arraylist and does a .compare() on the arraylist as each line is read from the second csv file. Then saves the records that don't match.

The application works but I would like to improve the performance. I figure I can greatly improve performance if I can take advantage of the fact that both files are sorted, but I don't know a datatype in C# that keeps order and would allow me to select a specific position. Theres a basic array, but I don't know how many items are going to be in each list. I could have over a million records. Is there a data type available that I should be looking at?

A: 

Well, there are several approaches that would work. You could write your own data structure that did this. Or you can try and use SortedList. You can also return the DataSets in code, and then use .Select() on the table. Granted, you would have to do this on both tables.

MagicKat
MaxGeek could load only the first file into a dataset then iterate through the second file and search just 1 dataset using .Select(). Or did I miss something?
Sam
A contains 1,3,5 B contains 1,2,3,4. If you only compare one direction, i.e. looping through A checking in B, you will not find 2 and 4 in B. You have to loop through B as well to see what it has that A doesn't
MagicKat
Thanks for the clarification MagicKat
Sam
A: 

You can easily use a SortedList to do fast lookups. If the data you are loading is already sorted, insertions into the SortedList should not be slow.

Sam
A: 

If you are looking simply to see if all lines in FileA are included in FileB you could read it in and just compare streams inside a loop.

File 1 Entry1 Entry2 Entry3

File 2 Entry1 Entry3

You could loop through with two counters and find omissions, going line by line through each file and see if you get what you need.

Mitchel Sellers
A: 

Maybe I misunderstand, but the ArrayList will maintain its elements in the same order by which you added them. This means you can compare the two ArrayLists within one pass only - just increment the two scanning indices according to the comparison results.

Arno
A: 

One question I have is have you considered "out-sourcing" your comparison. There are plenty of good diff tools that you could just call out to. I'd be surprised if there wasn't one that let you specify two files and get only the differences. Just a thought.

ShaneC
+1  A: 

System.Collections.Specialized.StringCollection allows you to add a range of values and, using the .IndexOf(string) method, allows you to retrieve the index of that item.

That being said, you could likely just load up a couple of byte[] from a filestream and do byte comparison... don't even worry about loading that stuff into a formal datastructure like StringCollection or string[]; if all you're doing is checking for differences, and you want speed, I would wreckon byte differences are where it's at.

cranley
+2  A: 

If data in both of your CSV files is already sorted and have the same number of records, you could skip the data structure entirely and do in-place analysis.

StreamReader one = new StreamReader("C:\file1.csv");
StreamReader two = new StreamReader("C:\file2.csv");
String lineOne;
String lineTwo;

StreamWriter differences = new StreamWriter("Output.csv");
while (!one.EndOfStream)
{
    lineOne = one.ReadLine();
    lineTwo = two.ReadLine();
    // do your comparison.
    bool areDifferent = true;

    if (areDifferent)
        differences.WriteLine(lineOne + lineTwo);
}

one.Close();
two.Close();
differences.Close();
David Sokol
That's what I've done before on CSV files that are extremely large 10 million records+ and I've achieved satisfactory performance. On the order of less then a minute.
JP
What about if the files have a different number of lines in them?
Garry Shutler
If the CSV files have a key or ID field, you could compare those, then decide to read an additional line from one or the other.
David Sokol
A: 

I think the reason everyone has so many different answers is that you haven't quite got your problem specified well enough to be answered. First off, it depends what kind of differences you want to track. Are you wanting the differences to be output like in a WinDiff where the first file is the "original" and second file is the "modified" so you can list changes as INSERT, UPDATE or DELETE? Do you have a primary key that will allow you to match up two lines as different versions of the same record (when fields other than the primary key are different)? Or is is this some sort of reconciliation where you just want your difference output to say something like "RECORD IN FILE 1 AND NOT FILE 2"?

I think the asnwers to these questions will help everyone to give you a suitable answer to your problem.

skb
A: 

If you have two files that are each a million lines as mentioned in your post, you might be using up a lot of memory. Some of the performance problem might be that you are swapping from disk. If you are simply comparing line 1 of file A to line one of file B, line2 file A -> line 2 file B, etc, I would recommend a technique that does not store so much in memory. You could either read write off of two file streams as a previous commenter posted and write out your results "in real time" as you find them. This would not explicitly store anything in memory. You could also dump chunks of each file into memory, say one thousand lines at a time, into something like a List. This could be fine tuned to meet your needs.

Jason Jackson
+1  A: 

This is an adaptation of David Sokol's code to work with varying number of lines, outputing the lines that are in one file but not the other:

StreamReader one = new StreamReader("C:\file1.csv");
StreamReader two = new StreamReader("C:\file2.csv");
String lineOne;
String lineTwo;
StreamWriter differences = new StreamWriter("Output.csv");
lineOne = one.ReadLine();
lineTwo = two.ReadLine();
while (!one.EndOfStream || !two.EndOfStream)
{
  if(lineOne == lineTwo)
  {
    // lines match, read next line from each and continue
    lineOne = one.ReadLine();
    lineTwo = two.ReadLine();
    continue;
  }
  if(two.EndOfStream || lineOne < lineTwo)
  {
    differences.WriteLine(lineOne);
    lineOne = one.ReadLine();
  }
  if(one.EndOfStream || lineTwo < lineOne)
  {
    differences.WriteLine(lineTwo);
    lineTwo = two.ReadLine();
  }
}

Standard caveat about code written off the top of my head applies -- you may need to special-case running out of lines in one while the other still has lines, but I think this basic approach should do what you're looking for.

Jonathan
A: 

Thanks for the replies so far. I haven't had a chance to try out all your suggestions, but I'll let you know how it goes.

Here are some of the answers to the questions asked:

I don't have an equal number of records in each file. Right now I have 900k in one file and 700k in the other.

I have tried using a diff tool like beyond compare, but it can't handle this situation because of the number of lines involved. Beyond compare tries to match up lines based on the same text, but it fails in this situation and ends up doing more of a line to line comparison and the files aren't equal.

Right now I am outputting 2 files. Each file contains the records not found in the other file. This will probably change later.

What I am doing: I have 2 databases. One we will call MrIBM and the other MrSQL. MrIBM is the master record system. MrSQL is the database for another application. Some specific data from MrIBM is loaded into MrSQL. Now I want to find out if correct data was sent from MrIBM to MrSQL, or if we forgot to send something or if MrSQL has something its not suppose to. So I created 1 query to get the data out of MrIBM that is suppose to be in MrSQL. I created another query in MrSQL to get the data that it has. MrIBM and MrSQL don't have the same tables or fields so two different queries had to be made.

Now I have 2 csv files from these queries. They match up in a format like: itemid,itmdesc,attr1,attr2,attr3,etc

I have 2 goals: 1) find records in one system but not the other, 2) find records where the itemid matchs, but other fields don't.

MaxGeek
A: 

To resolve question #1 I'd recommend looking into creating a hash of each line. That way you can compare hashes quick and easy using a dictionary.

To resolve question #2 one quick and dirty solution would be to use an IDictionary. Using itemId as your first string type and the rest of the line as your second string type. You can then quickly find if an itemId exists and compare the lines. This of course assumes .Net 2.0+

ShaneC