At regular intervals we receive CSV files from an external source that we have little control over. Each file is a complete set of current records; however, any records that have been deleted since the previous file are simply absent. We would like to compare the two files and create a separate file of the deleted records so we can do some additional processing on them.
In an application in another area we have a commercial sort package (CoSort) that does this out of the box, but we don't have access to it here. The volumes aren't that large, though, and it seems like something that standard or free tools should be able to handle quite easily. Ideally this would take the form of a Windows batch file, but Perl or awk solutions would be okay too.
Example input files:
Previous File:
X_KEY,X_NAME,X_ATTRIBUTE
123,Name 123,ATT X
111,Name 111,ATT X
777,Name 777,ATT Y
Incoming File:
X_KEY,X_NAME,X_ATTRIBUTE
777,Name 777,ATT Y
123,Name 123,ATT CHANGED
Resulting File should be at a minimum:
111,Name 111
But if the attributes from the deleted records come through too, that is fine.
So far I have a batch file that uses the freeware CMSort to sort the two files, skipping the header record, to make things easier for some kind of diff process:
REM Sort Previous File, Skip Header
C:\Software\CMSort\cmsort.exe /H=1 x_previous.txt x_previous_sorted.txt
REM Sort Incoming File, Skip Header
C:\Software\CMSort\cmsort.exe /H=1 x_incoming.txt x_incoming_sorted.txt
But the 'compare and show only the records missing from the incoming file' bit is eluding me. Part of the complexity is that numerous attributes can change among the records that remain, so it isn't a pure diff. It feels like a specialized diff command, one that checks only the key field rather than the entire record, but I can't seem to get the syntax right. Ideas? Record counts shouldn't exceed 50k.
Note: If this were SQL and the data were sitting in tables, we could use the EXCEPT operator on the key column, but moving the data into the database is not an option in this case.
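
To make the key-only comparison concrete, here is a rough awk sketch of the kind of thing I have in mind. It rests on several assumptions: the key is the first comma-separated field and never contains a quoted comma, each file has exactly one header line, and an awk implementation (gawk, for example) is available. The function name deleted_records is purely illustrative.

```shell
# Print records from the previous file whose key (field 1) does not
# appear in the incoming file. Hypothetical helper, not a tested tool.
#   $1 = previous file, $2 = incoming file
deleted_records() {
    awk -F, '
        FNR == 1            { next }            # skip the header line of each file
        FILENAME == ARGV[1] { seen[$1]; next }  # incoming file: remember every key
        !($1 in seen)                           # previous file: print if key is gone
    ' "$2" "$1"
}
```

Something like `deleted_records x_previous.txt x_incoming.txt > x_deleted.txt` (the output name is made up) would then give `111,Name 111,ATT X` for the sample files above. Because the incoming keys are held in memory, this approach would not need the files to be sorted first, which should be fine at 50k records.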