views:

1522

answers:

5

So I have two .csv files where the first line in file 1 is:

MPID,Title,Description,Model,Category ID,Category Description,Subcategory ID,Subcategory Description,Manufacturer ID,Manufacturer Description,URL,Manufacturer (Brand) URL,Image URL,AR Price,Price,Ship Price,Stock,Condition

The first line from file 2:

Regular Price,Sale Price,Manufacturer Name,Model Number,Retailer Category,Buy URL,Product Name,Availability,Shipping Cost,Condition,MPID,Image URL,UPC,Description

and then rest of every file is filled with info.

As you can see, both files have a common field called MPID (file 1: col 1, file 2: col 9, where the first col is col 1).

I would like to create a new file which will combine these two files by looking at this column (as in: if there is an MPID that is in both files, then in the new file this MPID will appear with both its row from file 1 and its row from file 2). IF one MPID appears only in one file then it should also go into this combined file.

The files are not sorted in any way.

How do I do this on a debian machine with either a shell script or python?

Thanks.

EDIT: Both files dont have commas other than the ones separating the fields.

A: 

It seems that you're trying to do in a shell script, which is commonly done using SQL server. Is it possible to use SQL for that task? For example, you could import both files into mysql, then create a join, then export it to CSV.

galets
+1  A: 

You'll need to look at the join command in the shell. You will also need to sort the data, and probably lose the first lines. The whole process will fall flat if any of the data contains commas. Or you will need to process the data with a CSV-sensitive process that introduces a different field separator (perhaps control-A) that you can use to split fields unambiguously.

The alternative, using Python, reads the two files into a pair of dictionaries (keyed on the common column(s)) and then use a loop to cover all the elements in the smaller of the two dictionaries, looking for matching values in the other. (This is basic nested loop query processing.)

Jonathan Leffler
join works great; but the input files must be sorted on the key. also it can't read arbitrary csv files. especifically, a comma within a quoted field would shift all fieldnums for that record
Javier
@Javier: agreed - which is why I updated my answer to say as much, even without seeing your comment (which was probably in preparation at the same time I was editing).
Jonathan Leffler
+8  A: 
sort -t , -k index1 file1 > sorted1
sort -t , -k index2 file2 > sorted2
join -t , -1 index1 -2 index2 -a 1 -a 2 sorted1 sorted2
Pierre
beware of quoted commas! neither sort nor join obeys any quotation
Javier
+3  A: 

This is the classical "relational join" problem.

You have several algorithms.

  • Nested Loops. You read from one file to pick a "master" record. You read the entire other file locating all "detail" records that match the master. This is a bad idea.

  • Sort-Merge. You sort each file into a temporary copy based on the common key. You then merge both files by reading from the master and then reading all matching rows from the detail and writing the merged records.

  • Lookup. You read one of the files entirely into a dictionary in memory, indexed by the key field. This can be tricky for the detail file, where you'll have multiple children per key. Then you read the other file and lookup the matching records in the dictionary.

Of these, sort-merge is often the fastest. This is done entirely using the unix sort command.

Lookup Implementation

import csv
import collections

index = collections.defaultdict(list)
file1= open( "someFile", "rb" )
rdr= csv.DictReader( file1 )
for row in rdr:
    index[rdr['MPID']].append( row )
file1.close()

file2= open( "anotherFile", "rb" )
rdr= csv.DictReader( file2 )
for row in rdr:
    print row, index[row['MPID']]
file2.close()
S.Lott
Since Python 2.2, I believe, DictReader doesn't implement __getitem__. This was done for speed reasons. So the code:<code>index[rdr['MPID']].append( row )</code>will fail with a AttributeError: DictReader instance has no attribute '__getitem__'
uman
@uman: The results of a DictReader are a first-class `dict` object. With all of the `dict` methods intact.
S.Lott
@S.Lott: perhaps your code was incorrect. I'm VERY rusty with Python, but trying MPID's example with your code results in the AttributeError. See: http://pastebin.com/VJgSfA3uP.S. Anyone know how to embed formatting in comments?
uman
@uman: Rather than do too much in comments, just open a fresh, new question with the code you don't like.
S.Lott
No-- doing so would disassociate the response from the question that prompted it. I haven't downvoted your response because the rest of it is spot on. But your code's wrong.
uman
@uman: What? You have a problem with the code, post the code with the question. A complete "association" with previous questions is trivial. A question can reference another question. What's the problem?
S.Lott
A: 

You could take a look at my FOSS project CSVfix, which is a stream editor for manipulating CSV files. It supports joins, among its other features, and requires no scripting to use.

anon