We are in the middle of changing our web store platform and we need to import product data from different sources.

We currently have several different csv files from different IT systems/databases, because each system is missing some information. Fortunately the product ids are the same, so it's possible to relate the data using the ids.

We need to merge this data into one big csv file so we can import it into our new e-commerce site.

My question: is there a general approach when you need to merge csv files with related data into one csv file? Are there any applications or tools that help you out?

+2  A: 

Yes. Use Python and the csv module.

Read, create the merged set of results, and write.

import csv

products = {}
with open( "one file.csv", "rb" ) as source:
    rdr = csv.DictReader( source )
    for row in rdr:
        products[row['somekeycolumn']] = row # or maybe some rearrangement

with open( 'another file.csv', 'rb' ) as source:
    rdr = csv.DictReader( source )
    for row in rdr:
        if row['adifferentkey'] in products:
            products[row['adifferentkey']].update( row ) # maybe update?
        else:
            products[row['adifferentkey']] = row # or maybe some rearrangement

with open( 'result.csv', 'wb' ) as target:
    wtr = csv.DictWriter( target, fieldnames=['the','heading','columns'] )
    wtr.writeheader()
    for name in sorted( products ):
        wtr.writerow( products[name] )

Something like that.
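
If the column names differ between the two source files, a variation of the last step (just a sketch, reusing the `products` dict built above) is to derive the header from the union of keys instead of hard-coding it:

# every column that appears in any merged row, in alphabetical order
fieldnames = sorted( set().union( *[ row.keys() for row in products.values() ] ) )

with open( 'result.csv', 'wb' ) as target:
    wtr = csv.DictWriter( target, fieldnames=fieldnames, restval='' )
    wtr.writeheader()
    for name in sorted( products ):
        wtr.writerow( products[name] )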

S.Lott
+2  A: 

Sure, any programming language/tool that is able to manipulate files can do the job (Python/Perl/PHP/Ruby/awk, among others). An example with the *nix tool awk, processing 2 csv files using field 1 as the common key:

awk -F"," 'FNR==NR{
  d[$1]=$0
  next
}
{
  print d[$1],$0
}' file1 file2
ghostdog74
You can't just use `-F,` to process CSV files in the general case. CSV files can have commas and newlines within the fields, which makes awk (which I happen to like) very unsuitable.
ar
Sure, if there are embedded commas and such, using just `-F` is not enough. But there is nothing stopping you from using awk as well; there are awk csv parsers around that do a more or less proper job.
ghostdog74
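
To make the embedded-comma point concrete, here is a small Python sketch (with made-up data) comparing a naive split on "," against the csv module:

import csv

row = 'P100,"Widget, large",9.99'
print(row.split(','))            # naive split: ['P100', '"Widget', ' large"', '9.99']
print(next(csv.reader([row])))   # csv module:  ['P100', 'Widget, large', '9.99']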
+1  A: 

Why not just load each csv file into a common input table and then use SQL to merge into permanent tables from there? Sample input tables:

InputCSVFiles
FileID     int PK auto number/identity
FileName   string
FileHash   string

InputCSVItems
ItemID     int PK auto number/identity
FileID     FK
ProductID
Col1
Col2
...

Just load each file into the tables; you can use the file hash to prevent duplicate file uploads. You can then use SQL to identify duplicates and use SQL to INSERT using SELECTs into the permanent tables as necessary. I'm not sure how many duplicate products there are, but dealing with them in SQL is fairly easy with self joins, COALESCE(), ROW_NUMBER(), GROUP BY, MIN(), etc.:

INSERT INTO Product
        (ProductID, col1, col2, ...)
    SELECT
        ProductID, col1, col2, ...
        FROM InputCSVItems
        ...
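
The load step itself can be sketched in Python with the built-in sqlite3 module when no database server is at hand; the file, table, and column names below are made up:

import csv, sqlite3

conn = sqlite3.connect('staging.db')
conn.execute('CREATE TABLE IF NOT EXISTS InputA (ProductID TEXT, Col1 TEXT)')
conn.execute('CREATE TABLE IF NOT EXISTS InputB (ProductID TEXT, Col2 TEXT)')

def load(path, table, columns):
    # read one csv file and insert its rows into the matching input table
    with open(path) as source:
        rows = [[row[c] for c in columns] for row in csv.DictReader(source)]
    marks = ','.join('?' * len(columns))
    conn.executemany('INSERT INTO %s VALUES (%s)' % (table, marks), rows)

load('file_a.csv', 'InputA', ['ProductID', 'Col1'])
load('file_b.csv', 'InputB', ['ProductID', 'Col2'])

# join the staging tables on the shared product id
merged = conn.execute('''
    SELECT a.ProductID, a.Col1, b.Col2
    FROM InputA a LEFT JOIN InputB b ON a.ProductID = b.ProductID
''').fetchall()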
KM
+2  A: 

If you use R, you can load the csv files and then perform any query you like as if you were working with tables in a database.

For example, you could use a command like the following for each csv

table1 <- read.csv("D:/table1.csv", header=T, sep=",", na.strings=c(".", "NA", "", "?"))

Then load the sqldf package

library(sqldf)

And perform queries

table3 <- sqldf("SELECT X1, X2, X3 FROM table1 JOIN table2 USING(X1)")

Finally, export the results as a new csv

write.csv(table3, file = "D:/temp/table3.csv")
gd047
A: 

I wrote TxtSushi to do SQL selects and some other simple transformations on flat files. For example, you can do something like the following:

tssql -table t1 tbl1.csv -table t2 tbl2.csv \
'select * from t1 inner join t2 on t1.id=t2.id'

There are many example scripts here. It is distributed as a package on http://hackage.haskell.org so installation depends on the Haskell Platform.

Keith