views: 376

answers: 4
In short, I have a 20,000,000-line CSV file with rows of varying lengths. This is due to archaic data loggers and proprietary formats; we get the end result as a CSV file in the following format. My goal is to insert this file into a Postgres database. How can I do the following:

  • Keep the first 8 columns and my last 2 columns, to have a consistent CSV file
  • Add a new column to the CSV file, either at the first or last position.

1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0 img_id.jpg, -50
+5  A: 

Read a row with csv, then:

newrow = row[:8] + row[-2:]

then add your new field and write it out (also with csv).
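
For instance, a minimal sketch of that approach (the file names and the 'new_value' field appended here are placeholders, not anything from the question):

import csv

with open('in.csv', 'rb') as src:
    with open('out.csv', 'wb') as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            newrow = row[:8] + row[-2:]
            newrow.append('new_value')   # or newrow.insert(0, 'new_value') for the first position
            writer.writerow(newrow)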

Ignacio Vazquez-Abrams
+1  A: 

You can open the file as a text file and read the lines one at a time. Are there quoted or escaped commas that don't "split fields"? If not, you can do:

with open('thebigfile.csv', 'r') as thecsv:
    for line in thecsv:
        fields = [f.strip() for f in line.split(',')]
        consist = fields[:8] + fields[-2:] + ['onemore']
        ... use the `consist` list as warranted ...

I suspect that where I have `+ ['onemore']` you may want to "add a column", as you say, with some very different content, but of course I can't guess what it might be.

Don't send each line separately with an insert to the DB -- 20 million inserts would take a long time. Rather, group the "made-consistent" lists, appending them to a temporary list; each time that list's length hits, say, 1000, use `executemany` to add all those entries.
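
A rough sketch of that batching idea, assuming the psycopg2 driver, a placeholder connection string, and a `raw_table` with eleven columns (8 kept + 2 kept + 1 new) -- all assumptions on my part, not details from the question:

import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")   # placeholder connection details
cur = conn.cursor()
sql = "INSERT INTO raw_table VALUES (" + ", ".join(["%s"] * 11) + ")"

batch = []
with open('thebigfile.csv', 'r') as thecsv:
    for line in thecsv:
        fields = [f.strip() for f in line.split(',')]
        batch.append(fields[:8] + fields[-2:] + ['onemore'])   # the made-consistent list
        if len(batch) >= 1000:            # flush every 1000 rows
            cur.executemany(sql, batch)
            batch = []
if batch:                                 # flush the final partial batch
    cur.executemany(sql, batch)
conn.commit()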

Edit: to clarify, I don't recommend using csv to process a file you know is not in "proper" csv format: processing it directly gives you more control (especially as and when you discover other irregularities beyond the varying number of commas per line).

Alex Martelli
There is a lot of great advice in this answer. I don't know why there is a downvote. +1ed
S.Mark
Hmm... it does seem to have the problem that a line with a quoted comma would cause errors.
Charles Merriam
@Charles, sure, that's why I say "If not, ...";-). If you need a more complex "grammar" to parse that "imperfect csv" and extract the fields, you can incrementally apply just as much parsing refinements as needed (the csv module, while good for handling _good_ csv files, just can't have as much breadth of workarounds for imperfections in the incoming data).
Alex Martelli
True, it does. I'd probably look at SnapLogic were this not a one-time import. You did put in all the disclaimers; I just doubt 20MLOD would not have a single quoted string.
Charles Merriam
+1  A: 

I would recommend using the csv module. Here's some code based on CSV processing I've done elsewhere:

from __future__ import with_statement
import csv
import sys

def process( reader, writer ):
    for row in reader:
        data = row[:8] + row[-2:]   # keep the first eight and last two fields
        writer.writerow( data )

def main( infilename, outfilename ):
    with open( infilename, 'rU' ) as infile:
        reader = csv.reader( infile )
        with open( outfilename, 'wb' ) as outfile:   # 'wb' avoids extra blank lines on Windows
            writer = csv.writer( outfile )
            process( reader, writer )

if __name__ == '__main__':
    if len(sys.argv) != 3:
        print "syntax: python process.py filename outname"
        sys.exit(1)
    main( sys.argv[1], sys.argv[2] )
Seth Johnson
Out of curiosity, why did you use `with` instead of `for`?
dassouki
@dassouki: The `for` is in `process()`.
Ignacio Vazquez-Abrams
+1  A: 

Sorry, you will need to write some code with this one. When you have a huge file like this, it's worth checking all of it to be sure it's consistent with what you expect. If you let the unhappy data into your database, you will never get all of it out.

Remember oddities about CSV: it's a mishmash of a bunch of similar standards with different rules about quoting, escaping, null characters, unicode, empty fields (",,,"), multi-line inputs, and blank lines. The csv module has 'dialects' and options, and you might find the csv.Sniffer class helpful.
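
If you do stick with the csv module, here is a small sketch of letting csv.Sniffer guess the dialect first (the file name is a placeholder):

import csv

with open('thebigfile.csv', 'rb') as f:
    sample = f.read(4096)                  # a few KB is plenty for sniffing
    dialect = csv.Sniffer().sniff(sample)
    f.seek(0)
    reader = csv.reader(f, dialect)
    print reader.next()                    # first row, parsed with the guessed dialect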

I recommend you:

  • run a 'tail' command to look at the last few lines.
  • if it appears well behaved, run the whole file through the csv reader to see where it breaks. Make a quick histogram of "fields per line" (see the sketch after this list).
  • Think about "valid" ranges and character types and rigorously check them as you read. Especially watch for unusual unicode or characters outside of the printable range.
  • Seriously consider if you want to keep the extra, odd-ball values in a "rest of the line" text field.
  • Toss any unexpected lines into an exception file.
  • Fix up your code to handle the new pattern in the exceptions file. Rinse. Repeat.
  • Finally, run the whole thing again, actually dumping data into the database.
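
Here is the kind of quick "fields per line" histogram I mean, assuming for the moment that every comma really is a separator (the file name is a placeholder):

import csv
from collections import defaultdict

histogram = defaultdict(int)              # fields-per-line -> number of lines
with open('thebigfile.csv', 'rb') as f:
    for fields in csv.reader(f):
        histogram[len(fields)] += 1

for width in sorted(histogram):
    print width, histogram[width]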

Your development will be faster if you don't touch a database until you are completely done. Also, be advised that SQLite is blazingly fast on read-only data, so Postgres might not be the best solution.
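
If you do try SQLite first, the standard-library sqlite3 module can take your cleaned-up rows with a single executemany; the database file, table name, and three-column layout here are placeholders:

import sqlite3

conn = sqlite3.connect('scratch.db')
conn.execute("CREATE TABLE IF NOT EXISTS raw_table (a, b, c)")
rows = [('1', 'img_id.jpg', '-50'),        # your made-consistent lists go here
        ('2', 'img_id.jpg', '-50')]
conn.executemany("INSERT INTO raw_table VALUES (?, ?, ?)", rows)
conn.commit()
conn.close()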

Your final code will probably look like this, but I can't be sure without knowing your data, especially how 'well behaved' it is:

reader = csv.reader(open('thebigfile.csv', 'rb'))
done = False
while not done:
    out = []
    for chunk in range(1000):
        try:
            fields = reader.next()
        except StopIteration:
            done = True
            break
        except csv.Error:
            print str(reader.line_num) + ", 'failed to parse'"
            continue
        try:
            assert len(fields) > 5 and len(fields) < 12
            assert int(fields[3]) > 0 and int(fields[3]) < 999999
            assert int(fields[4]) >= 1 and int(fields[4]) <= 12  # date
            assert fields[5] == fields[5].strip()  # no extra whitespace
            assert not fields[5].strip(printable_chars)  # no odd chars
            ...
        except AssertionError:
            print str(reader.line_num) + ", 'failed checks'"
            continue
        new_rec = [reader.line_num]  # new first item
        new_rec.extend(fields[:8])   # first eight
        new_rec.extend(fields[-2:])  # last two
        new_rec.append(",".join(fields[8:-2]))  # and the rest
        out.append(new_rec)
    if out and database:
        cursor.executemany("INSERT INTO raw_table VALUES (%s, ...)", out)

Of course, your mileage may vary with this code. It's a first draft of pseudo-code. Expect writing solid code for the input to take most of a day.

Charles Merriam