views:

446

answers:

6

I have some lines in a CSV file like this:

1000001234,Account Name,0,0,"3,711.32",0,0,"18,629.64","22,340.96",COD,"20,000.00",Some string,Some string 2

If you notice, some numbers are enclosed in " " and has a thousand separator ",". I want to remove the thousand separator and the double quote enclosure. For the qoute enclosure, I'm thinking of using string.replace() but how about the comma inside the quote marks?

What's the best way of doing this in Python?

+2  A: 

You could simply parse the CSV, make the necessary changes and then write it again.

(I haven't tested this code but it should be something like this)

import csv
reader = csv.reader(open('IN.csv', 'r'))
writer = csv.writer(open('OUT.csv', 'w')
for row in reader:
 # do stuff to the row here
 # row is just a list of items
 writer.writerow(row)
Dumb Guy
I can't comment on the other post, but if you replace all the commas, you'll also destroy all the CSV commas and it won't be a CSV file anymore.
Dumb Guy
Definitely the way to go. Use csv module in the standard library.
sjthebat
@Dumb Guy, that's why I wanted to remove the commas inside the quote marks and not anywhere else. Thanks for the tip!
Francis
Other than re-writing the CSV file using csv.writer, is there another way?
Francis
You could make it into comma separated string by using `newrow = ','.join(row)`, which should work if you've removed all the commas in the strings.
Dumb Guy
@Francis: You have a file in XXX format. You read it using XXX.reader(). You transform some fields. You need the output in YYY format, so you write it with YYY.writer() -- for **any** values of XXX and YYY. In the case of XXX == YYY, reading/writing in any other fashion (under the (usually mistaken) assumption that you know all the nuances of XXX protocol and so will the poor sod who has to maintain your code) is a step towards the dark side. Don't do it.
John Machin
+1  A: 

If all you want is to remove double quotes and commas from a string, a couple of replaces will do it:

s = s.replace('"','').replace(',','')

A faster way is to use s.translate, but that requires a minimum of preparation:

import string
identity = string.maketrans('', '')

...

s = s.translate(identity, '",')

This removes any occurrence of double quotes or commas, and does it pretty fast too. In general, the .translate method of string objects is the best way to remove certain kinds of characters from a string (as well as possibly performing some character-to-character translation, but, by using a translate table such as the identity one I show here, the translation part may in fact be easily bypassed). Note that .translate works a bit differently for Unicode objects (and therefore for Python 3 strings, too) -- I'm giving the approach that's suitable for plain Python 2 string objects.

Alex Martelli
But this would also remove the occurence of the commas outside of the quote marks, right?
Francis
@Francis, yes, it removes every comma within a field (use the `csv` module to parse the line into fields -- the comma-removal from individual fields is a follow-on step).
Alex Martelli
+1  A: 

Here is a bit of regular expression fiddling that will do the trick:

>>> import re
>>> p = re.compile('["]([^"]*)["]')
>>> x = """1000001234,Account Name,0,0,"3,711.32",0,0,"18,629.64","22,340.96",COD,"20,000.00",Some string,Some string 2"""
>>> p.sub(lambda m: m.groups()[0].replace(',',''), x)
'1000001234,Account Name,0,0,3711.32,0,0,18629.64,22340.96,COD,20000.00,Some string,Some string 2'

Removes the commas from the parts of the string that is between pairs of quotes.

Joel
+1  A: 

Hi,

Here is something I just tested, you may not need pprint, I just want to use for clear output.

test.csv

1000001234,Account Name,0,0,"3,711.32",0,0,"18,629.64","22,340.96",COD,"20,000.00",Some string,Some string 2
1000001234,Account Name,0,0,"3,711.32",0,0,"18,629.64","22,340.96",COD,"20,000.00",Some string,Some string 2

Code, use csv reader, and pass each item to parseNum function to check valid digit or not.

from pprint import pprint
import csv

def parseNum(x):
    xx=x.replace(",","")
    if not xx.replace(".","").isdigit(): return x
    return "." in xx and float(xx) or int(xx)

x=[map(parseNum,line) for line in csv.reader(open("test.csv"))]

pprint(x)

Output

[[1000001234,
  'Account Name',
  0,
  0,
  3711.3200000000002,
  0,
  0,
  18629.639999999999,
  22340.959999999999,
  'COD',
  20000.0,
  'Some string',
  'Some string 2'],
 [1000001234,
  'Account Name',
  0,
  0,
  3711.3200000000002,
  0,
  0,
  18629.639999999999,
  22340.959999999999,
  'COD',
  20000.0,
  'Some string',
  'Some string 2']]

Note: If you need good precision on float numbers, replace float with Decimal

S.Mark
+1  A: 

Use the csv module. It has all sorts of constants and parameters to help you set the delimiters, quotes, and everything else for the type of file you are working with. It even has a Sniffer that can help you identify the csv format of the file. In fact this is the only module I have found that can properly and easily work with csv files.

http://docs.python.org/library/csv.html

ecounysis
+1  A: 

You should absolutely use the csv module. If you use a csv.reader, you only have one very small problem: testing fields to see if they're numbers, and stripping commas if they are. I've packaged it as a generator:

import csv

def read_and_fix_numbers(f):
    """Iterate over a file object that returns CSV data, stripping commas out of numbers."""
    for row in csv.reader(f):
        for field in row:
            try:
                x = float(field)
                field.replace(",", "")
            except ValueError:
                pass
            fixed.append(field)
        yield fixed

Usage:

>>> data = '1000001234,Account Name,0,0,"3,711.32",0,0,"18,629.64","22,340.96",COD,"20,000.00",Some string,Some string 2'
>>> import StringIO
>>> f = StringIO.StringIO(data)
>>> for row in read_and_fix_numbers(f):
        print row
['1000001234', 'Account Name', '0', '0', '3711.32', '0', '0', '18629.64', '22340.96', 'COD', '20000.00', 'Some string', 'Some string 2']
Robert Rossney