I have a text file (.txt) that could be either tab-separated or pipe-separated, and I need to convert it to CSV. I am using Python 2.6. Can anyone suggest how to identify the delimiter in the text file, read the data, and then write it out as a comma-separated file?

Thanks in advance

A: 

Like this

from __future__ import with_statement 
import csv
import re
with open( input, "r" ) as source:
    with open( output, "wb" ) as destination:
        writer= csv.writer( destination )
        for line in source:
            writer.writerow( re.split( '[\t|]', line.rstrip( '\r\n' ) ) )
S.Lott
-1 for `for line in input` (`input` is a file path), -1 for not mentioning quoting/escaping the delimiter etc., and -1 for having spaces in dopey places in your code.
John Machin
Quoting/escaping is a non-issue. The input (tab or pipe) can't be meaningfully quoted. Perhaps escaped, but that's rare. The CSV quoting/escaping is handled by `csv`.
S.Lott
About the "dopey places" for whitespace: after 30 years of programming in dozens of languages, there are parts of PEP-8 that I'm just not going to follow.
S.Lott
A: 
for line in open("file"):
    line=line.strip()
    if "|" in line:
        print ','.join(line.split("|"))
    else:
        print ','.join(line.split("\t"))
ghostdog74
-1 for split(). -1 for strip(); presumably the strip() is intended to get rid of a newline; it will do that PLUS it will remove all other trailing whitespace (including tabs); use `line.strip('\n')` instead. -1 for not mentioning quoting/escaping the delimiter (and the quote/escape-char).
John Machin
+6  A: 

I fear that you can't identify the delimiter without knowing what it is. The problem with CSV is that, quoting ESR:

the Microsoft version of CSV is a textbook example of how not to design a textual file format.

The delimiter needs to be escaped in some way if it can appear in fields. Without knowing how the escaping is done, identifying the delimiter automatically is difficult. Escaping could be done the UNIX way, using a backslash '\', or the Microsoft way, using quotes, which must then be escaped too. This is not a trivial task.
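To make the two escaping conventions concrete, here is a minimal sketch using the csv module. The sample lines and the `parse` helper are made up for illustration, and the code assumes a current Python 3 (on Python 2.6 the csv module works on byte strings instead):

```python
import csv
import io

# Two hypothetical sample lines carrying the same three fields: a|b, c, d
backslash_style = 'a\\|b|c|d\n'   # delimiter escaped with a backslash
quote_style = '"a|b"|c|d\n'       # field wrapped in double quotes

def parse(text, **fmt):
    """Parse pipe-delimited text with the given csv formatting options."""
    return list(csv.reader(io.StringIO(text), delimiter='|', **fmt))

# Backslash escaping: turn quoting off and declare the escape character.
unix_rows = parse(backslash_style, quoting=csv.QUOTE_NONE, escapechar='\\')

# Quote escaping: the csv defaults (quotechar='"') already handle it.
quoted_rows = parse(quote_style)

# A plain split ignores both conventions and yields four pieces, not three.
naive = quote_style.rstrip('\n').split('|')
```

Both readers recover the field `a|b` intact, but only because each was told which convention the file uses; that is exactly the information you need from whoever produces the file.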

So my suggestion is to get full documentation from whoever generates the file you want to convert. Then you can use one of the approaches suggested in the other answers or some variant.

Edit:

Python provides csv.Sniffer that can help you deduce the format of your DSV. If your input looks like this (note the quoted delimiter in the first field of the second row):

a|b|c
"a|b"|c|d
foo|"bar|baz"|qux

You can do this:

import csv

csvfile = open("csvfile.csv")
dialect = csv.Sniffer().sniff(csvfile.read(1024))
csvfile.seek(0)

reader = csv.DictReader(csvfile, dialect=dialect)
for row in reader:
    print row,
# => {'a': 'a|b', 'c': 'd', 'b': 'c'} {'a': 'foo', 'c': 'qux', 'b': 'bar|baz'}
# write records using other dialect
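The final step (writing the records with another dialect) might look roughly like the sketch below. The helper name `convert_to_csv` and its parameters are illustrative, not part of the csv module. It assumes Python 3 text-mode file objects (on Python 2.6 you would open the files in binary mode), and it restricts the sniffer to the two delimiters the question mentions:

```python
import csv
import io

def convert_to_csv(source, destination, candidates='\t|', sample_size=1024):
    """Copy delimited rows from `source` to `destination` as comma-separated.

    `source` and `destination` are open text-mode file objects, and
    `source` must support seek(). Names and defaults are illustrative.
    """
    sample = source.read(sample_size)
    # Only consider the delimiters we expect (tab or pipe).
    dialect = csv.Sniffer().sniff(sample, delimiters=candidates)
    source.seek(0)
    writer = csv.writer(destination)  # default "excel" dialect uses commas
    for row in csv.reader(source, dialect):
        writer.writerow(row)
    return dialect.delimiter

# Demo on the sample input shown above, using in-memory files.
src = io.StringIO('a|b|c\n"a|b"|c|d\nfoo|"bar|baz"|qux\n')
dst = io.StringIO()
found = convert_to_csv(src, dst)
```

Note that `sniff` raises `csv.Error` when it cannot settle on a delimiter, so real code would want to catch that and fall back to asking the user.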
I don't get the issue. The source is tab or pipe. The output is CSV. What "full documentation" is required?
S.Lott
What happens if tab or pipe are part of a field's content? You must know how the delimiter is escaped to handle this. Just splitting lines on the delimiter is not enough.
Thanks a lot for the answer
A: 

I would suggest taking some of the example code from the existing answers, or perhaps better, using Python's csv module, and changing it to first assume tab-separated input, then pipe-separated input, producing two comma-separated output files. Then examine both files visually and pick the one you want.

If you actually have lots of files, then you need to try to find a way to detect which file is which.
One of the examples has this:

if "|" in line:

This may be enough: if the first line of a file contains a pipe, then maybe the whole file is pipe separated, else assume a tab separated file.

Alternatively fix the file to contain a key field in the first line which is easily identified - or maybe the first line contains column headers which can be detected.
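A minimal sketch of the first-line check described above; the function name `delimiter_from_first_line` is made up for illustration, and it assumes the first line is representative of the whole file:

```python
import io

def delimiter_from_first_line(lines):
    """Guess the delimiter from the first line of an iterable of lines.

    Returns '|' if the first line contains a pipe, otherwise assumes tab.
    """
    first_line = next(iter(lines), '')
    return '|' if '|' in first_line else '\t'

# Demo with in-memory files standing in for real ones.
pipe_guess = delimiter_from_first_line(io.StringIO('id|name\n1|ann\n'))
tab_guess = delimiter_from_first_line(io.StringIO('id\tname\n1\tann\n'))
```

Reading the first line consumes it, so with a real file you would call `seek(0)` (or reopen the file) before parsing the rows.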

quamrana
A: 

Your strategy could be the following:

  • parse the file with BOTH a tab-separated csv reader and a pipe-separated csv reader
  • calculate some statistics on the resulting rows to decide which result set to write. One idea is to count the total number of fields in each record set (expecting that tab and pipe are not common in the data). Another (if your data is strongly structured and you expect the same number of fields on each line) is to measure the standard deviation of the number of fields per line and take the record set with the smallest standard deviation.

The following example uses the simpler statistic (total number of fields):

import csv

piperows = []
tabrows = []

# parse with the pipe delimiter
f = open("file", "rb")
readerpipe = csv.reader(f, delimiter="|")
for row in readerpipe:
    piperows.append(row)
f.close()

# parse with the tab delimiter
f = open("file", "rb")
readertab = csv.reader(f, delimiter="\t")
for row in readertab:
    tabrows.append(row)
f.close()

# In this example the total number of fields is the indicator
# (not guaranteed to work; it depends on the nature of your data).
totfieldspipe = sum([len(row) for row in piperows])
totfieldstab = sum([len(row) for row in tabrows])

if totfieldspipe > totfieldstab:
    yourrows = piperows
else:
    yourrows = tabrows

# yourrows now contains the rows; write them out in any format you like
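
The second statistic (standard deviation of the number of fields per line) could be sketched as follows. The function names and the sample text are made up for illustration, and the code assumes Python 3 strings. One caveat that the sketch handles explicitly: a wrong delimiter often yields exactly one field on every line, which also has zero deviation, so ties are broken in favour of the larger mean field count.

```python
import csv
import io
import math

def field_stats(text, delimiter):
    """Return (standard deviation, mean) of the number of fields per row."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    counts = [len(row) for row in reader]
    if not counts:
        return (float('inf'), 0.0)
    mean = sum(counts) / float(len(counts))
    variance = sum((n - mean) ** 2 for n in counts) / len(counts)
    return (math.sqrt(variance), mean)

def pick_delimiter(text, candidates=('|', '\t')):
    """Pick the candidate with the smallest spread; ties go to more fields."""
    def score(delimiter):
        spread, mean = field_stats(text, delimiter)
        return (spread, -mean)
    return min(candidates, key=score)

# Hypothetical pipe-separated sample with a stray tab inside one field.
sample = 'id|name|note\n1|ann|likes\ttabs\n2|bob|none\n'
best = pick_delimiter(sample)
```

As with the total-field count, this is only a heuristic and depends on how regular your data is.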
Mauro Bianchi