tags:
views: 604
answers: 5
I have a tab-delimited file with over 200 million lines. What's the fastest way in Linux to convert it to a CSV file? The file does have multiple lines of header information, which I'll need to strip out down the road, but the number of header lines is known. I have seen suggestions for sed and gawk, but I wonder if there is a "preferred" choice.

Just to clarify, there are no embedded tabs in this file.

+2  A: 

If all you need to do is translate all tab characters to comma characters, tr is probably the way to go.

The blank space here is a literal tab:

$ echo "hello   world" | tr "\\t" ","
hello,world

Of course, if you have embedded tabs inside string literals in the file, this will incorrectly translate those as well; but embedded literal tabs would be fairly uncommon.
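
Applied to a whole file, a minimal sketch would be (input.tsv and output.csv are just placeholder names):

tr '\t' ',' < input.tsv > output.csv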

Mark Rushakoff
More common are embedded commas in the source, which then require wrapping with quotes. Which is troublesome if there are embedded quotes...
kibibu
Thanks for the `tr` suggestion. How does it compare to `sed` in terms of speed? Suppose you wanted to skip the header, start at line number x, and continue through the rest of the file. Is there a way to implement this with `tr`? (I should also clarify that there are no embedded commas in the file.)
andrewj
@andrewj: `tr` should be much faster, as it's just doing character-by-character replacement instead of regex matching. As for skipping header, the easiest thing is to just process in two chunks - if you know the length, `head -n <length> input > output; tail -n +<length+1> input | tr ... >> output`; if you don't know the length, probably something with `grep -n`...
Jefromi
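Spelled out, that two-chunk approach might look like the following sketch (assuming 5 header lines; the filenames are placeholders):
head -n 5 input.tsv > output.csv
tail -n +6 input.tsv | tr '\t' ',' >> output.csv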
@andrewj: `sed` also supports transliteration (the `y` command), and you can use an address range to skip the header.
ghostdog74
+4  A: 

If you're worried about embedded commas then you'll need to use a slightly more intelligent method. Here's a Python script that takes TSV lines from stdin and writes CSV lines to stdout:

import sys
import csv

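# the csv module handles quoting, so fields containing commas or quotes are escaped correctly in the output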
tabin = csv.reader(sys.stdin, dialect=csv.excel_tab)
commaout = csv.writer(sys.stdout, dialect=csv.excel)
for row in tabin:
  commaout.writerow(row)

Run it from a shell as follows:

python script.py < input.tsv > output.csv
Ignacio Vazquez-Abrams
Unless you know for sure that there are no embedded commas and no embedded tabs, this is a very reliable way to do it. Even though it probably doesn't meet the criteria for being 'the fastest'.
dave
+1  A: 

Assuming you don't want to change the header and you don't have embedded tabs:

$ cat file
header  header  header
one     two     three

$ awk 'NR>1{$1=$1}1' OFS="," file
header  header  header
one,two,three

`NR>1` skips the first line, i.e. the header. You mentioned you know how many lines of header there are, so use the correct number for your own case. With this, you also do not need to call any other external commands; a single awk command does the job.
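
For example, a sketch with the header count passed in as an awk variable (the 3 here is a placeholder for your actual number of header lines):

awk -v hdr=3 'NR>hdr{$1=$1}1' OFS="," file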

Another way, if you have blank columns and you care about preserving them:

awk 'NR>1{gsub("\t",",")}1' file

Using sed:

sed '2,$y/\t/,/' file    # skip the 1-line header and transliterate (same as tr)
ghostdog74
A: 
sed -e 's/"/\\"/g' -e 's/<tab>/","/g' -e 's/^/"/' -e 's/$/"/' infile > outfile

Damn the critics, quote everything, CSV doesn't care.

`<tab>` is the actual tab character; \t didn't work for me. In bash, press Ctrl-V and then the Tab key (^V) to enter it.
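
Alternatively, if your shell is bash, a sketch using ANSI-C quoting avoids typing the literal tab (keeping the same infile/outfile names):

sed -e 's/"/\\"/g' -e $'s/\t/","/g' -e 's/^/"/' -e 's/$/"/' infile > outfile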

Will Hartung
+1  A: 
perl -lpe 's/"/""/g; s/^|$/"/g; s/\t/","/g' < input.tab > output.csv

The three substitutions double any embedded quotes, wrap the start and end of each line in quotes, and turn each tab into a `","` separator. Perl is generally faster at this sort of thing than sed, awk, or Python.

pabs