I have some data files to import into a database with some "unique" delimiters:

Field Separator (FS): SOH (ASCII character 1)

Record Separator (RS): STX (ASCII character 2) + '\n'

I'd like to import the files into Postgres using the COPY command, but while I can specify a custom field delimiter, it can't handle a custom record separator.
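For context, something along these lines handles the field separator fine (the database, table, and file path here are just placeholders); it's only the record separator I can't express:

psql mydb -c "COPY my_table FROM '/path/to/datafile' WITH DELIMITER E'\001'"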

I can't just strip out the \002 from the data either, because if there is a newline in one of the fields (and there are), it will incorrectly cause COPY to treat it as a new record when in fact it is not.

One important thing to note: it's not important that newlines in fields are preserved; it's fine if they are just converted into spaces.

With this in mind, I was thinking of using something like sed to convert newlines into spaces, then convert \002 into newlines. However, since sed is a line-based tool, it strips the newline from the end of each line before applying the script, so a search/replace can never match it.
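For example, even a trivial newline-to-space substitution is a no-op, because the trailing newline never appears in sed's pattern space:

printf 'one\ntwo\n' | sed 's/\n/ /g'    # output is identical to the input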

Are there any other unix command-line tools that could do the job?

EDIT: I guess what I'm really asking for is a unix utility that can process a file (perform search/replace) as "binary" data, without splitting it up into lines.

+1  A: 

Could you do multiple passes through the file? Pass 1 converts all \002\n to \002\002, say. Pass 2 converts all the remaining \n to spaces. Pass 3 converts all the \002\002 back to \n.

Patrick Cuff
Multiple passes aren't the problem - I can do that just by piping several unix commands together. The problem I'm seeing is that "sed" consumes the data on a line-by-line basis, which effectively hides the final "\n" from the substitution command - it's untouchable.
Marc Novakowski
+2  A: 

Based on the suggestion given by Patrick, I have been able to do it using Perl:

cat file | perl -pe 's/\002\n/\002\002/g' | perl -pe 's/\n/ /g' | perl -pe 's/\002\002/\n/g'
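The reason this works where sed didn't is that perl -p keeps the trailing newline in $_, so the first pass can actually match the \002\n record terminator. The same pipeline broken out, with the input and output file names as placeholders:

# pass 1: hide the real record terminators (\002\n becomes \002\002)
# pass 2: any newline still left was embedded inside a field, so turn it into a space
# pass 3: turn the \002\002 markers back into plain newlines
perl -pe 's/\002\n/\002\002/g' datafile |
    perl -pe 's/\n/ /g' |
    perl -pe 's/\002\002/\n/g' > datafile.clean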

Marc Novakowski