ansaurus

Question

Answer 1

+2 A:

It seems to me from your example that you are cleaning up white space from the beginning and end of pipe (|) delimited fields in a text file. If I were to do this, I would change the algorithm to the following:

for each line
    split the line into an array of fields
    remove the leading and trailing white space
    join the fields back back together as a pipe delimited line handling the empty first field correctly.

I would also use a different language such as Perl or Ruby for this.

The advantage of this approach is that the code that cleans up the lines now handles fewer characters for each invocation and should execute much faster even though more invocations are needed.

David Harris 2009-12-01 20:17:14

+1. With your algorithm `awk` would be a better choice.

mouviciel 2009-12-01 20:29:59

Can you suggest an awk command to do this? Sorry not an awk expert

erotsppa 2009-12-01 21:21:57

I tested this algorithm using awk (along the same lines as the awk programs suggested by D. Williamson and levislevis85), and it was a little slower than the OP's sed script, and quite a bit slower than optimized versions of the sed script. So I'm not convinced that this general approach (of splitting the records into fields before pattern substitution) is likely to result in any speedup (regardless of language).

Dan Moulding 2009-12-02 02:49:53

Answer 2

+2 A:

Try changing the first two lines to:

s/[ \t]*|[ \t]*/|/g

Mark Byers 2009-12-01 20:21:15

My testing has found no difference between this and doing them separately.

Dennis Williamson 2009-12-01 21:18:52

My testing has found this to decrease the time it takes to parse a 250MB file. I also found a further decrease by eliminating the `g` options where it's not needed.

Dan Moulding 2009-12-02 14:53:38

Answer 3

+1 A:

I think the 3rd line doesn't need a g at the end; the end-of-line occurs only once in the line :)

Carl Smotricz 2009-12-01 20:35:22

The last line does not need it either. The beginning of line only occurs once.

Kevin Panko 2009-12-01 22:35:17

Answer 4

A:

Try doing it in one command:

sed 's/[^|]*(|.*|).*/\1/'

static_rtti 2009-12-01 20:39:40

Has no effect at all.

Dennis Williamson 2009-12-01 22:00:02

Can you provide some test data? It's hard to write a regexp correctly without testing :)

static_rtti 2009-12-02 13:50:20

Answer 5

A:

Have you tried Perl? It may be faster.

#!/usr/local/bin/perl -p

s#[\t ]+\|#|#g;
s#\|[\t ]+#|#g;
s#[\t ]*$##;
s#^\|#null|#;

Edit: Actually, it seems to be about three times slower than the sed program. Strange...

Kevin Panko 2009-12-01 21:29:29

Answer 6

+1 A:

You should be able to find how to do it in awk:

See the link below:

http://unixnotes.wordpress.com/2006/03/15/awk-one-liners/

Courtland 2009-12-01 22:02:32

Answer 7

+1 A:

This Perl script should be much much faster

s/\s*|\s*/|/go;
s/\s *$//o;
s/^|/null|/o;

Basically, make sure your regexes are compiled once (the 'o' flag), and no need need to use 'g' on regexes that apply only to end and beginning of line.

Also, [\s\t]* is equivalent to \s*

2009-12-01 22:21:35

The "o" flag is redundant in that context. Perl always compiles a regexp once unless it has a variable inside it.

Kevin Panko 2009-12-01 22:26:18

Answer 8

+1 A:

This might work. I've only tested it a little.

awk  'BEGIN {FS="|"; OFS="|"} {for (i=1; i<=NF; i++) gsub("[ \t]", "", $i); $1=$1; if ( $1 == "" ) $1 = "null"; print}'

Dennis Williamson 2009-12-01 22:23:38

Preliminary testing shows this to execute at about the same speed as the `sed` versions.

Dennis Williamson 2009-12-01 22:35:50

Answer 9

A:

How about Perl:

#!/usr/bin/perl

while(<>) {
 s/\s*\|\s*/|/g;
 s/^\s*//;
 s/\s*$//;
 s/^\|/null|/;
 print;
}

EDIT: changed approach significantly. On my machine this is almost 3x faster than your sed script.

If you really need the best speed possible, write a specialized C program to do this task.

jnylen 2009-12-01 22:52:24

On my machine this is a little slower than the OP's sed script and takes twice as long as a more optimized version of the OP's sed script.

Dan Moulding 2009-12-02 02:58:32

Answer 10

+1 A:

use gawk, not sed.

awk -vFS='|' '{for(i=1;i<=NF;i++) gsub(/ +|\t+/,"",$i)}1' OFS="|"  file

2009-12-01 23:40:52

Answer 11

+1 A:

My testing indicated that sed can become cpu bound pretty easily on something like this. If you have a multi-core machine you can try spawning off multiple sed proc's with a script that looks something like this:

#!/bin/sh
INFILE=data.txt
OUTFILE=fixed.txt
SEDSCRIPT=script.sed
SPLITLIMIT=`wc -l $INFILE | awk '{print $1 / 20}'`

split -d -l $SPLITLIMT $INFILE x_

for chunk in ls x_??
do
  sed -f $SEDSCRIPT $chunk > $chunk.out &
done

wait 

cat x_?? >> output.txt

rm -f x_??
rm -f x_??.out

Drewfer 2009-12-02 01:24:56

Useless use of `ls`. Do this instead: `for chunk in x_??` and globbing is sorted so no need for a loop here: `cat x_??.out > output.txt`

Dennis Williamson 2009-12-02 13:12:11

Edited to inclued Dennis' comments.

Drewfer 2009-12-03 14:35:32

Answer 12

+5 A:

The best I was able to do with sed, was this script:

s/[\s\t]*|[\s\t]*/|/g
s/[\s\t]*$//
s/^|/null|/

In my tests, this ran about 30% faster than your sed script. The increase in performance comes from combining the first two regexen and omitting the "g" flag where it's not needed.

However, 30% faster is only a mild improvement (it should still take about an hour and a half to run the above script on your 1GB data file). I wanted to see if I could do any better.

In the end, no other method I tried (awk, perl, and other approaches with sed) fared any better, except -- of course -- a plain ol' C implementation. As would be expected with C, the code is a bit verbose for posting here, but if you want a program that's likely going to be faster than any other method out there, you may want to take a look at it.

In my tests, the C implementation finishes in about 20% of the time it takes for your sed script. So it might take about 25 minutes or so to run on your Unix server.

I didn't spend much time optimizing the C implementation. No doubt there are a number of places where the algorithm could be improved, but frankly, I don't know if it's possible to shave a significant amount of time beyond what it already achieves. If anything, I think it certainly places an upper limit on what kind of performance you can expect from other methods (sed, awk, perl, python, etc).

Edit: The original version had a minor bug that caused it to possibly print the wrong thing at the end of the output (e.g. could print a "null" that shouldn't be there). I had some time today to take a look at it and fixed that. I also optimized away a call to strlen() that gave it another slight performance boost.

Dan Moulding 2009-12-02 06:26:17

+1 for an impressive amount of work to both implement a faster solution and properly testing it rather than just assuming it will be faster.

Mark Byers 2009-12-02 15:25:37

ansaurus

tags:

views:

answers:

How to make this sed script faster?

related questions