Hi,

I've got a shell script outputting data like this:

1234567890  *
1234567891  *

I need to remove JUST the last three characters " *". I know I can do it via

(whatever) | sed 's/\(.*\).../\1/'

But I DON'T want to use sed for speed purposes. It will always be the same last 3 characters.

Any quick way of cleaning up the output?

A: 

You can use awk to print just the first 'field', provided there won't be any spaces in it (or, if there will be, change the separator; see the sketch below).

I put the lines you had above into a file and did this:

awk '{ print $1 }' < test.txt 
1234567890
1234567891

I don't know if that's any better.
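
If the first field could itself contain spaces, you could set the separator explicitly instead; a sketch, assuming the trailing " *" is always preceded by two spaces:

awk -F '  ' '{ print $1 }' < test.txt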

Shawn D.
Thanks sdeer - but I don't want to use awk either... _really_ need the speed on this one...
RubiCon10
+1  A: 

Both awk and sed are plenty fast, but if you think it matters feel free to use one of the following:

If the characters that you want to delete (space and *) only ever appear at the end of the string

echo '1234567890  *' | tr -d ' *'

If they can appear anywhere within the string and you only want to delete those at the end

echo '1234567890  *' | rev | cut -c 4- | rev

The man pages of all the commands will explain what's going on.

I think you should use sed, though.

majhool
Hi majhool - see comment above to paxdiablo - will tr be faster or the rev/cut/rev combination?
RubiCon10
tr. You could also use the venerable cut -d ' ' -f 1 if the tr line will work for you. Though again, I'd recommend sed, since this will break if you end up with more spaces than you expect.
majhool
A: 

What do you mean, you don't want to use sed/awk for speed purposes? sed/awk are faster than the shell's while read loop for processing files.

$ sed 's/[ \t]*\*$//' file
1234567890
1234567891

$ sed 's/..\*$//' file
1234567890
1234567891

with bash shell

while read -r a b
do
 echo "$a"
done <file
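
If you want to check the claim yourself, time both over the same data (a rough sketch; file stands in for your data):

time sed 's/..\*$//' file > /dev/null
time while read -r a b; do echo "$a"; done < file > /dev/null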
ghostdog74
+6  A: 

I can guarantee you that bash alone won't be any faster than sed for this task. Starting external processes from bash is expensive, but that only hurts if you do it a lot.

So, if you're starting a sed process for every line in your input, I'd be concerned. But you're not. You only need to start one sed which will do all the work for you.
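
To illustrate the difference (a sketch, with file standing in for your script's output):

# slow: starts one sed process per line
while read -r line
do
    echo "$line" | sed 's/...$//'
done < file

# fast: one sed process handles the whole stream
sed 's/...$//' < file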

You may however find that the following sed will be a bit faster than your version:

(whatever) | sed 's/...$//'

All this does is remove the last three characters on each line, rather than substituting the whole line with a shorter version of itself. Maybe more modern RE engines can optimise your command, but why take the risk?

To be honest, about the only way I can think of that would be faster would be to hand-craft your own C-based filter program. And the only reason that may be faster than sed is that you can take advantage of the extra knowledge you have of your processing needs (sed has to allow for generalised processing, so it may be slower because of that).

Don't forget the optimisation mantra: "Measure, don't guess!"


If you really want to do this one line at a time in bash (and I still maintain that it's a bad idea), you can use:

pax> line=123456789abc
pax> line2=${line%%???}
pax> echo ${line2}
123456789
pax> _
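
(A single % would behave identically here, by the way, since ??? is a fixed-length pattern: line2=${line%???}.)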

You may also want to investigate whether you actually need a speed improvement. If you process the lines as one big chunk, you'll see that sed is plenty fast. Type in the following:

#!/usr/bin/bash

echo This is a pretty chunky line with three bad characters at the end.XXX >qq1
for i in 4 16 64 256 1024 4096 16384 65536 ; do
    cat qq1 qq1 >qq2
    cat qq2 qq2 >qq1
done

head -n 20000 qq1 >qq2
wc -l qq2

date
time sed 's/...$//' qq2 >qq1
date
head -n 3 qq1

and run it. Here's the output on my (not very fast at all) R40 laptop:

pax> ./chk.sh
20000 qq2
Sat Jul 24 13:09:15 WAST 2010

real    0m0.851s
user    0m0.781s
sys     0m0.050s
Sat Jul 24 13:09:16 WAST 2010
This is a pretty chunky line with three bad characters at the end.
This is a pretty chunky line with three bad characters at the end.
This is a pretty chunky line with three bad characters at the end.

That's 20,000 lines in under a second, pretty good for something that's only done every hour.

paxdiablo
... actually yes, this is for every line of input, say ~200 lines in one case and ~20000 in another... run every 5 mins for the first and every 60 mins for the second...
RubiCon10
_Why_ are you doing it for every line? If you want to process 200/20K lines, do it once (with one `sed`). That `sed` command will breeze through 20K lines in under a second on my crappy old IBM ThinkPad R40.
paxdiablo
Wow! Nice script! `for i in {1..20000}; do echo "line of text...XXX"; done | time -p sed 's/...$//' >/dev/null` - eliminating `cat` and `head` and at least one file. Even if you added back in some of the diagnostic output, "for i in 4 16 64 256 1024 4096 16384 65536" *really*? This is the same: `for i in x x x x x x x x`
Dennis Williamson
I really don't know why people concentrate on throwaway parts of scripts. The bit being questioned is DEBUG code for setting up test data solely for measuring the speed at the request of the OP, and it doesn't matter one bit whether it's done one way or another. The important bit is the _single line containing the_ `sed` _command!_
paxdiablo
Hi pax - thanks for your input - I did rework the code to generate all the lines at once... and then did speed measurements as you suggested!! And I found that the cut -d statement was faster than sed in this case (it took half the time).
RubiCon10
I should add that although Larry's answer is the one I ended up using, it's your suggestion of actual measurement that made me rework the code and save quite a bit of time with this method! So - thank you! :)
RubiCon10
No problems, @RubiCon10, I still got 50 rep out of it :-) If the other answer is better (faster in your particular case), then you _should_ accept it.
paxdiablo
+1  A: 

Note: This answer is somewhat intended to be a joke, but it actually does work...

#!/bin/bash
outfile="/tmp/$RANDOM"
cfile="$outfile.c"
echo '#include <stdio.h>
int main(void){int e=1,c;while((c=getc(stdin))!=EOF){if(c==10)e=1;if(c==32)e=0;if(e)putc(c,stdout);}}' >> "$cfile"
gcc -o "$outfile" "$cfile"
rm "$cfile"
cat somedata.txt | "$outfile"
rm "$outfile"

You can replace cat somedata.txt with a different command.
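
For example, to filter your own script's output (hypothetical name myscript.sh), replace the cat line with:

./myscript.sh | "$outfile"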

icktoofay
This actually would be quite fast if it didn't compile it each time, I believe.
icktoofay
Replace `gcc` with `tcc` (tinyc) and I'm pretty sure the 'compiling' overhead will be trivial. Especially if you never write a binary but have `tcc` run it immediately.
R..
+1  A: 

Assuming all data is formatted like your example, use 'cut' to get the first column only.

cat "$file" | cut -d ' ' -f 1

or to get the first 10 chars.

cat "$file" | cut -c 1-10
Larry Wang
@Larry, you beat me by 17 seconds ... but we were both beat by majhool's comment 40 minutes earlier
Zac Thompson
@Zac: You're right. I saw that his answer looked clunky, but didn't read the comment underneath. He didn't mention selecting only the first 10 chars per line though.
Larry Wang
+1  A: 

If the script always outputs lines of 10 characters followed by 3 extra (in other words, you just want the first 10 characters), you can use

script | cut -c 1-10

If it outputs an uncertain number of non-space characters, followed by a space and then 2 other extra characters (in other words, you just want the first field), you can use

script | cut -d ' ' -f 1

... as in majhool's comment earlier. Depending on your platform, you may also have colrm, which, again, would work if the lines are a fixed length:

script | colrm 11
Zac Thompson
A: 

Another answer relies on the third-to-last character being a space. This will work with (almost) any character in that position and does it "WITHOUT using sed, or perl, etc.":

while read -r line
do
    echo "${line:0:${#line}-3}"
done
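
To run it over your script's output (script being a stand-in name), pipe into the loop:

script | while read -r line
do
    echo "${line:0:${#line}-3}"
done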

If your lines are fixed length, change the echo to:

echo "${line:0:10}"

or

printf "%.10s\n" "$line"

but each of these is definitely much slower than sed.

Dennis Williamson