I have the following data in a Tab delimited file:

DATA

Col1    Col2     Col3     Col4    Col5
blah1   blah2     blah3   4       someotherText
blahA   blahZ     blahJ   2       someotherText1
blahB   blahT     blahT   7       someotherText2
blahC   blahQ     blahL   10      someotherText3

I want to make sure that the data in the 4th column of this file is always an integer. I know how to do this in Perl:

  • Read each line and store the value of the 4th column in a variable.
  • Check whether that variable is an integer.
  • If it is, continue the loop.
  • Otherwise, break out of the loop with a message saying the file data is not correct.

But how would I do this in a shell script using standard Linux/Unix filters? My guess would be to use grep, but I'm not sure how.

+1  A: 

awk is the tool most naturally suited for parsing by columns:

awk '{if ($4 !~ /^[0-9]+$/) { print "Error! Column 4 is not an integer:"; print $0; exit 1}}' data.txt

As you get more complex with your error detection, you'll probably want to put the awk script in a file and invoke it with awk -f verify.awk data.txt.

Edit: in the form you'd put into verify.awk:

{
    if ($4 !~/^[0-9]+$/)  {
        print "Error! Column 4 is not an integer:"
        print $0
        exit 1
    }
}

Note that I've made awk exit with a non-zero code, so that you can easily check it in your calling script with something like this in bash:

if awk -f verify.awk data.txt; then
     # action for success
else
     # action for failure
fi

You could use grep, but it doesn't inherently recognize columns. You'd be stuck writing patterns to match the columns.
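For instance, a pattern for "the 4th tab-delimited field contains a non-digit" might look something like this (a sketch, assuming GNU grep, whose -P option lets \t denote a tab):

grep -P '^([^\t]*\t){3}[0-9]*[^0-9\t]' data.txt && echo invalid

Note that the header line would match too (Col4 is not an integer), and a completely empty 4th field would slip through.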

Jefromi
Your script quits on the first failure, so it doesn't report subsequent ones. Also, it misses values like 2.2, 2a, and a2.
Dennis Williamson
I thought the *point* was to exit on the first failure. If you don't want it to, set a flag instead of exiting, and add `END {if (error_flag) {exit 1}}` to the end. I edited the regex to fix the other problem. (I was for some reason thinking it was a whole-line match)
Jefromi
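For reference, the flag-based variant described in that comment might look like this as verify.awk (a sketch, reusing the comment's error_flag name):

$4 !~ /^[0-9]+$/ {
    # report the offending line but keep scanning
    print "Error! Column 4 is not an integer on line " NR ":"
    print $0
    error_flag = 1
}
END { if (error_flag) { exit 1 } }

(Add NR > 1 to the pattern if the header line should be skipped.)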
A: 
cut -f 4 filename

will return the fourth field of each line to stdout.

Hopefully that's a good start, because it's been a long time since I had to do any major shell scripting.

R. Bemrose
Yes, you could do something like `cut -f 4 <file> | egrep -v '^[0-9]+$'`. You'd then have to either capture its output (check if there was a non-integer line) or check the exit status of egrep (check out bash's PIPESTATUS variable, or its pipefail option).
Jefromi
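A sketch of that exit-status check, assuming the file is data.txt and the header should be skipped:

tail -n +2 data.txt | cut -f 4 | egrep -v '^[0-9]+$' > /dev/null
if [ "${PIPESTATUS[2]}" -eq 0 ]; then
    echo "file data not correct"    # egrep -v selected a non-integer value
fi

PIPESTATUS must be read immediately after the pipeline, before any other command overwrites it.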
+1  A: 

Edited....

#!/bin/bash

# Return success (0) only if the single argument is a non-empty string of digits.
isdigit ()
{
    [ $# -eq 1 ] || return 1

    case $1 in
        *[!0-9]*|"") return 1;;
        *) return 0;;
    esac
}

while read -r line
do
    col=($line)        # split the line into whitespace-separated fields
    digit=${col[3]}    # 4th column (arrays are 0-indexed)

    if isdigit "$digit"
    then
        echo "hey, we got a digit $digit"
    else
        echo "err, no digit $digit"
    fi
done

Use this in a script foo.sh and run it like ./foo.sh < data.txt

See tldp.org for more info

Steve Lazaridis
This doesn't really address the question. It's for checking if a given string is an integer, not for looking within a file, especially not the fourth column of it.
Jefromi
That's going to be slower than a slow thing. The golden rule is to do as little in shell as possible.
pixelbeat
Not to start a religious war, but IMHO using bash alone is better than using a mix of awk, cut, etc. You don't pay the startup overhead of the shell launching external programs; bash just uses its built-in language this way.
Steve Lazaridis
+6  A: 
cut -f4 data | LANG=C grep -q '[^0-9]' && echo invalid
  • LANG=C for speed
  • -q to quit at the first match in a possibly long file

If you need to strip the first line, use tail -n +2, or you could get hacky and use:

cut -f4 data | LANG=C sed -n '1b;/[^0-9]/{s/.*/invalid/p;q}'
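The tail variant of the grep version would be something like:

tail -n +2 data | cut -f4 | LANG=C grep -q '[^0-9]' && echo invalid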
pixelbeat
This is your best bet if your validation is always going to be nice and simple - less overhead than using awk. If your validation is more complex than a single-column integer check, have a look at my answer. An awk script will be more easily extended.
Jefromi
@pixelbeat: Running it on the data prints invalid. I'm not sure why the one-liner doesn't work. Did you run it on the data?
shubster
It's reporting invalid because the first line needs to be skipped. Also, love the sed hack.
Jefromi
In the first version above (with `grep`), you can change the `-q` to `-n` and it will print the line number and data value as well as the text "invalid".
Dennis Williamson
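That variant would read like this (a sketch; without stripping the header first, the header itself is reported as line 1):

cut -f4 data | LANG=C grep -n '[^0-9]' && echo invalid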
A: 

awk is what you need.

I can't upvote yet, but I would upvote Jefromi's answer if I could.

tobypanzer
awk is definitely not overkill. On the contrary, chaining two or more tools to do a simple task like that is overkill.
ghostdog74
A: 

Mind, this may well not be the most efficient compared to iterating through the file with something like Perl.

tail -n +2 x.x | sort -n -k 4 | head -1 | cut -f 4 | egrep "^[0-9]+$"
if [ $? -eq 0 ]
then
    echo "file is ok";
fi

  • tail -n +2 gives you all but the first line (since your sample has a header).
  • sort -n -k 4 sorts the file numerically on the 4th column; letters rise to the top.
  • head -1 gives you the first line of the sorted output.
  • cut -f 4 gives you the 4th column of that line.
  • egrep "^[0-9]+$" checks whether the value is a number (an integer in this case).

If egrep finds nothing, $? is 1, otherwise it's 0.

There's also:

if [ `tail -n +2 x.x | wc -l` -eq `tail -n +2 x.x | cut -f 4 | egrep "^[0-9]+$" | wc -l` ]
then
    echo "file is ok";
fi

This will be faster than the sort-based version, requiring only two simple scans through the file, but it's not a single pipeline.

Will Hartung
+1  A: 

Sometimes you need it in Bash only, because tr, cut, and awk behave differently on Linux/Solaris/AIX/BSD/etc.:

while read a b c d e; do [[ "$d" =~ ^[0-9]+$ ]] || echo "$a: $d is not a number"; done < data
dz
So, you’re worried about the portability of *tr*, *cut*, and *awk*, and to get around that you’re using *bash*, eh?
Cirno de Bergerac
`tr` and `awk` definitely both have POSIX standards -- not sure about `cut`, though.
Mark Rushakoff
+1  A: 

Pure Bash:

linenum=1; while read line; do field=($line); if ((linenum>1)); then [[ ! ${field[3]} =~ ^[[:digit:]]+$ ]] && echo "FAIL: line number: ${linenum}, value: '${field[3]}' is not an integer"; fi; ((linenum++)); done < data.txt

To stop at the first error, add a break:

linenum=1; while read line; do field=($line); if ((linenum>1)); then [[ ! ${field[3]} =~ ^[[:digit:]]+$ ]] && echo "FAIL: line number: ${linenum}, value: '${field[3]}' is not an integer" && break; fi; ((linenum++)); done < data.txt
Dennis Williamson
Over twice as fast as **pixelbeat's** answer on my system.
Dennis Williamson
...for small files. I did a test on a larger file and **pixelbeat's** answer, well, **beat** mine by a wide margin. However, mine tells you what line the error is on.
Dennis Williamson
@Dennis: I wanted an answer using standard unix/linux filters
shubster
A: 

@OP, use awk

awk '$4 !~ /^[0-9]+$/{print "not ok";exit 1}' file
ghostdog74