I have a large tab-delimited file named CHECKME. Each row has 8 columns, and column 4 holds an integer.

Using Perl or Python, is it possible to verify that every row in CHECKME has 8 columns and that column 4 is an integer?

+4  A: 

In Python:

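def isfileok(filename):
  # 'with' closes the file even when we return early
  with open(filename) as f:
    for line in f:
      # drop the trailing newline so it does not end up in the last field
      pieces = line.rstrip('\n').split('\t')
      if len(pieces) != 8:
        return False
      # note: isdigit() rejects signed values such as "-5"
      if not pieces[3].isdigit():
        return False
  return True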

I assume that by "column no. 4" you mean the 4th one, hence the [3], since Python (like most computer languages) indexes from 0.

Here I'm just returning a boolean result, but I split up the code so it's easy to give good diagnostics about what line is wrong, and how, if you so desire.
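
A minimal sketch of what such diagnostics could look like. It is not part of the original answer: the name check_file and the parameters ncols and int_col are made up for illustration, and it takes any iterable of lines (an open file works) so it can be tried without CHECKME on disk.

```python
def check_file(lines, ncols=8, int_col=3):
    """Yield (line_number, message) for every line that fails a check.

    check_file, ncols and int_col are illustrative names only.
    """
    for lineno, line in enumerate(lines, 1):
        pieces = line.rstrip("\n").split("\t")
        if len(pieces) != ncols:
            yield lineno, "expected %d columns, got %d" % (ncols, len(pieces))
        elif not pieces[int_col].isdigit():
            # Same test as above, so signed values like "-5" are reported too.
            yield lineno, "column %d is not an integer: %r" % (
                int_col + 1, pieces[int_col])


# Three in-memory sample lines: one good, one short, one with a bad column 4.
sample = [
    "a\tb\tc\t4\te\tf\tg\th\n",
    "a\tb\tc\n",
    "a\tb\tc\tx\te\tf\tg\th\n",
]
problems = list(check_file(sample))
for lineno, message in problems:
    print("line %d: %s" % (lineno, message))
```

For a real file, something like list(check_file(open("CHECKME"))) should give the same kind of report; an empty list would mean every line passed.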

Alex Martelli
+2  A: 

In Perl:

while (<>) {
  if (! /^[^\t]+\t[^\t]+\t[^\t]+\t\d+\t[^\t]+\t[^\t]+\t[^\t]+\t[^\t]+$/) {
    die "Line $. is bad: $_";
  }
}

Checks to see that the line starts with one or more non-tabs, followed by a tab, followed by one or more non-tabs, followed by a tab, followed by one or more non-tabs, followed by a tab, followed by one or more digits, etc. until the eighth set of non-tab(s), which must be at the end of the line.

That's the quick and dirty solution, but in the long run it'd probably be better to use "split /\t/", count the number of fields it returns, and then check that field 3 (zero origin) is just digits. That way, when (not if) the requirements change and you now need 9 fields, with the 9th field being a prime number, it's easy to make the change.
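
To make the "easy to change" point concrete, here is a rough sketch of the split-and-check idea, written in Python rather than Perl. Everything named here (line_ok, is_integer, is_prime, RULES_8, RULES_9) is hypothetical, and accepting a leading minus sign is an assumption borrowed from the other answers, not something the question states.

```python
import re

# Hypothetical helper names; none of them come from the answers in this thread.
INTEGER = re.compile(r"-?\d+\Z")   # optional sign, then digits (sign is an assumption)
UNSIGNED = re.compile(r"\d+\Z")    # digits only


def is_integer(text):
    return INTEGER.match(text) is not None


def is_prime(text):
    if UNSIGNED.match(text) is None:
        return False
    n = int(text)
    if n < 2:
        return False
    divisor = 2
    while divisor * divisor <= n:  # trial division is enough for a sketch
        if n % divisor == 0:
            return False
        divisor += 1
    return True


def line_ok(line, ncols, rules):
    """rules maps a zero-based column index to a predicate on the field text."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != ncols:
        return False
    return all(check(fields[index]) for index, check in rules.items())


RULES_8 = {3: is_integer}               # the requirement in the question
RULES_9 = {3: is_integer, 8: is_prime}  # the imagined later requirement

print(line_ok("a\tb\tc\t-4\te\tf\tg\th\n", 8, RULES_8))
print(line_ok("a\tb\tc\t4\te\tf\tg\th\t9\n", 9, RULES_9))
```

With this layout, moving from 8 to 9 fields means changing the column count and adding one entry to the rule table, which is harder to do safely inside one long regex.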

khearn
Wow, that's a really long regex.
brian d foy
[^\t]+\t[^\t]+\t can be replaced with (?:[^\t]+\t){2}, extrapolate for more
Alexandr Ciornii
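
To illustrate the shortening suggested in the comment above, the sketch below compares the long pattern with one possible compact form using Python's re module. The compact pattern is my own extrapolation of the (?:[^\t]+\t){2} hint, not something posted in the thread, and the sample lines are made up.

```python
import re

# The pattern from the Perl answer, unchanged.
LONG = re.compile(
    r"^[^\t]+\t[^\t]+\t[^\t]+\t\d+\t[^\t]+\t[^\t]+\t[^\t]+\t[^\t]+$")

# Assumed compact form: three fields, the digits column, three fields, last field.
SHORT = re.compile(r"^(?:[^\t]+\t){3}\d+\t(?:[^\t]+\t){3}[^\t]+$")

samples = [
    "a\tb\tc\t4\te\tf\tg\th",   # valid
    "a\tb\tc\tx\te\tf\tg\th",   # column 4 is not digits
    "a\tb\tc\t4\te\tf\tg",      # only 7 columns
    "a\t\tc\t4\te\tf\tg\th",    # empty second field
]

results = [(bool(LONG.match(s)), bool(SHORT.match(s))) for s in samples]
for sample, (long_ok, short_ok) in zip(samples, results):
    print("%r long=%s short=%s" % (sample, long_ok, short_ok))
```

Both patterns agree on these samples. Note that either form rejects a line with an empty field, because [^\t]+ needs at least one character, whereas a split-based check that only counts fields would accept it.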
A: 
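for n, line in enumerate(open("filename"), 1):
    # split on tabs only; a bare split() would also break fields on spaces
    fields = line.rstrip("\n").split("\t")
    # check the column count first so fields[3] is always safe to index
    if len(fields) != 8:
        print("line %d error: expected 8 columns" % n)
    elif not fields[3].isdigit():
        print("line %d error: column 4 is not an integer" % n)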
ghostdog74
+8  A: 

In Perl

while(<>) {
    my @F=split/\t/;
    die "Invalid line: $_" if @F!=8 or $F[3]!~/^-?\d+$/;
}
mobrule
Come on now, use some whitespace, there are *Python* guys watching!
Ether
This is beautiful :-)
Rahul
And the first two lines have command-line flags, so this is really a one-liner! Toss in $. to help find the offending line. perl -a -F\\t -ne 'die "line $. invalid: $_" if @F!=8 or $F[3]!~/^-?\d+$/'
oylenshpeegul
+4  A: 

It's very easy work for Perl:

perl -F\\t -ane'die"Invalid!"if@F!=8||$F[3]!~/^-?\d+$/' CHECKME
Hynek -Pichi- Vychodil
+1  A: 

validate-input.py

Read files given on the command-line or stdin. Print invalid lines. Return code is zero if there are no errors or one otherwise.

import fileinput, sys

def error(msg, line):
    # sys.stderr.write works on both Python 2 and Python 3
    sys.stderr.write("%s:%d:%s\terror: %s\n" % (
        fileinput.filename(), fileinput.filelineno(), line, msg))
    error.count += 1
error.count = 0

ncol, icol = 8, 3
for row in (line.split('\t') for line in fileinput.input()):
    if len(row) == ncol:
        try: int(row[icol])
        except ValueError:
            error("%dth field '%s' is not integer" % (
                (icol + 1), row[icol]), '\t'.join(row))
    else:
        error('wrong number of columns (want: %d, got: %d)' % (
            ncol, len(row)), '\t'.join(row))

sys.exit(error.count != 0)

Example

$ echo 1 2 3 | python validate-input.py *.txt -
not_integer.txt:2:a b c 1.1 e f g h
    error: 4th field '1.1' is not integer
wrong_cols.txt:3:a  b 
    error: wrong number of columns (want: 8, got: 3)
<stdin>:1:1 2 3
    error: wrong number of columns (want: 8, got: 1)
J.F. Sebastian