Hey Guys,

I am doing some text processing on a Unix system. I have access to the command line on this machine, and it has Python, Perl and the default text-processing programs (awk etc.) installed.

I have a text file that looks like this:

2029754527851451717 
2029754527851451717 
2029754527851451717 
2029754527851451717 
2029754527851451717 
2029754527851451717 1232453488239 Tue Mar  3 10:47:44 2009
2029754527851451717 1232453488302 Tue Mar  3 10:47:44 2009
2029754527851451717 1232453488365 Tue Mar  3 10:47:44 2009
2895635937120524206 
2895635937120524206 
2895635937120524206 
2895635937120524206 
2895635937120524206 
2895635937120524206 
5622983575622325494 1232453323986 Thu Feb 12 15:57:49 2009

It is basically 3 columns: ID ID Date

I am looking to remove all the lines that do not have 2 IDs and a Date, so the finished result will look like this:

2029754527851451717 1232453488239 Tue Mar  3 10:47:44 2009
2029754527851451717 1232453488302 Tue Mar  3 10:47:44 2009
2029754527851451717 1232453488365 Tue Mar  3 10:47:44 2009
5622983575622325494 1232453323986 Thu Feb 12 15:57:49 2009

How would you guys suggest doing this? In total the text file is around 30,000 lines long.

Cheers

Eef

+1  A: 

With Python:

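import re

# filename is assumed to hold the path of the input file
with open(filename) as f:
    lines = f.readlines()

# A line holding only one ID: digits, then optional trailing whitespace
p = re.compile(r'^\d*\s*$')

for line in lines:
    # Print every line that is not a lone ID
    if not p.match(line):
        print(line, end='')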
kgiannakakis
You probably don't want to read in an entire 30000 line file.
chills42
@chills42: that's dinky. max 60 bytes per line gives 1.7MB.
ysth
+4  A: 
with open(source_filename) as src:
    with open(dest_filename, 'w') as dst:
        for line in src:
            if len(line.split()) > 1:
                dst.write(line)
James Hopkin
+15  A: 

With awk

 awk 'NF > 2' input_file > output_file
Martin Beckett
the equivalent perl being: perl -wane 'print if @F > 2' input_file > output_file
ysth
+3  A: 

With Perl:

perl -ne 'print if /^([0-9]+\s+){2}.+$/' $filename
dsm
A: 
sed '/^[0-9]*$/d' filename

(You might have to modify the pattern if the bad lines have trailing spaces.) You can also use grep -v, which prints only the lines that do not match the pattern.

Steve B.
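As a rough illustration of the trailing-space caveat above, here is a minimal Python sketch. The names ID_ONLY and keep_complete are hypothetical, and the sample lines are copied from the question. It assumes an incomplete line is one that holds only digits followed by optional whitespace.

```python
import re

# Hypothetical pattern: a line made of digits only, optionally followed
# by whitespace (the sample data shows a trailing space after a lone ID).
ID_ONLY = re.compile(r'^[0-9]*\s*$')

def keep_complete(lines):
    """Return the lines that are not a lone ID (or blank)."""
    return [line for line in lines if not ID_ONLY.match(line)]

sample = [
    "2029754527851451717 \n",
    "2029754527851451717 1232453488239 Tue Mar  3 10:47:44 2009\n",
    "2895635937120524206 \n",
    "5622983575622325494 1232453323986 Thu Feb 12 15:57:49 2009\n",
]

# Print only the complete records
for line in keep_complete(sample):
    print(line, end='')
```

Without the \s* part, the lone-ID lines with a trailing space would not match and so would be kept.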
+2  A: 

awk "NF>1" < filename

Carlos A. Ibarra
+7  A: 
grep ':' filename
anon
wow, a genius touch!
fengshaun
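This seems to work because, in the sample data, a colon appears only in the time part of the date, so any line containing one is a complete record. A hedged Python sketch of the same idea, with sample lines copied from the question (the name has_timestamp is made up for illustration):

```python
def has_timestamp(line):
    # Assumption: only complete records carry a time such as 10:47:44,
    # and no other field ever contains a colon.
    return ':' in line

sample = [
    "2029754527851451717 \n",
    "2029754527851451717 1232453488239 Tue Mar  3 10:47:44 2009\n",
    "2895635937120524206 \n",
    "5622983575622325494 1232453323986 Thu Feb 12 15:57:49 2009\n",
]

# Keep only the lines that contain a colon
kept = [line for line in sample if has_timestamp(line)]
print(''.join(kept), end='')
```

If the IDs could ever contain a colon, this shortcut would no longer be safe.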
+2  A: 

Just practicing my Perl here, but this might help as well:

perl -lane 'if (scalar(@F) == 3) { print;}' file >> file.out
hyperboreean
useless use of cat.
hillu
True, edited now. Thanks hillu.
hyperboreean
shouldn't that be `scalar(@F) > 3` ?
dsm
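To see why the field count matters in the comparison above, a small Python sketch can count the whitespace-separated fields of the sample lines from the question. Calling split() with no arguments splits on runs of whitespace, much like awk's default field splitting and Perl's -a switch. The helper name field_count is hypothetical.

```python
def field_count(line):
    # split() with no argument splits on runs of whitespace and
    # ignores leading and trailing whitespace
    return len(line.split())

full = "2029754527851451717 1232453488239 Tue Mar  3 10:47:44 2009\n"
lone = "2029754527851451717 \n"

print(field_count(full))  # 7: two IDs plus five date fields
print(field_count(lone))  # 1: just the ID
```

The date itself splits into five fields, so a complete line has seven fields rather than three. A test of == 3 therefore never matches this data, while > 1, > 2 or > 3 all keep exactly the complete lines.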