I need to parse a file for lines of data that start with a pattern like "Feb 06 2010 15:49:00.017 MCO", where MCO could be any three-letter ID, and return the entire record for each matching line. I think I can handle matching the first part, but returning the rest of the line is where I get lost.

Here is some sample data.

Feb 06 2010 15:49:00.017 MCO -I -I -I -I 0.34 527 0.26 0.24 184 Tentative 0.00 0 Radar Only -RDR- - - - - No 282356N 0811758W - 3-3
Feb 06 2010 15:49:00.017 MLB -I -I -I -I 44.31 3175 -10.05 -10.05 216 Established 0.00 0 Radar Only -RDR- - - - - No 281336N 0812939W - 2-
Feb 06 2010 15:49:00.018 MLB -I -I -I -I 44.31 3175 -10.05 -10.05 216 Established 15.51 99 Radar Only -RDR- - - - - No 281336N 0812939W - 2-
Feb 06 2010 15:49:00.023 QML N856 7437-V -I 62-V 61-V 67.00 3420 -30.93 15.34 534 Established 328.53 129 Reinforced - - - - - - No 283900N 0815325W - -
Feb 06 2010 15:49:00.023 QML N516SP 0723-V -I 22-V 21-V 42.25 3460 -8.19 5.03 146 Established 243.93 83 Beacon Only - - - - - - No 282844N 0812734W - -
Feb 06 2010 15:49:00.023 QML 2247-V -I 145-V 144-V 78.88 3443 -39.68 23.68 676 Established 177.66 368 Reinforced - - - - - - No 284719N 0820325W - -
Feb 06 2010 15:49:00.023 MLB 1200-V -I 15-V 14-V 45.25 3015 -11.32 -20.97 475 Established 349.68 88 Beacon Only - - - - - - No 280239N 0813104W - -
Feb 06 2010 15:49:00.023 MLB 1011-V -I 91-V 90-V 94.50 3264 -56.77 10.21 698 Established 152.28 187 Beacon Only - - - - - - No 283341N 0822244W - -
- - - - - -

A: 

From your sample data, it seems you don't have to check for the three-letter identifier following the date, since it's always there. If that's not a valid assumption, append a pattern for three letters (e.g. ` [A-Z]{3}`) to the end of the regex. Also, add capturing groups as needed if you want to pull out individual parts of the match. Anyway:

import re
dtre = re.compile(r'^(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) [0-9]{2} [0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}\.[0-9]{3}')

[line for line in file if dtre.match(line)]

Here `file` stands for your open file object. Wrap this in a with statement (or whatever you prefer) to open the file, then do any processing you need on the list this builds up.
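A minimal sketch of the list-comprehension approach as a reusable function. The names `RECORD_RE` and `matching_lines` are made up for illustration, and the pattern assumes the ID is always three uppercase ASCII letters, which goes beyond what the regex above checks:

```python
import re

# Timestamp followed by a three-letter ID, e.g. "Feb 06 2010 15:49:00.017 MCO".
# Assumption: the ID is always three uppercase ASCII letters.
RECORD_RE = re.compile(
    r'^(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) '
    r'[0-9]{2} [0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}\.[0-9]{3} [A-Z]{3}\b'
)

def matching_lines(lines):
    """Return each full line (newline stripped) that starts with timestamp + ID."""
    return [line.rstrip('\n') for line in lines if RECORD_RE.match(line)]

# Works on any iterable of lines, including an open file object.
sample = [
    "Feb 06 2010 15:49:00.017 MCO -I -I -I -I 0.34 527\n",
    "- - - - - -\n",
]
records = matching_lines(sample)
```

With a real file you would pass the file object instead of a list, inside something like `with open(path) as f:`. The match only tests the start of each line; the whole line is what ends up in the result.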

Another possibility would be to use a generator expression instead of a list comprehension (replace the outer [ and ] with ( and ) to do so). This is useful if you're writing results out as you go, the file is large, and you don't need all of the matching lines in memory at once. Just be sure not to close the file before you've consumed the entire generator if you go with this approach!
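A rough sketch of the generator variant, streaming matches to an output as they are found. `copy_matching` is a hypothetical helper, and `io.StringIO` objects stand in for real files so the example is self-contained:

```python
import io
import re

STAMP_RE = re.compile(
    r'^(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) '
    r'[0-9]{2} [0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}\.[0-9]{3}'
)

def copy_matching(src, dst):
    """Stream matching lines from src to dst; return how many were written."""
    # Generator expression: nothing is read from src until we iterate.
    matching = (line for line in src if STAMP_RE.match(line))
    count = 0
    for line in matching:
        dst.write(line)
        count += 1
    return count

# In-memory stand-ins for an input log and an output file.
src = io.StringIO(
    "Feb 06 2010 15:49:00.017 MCO -I -I\n"
    "some unrelated line\n"
    "Feb 06 2010 15:49:00.023 QML N856\n"
)
dst = io.StringIO()
written = copy_matching(src, dst)
```

Because the generator is consumed inside the function, both files only need to stay open for the duration of the call.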

Also, you could use datetime's built-in parsing facility:

import datetime

for line in file:
    try:
        # the line[:24] slice assumes the fractional-second part always has
        # exactly three digits (%f itself accepts one to six)
        dt = datetime.datetime.strptime(line[:24], '%b %d %Y %H:%M:%S.%f')
    except ValueError:
        # a ValueError means the beginning of the line isn't parseable as datetime
        continue
    # do something with the line; the datetime is already parsed and stored in dt

That's probably better if you're going to create the datetime.datetime object anyway.
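If the fractional-second part might not always be three digits, one option is to take the first four whitespace-separated fields instead of a fixed-width slice. This is only a sketch: `parse_records` and `STAMP_FORMAT` are made-up names, and `%b` is assumed to match English month abbreviations (the default C locale):

```python
import datetime

STAMP_FORMAT = '%b %d %Y %H:%M:%S.%f'

def parse_records(lines):
    """Yield (datetime, full_line) for lines that begin with a parseable timestamp."""
    for line in lines:
        # Month, day, year and time are the first four fields.
        # Note: split(None, ...) also tolerates leading whitespace.
        fields = line.split(None, 4)
        if len(fields) < 4:
            continue
        try:
            dt = datetime.datetime.strptime(' '.join(fields[:4]), STAMP_FORMAT)
        except ValueError:
            # The start of the line isn't a timestamp in this format.
            continue
        yield dt, line.rstrip('\n')
```

Each yielded pair carries the parsed timestamp together with the untouched line, so the full record is still available for whatever comes next.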

Michał Marczyk
The date will change all the time. The format will remain the same.
Oh, I see. If you want to include lines with different `date` parts in your result set, I guess you do need a regex; will edit one in in a sec.
Michał Marczyk
Well, there it is. I've also added a `datetime`-based approach which may be cleaner, though you'd have to spoil it a little if you needed to allow for variable-length µs parts (which is probably not a problem for you here, since you're dealing with a rigid logfile format).
Michał Marczyk
BTW, look here: http://docs.python.org/library/datetime.html#strftime-behavior for docs on `datetime.datetime.strptime`.
Michał Marczyk
I have this, which matches the start of the line, but I do not know how to get it to return the rest of the line: ([a-zA-Z]{3}\s\d\d\s\d\d\d\d\s\d\d:\d\d\)
Have you tried my code from the answer? The list comprehension in the top code snippet should get you a list of all the lines matching your specification. Entire lines, not just initial fragments matching the regex. In general, if you're matching a regex against a string, it doesn't alter the string in any way, so you can still use it later. (If you successfully match a regex against the string bound to the variable `line`, this doesn't break `line` in any way, so you can still just return it / append it to some list / print it out / whatever.)
Michał Marczyk
A: 

It seems that your date plus the three-character ID are always the first 5 fields (with space as the delimiter). Just go through the file, split each line on spaces, and take the first 5 fields:

s=Split(strLineOfFile," ")
wscript.echo s(0),s(1),s(2),s(3),s(4)

No regex needed.
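The snippet above is VBScript; the same idea in Python might look like the sketch below. `split_record` is a made-up name, and using a `maxsplit` of 5 is my own choice so that the remainder of the line stays in one piece. Note that nothing here validates the date format, so this assumes every line passed in is a record:

```python
def split_record(line):
    """Split a record into (timestamp, ident, rest_of_line)."""
    # At most 5 splits: four timestamp fields, the ID, and everything else.
    fields = line.rstrip('\n').split(' ', 5)
    if len(fields) < 5:
        raise ValueError('expected at least 5 space-separated fields')
    timestamp = ' '.join(fields[:4])
    ident = fields[4]
    rest = fields[5] if len(fields) > 5 else ''
    return timestamp, ident, rest

timestamp, ident, rest = split_record(
    "Feb 06 2010 15:49:00.017 MCO -I -I -I -I 0.34 527\n"
)
```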

ghostdog74