How can I find as many date patterns as possible in a text file using Python? The date pattern is defined as:

dd mmm yyyy
  ^   ^
  |   |
  +---+--- spaces

where:

  • dd is a two-digit number
  • mmm is a three-character English month name (e.g. Jan, Mar, Dec)
  • yyyy is a four-digit year
  • there are two spaces as separators

Thanks!

+3  A: 

Here's a way to find all dates matching your pattern:

re.findall(r'\d\d\s(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s\d{4}', text)

But after WilhelmTell's comment on your question, I'm also wondering whether this is what you were really asking for...
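
For anyone who wants to paste and run it, here is a minimal self-contained sketch of the same idea. The `find_dates` helper and the sample string are made up for illustration. It also uses a literal space instead of `\s` and adds `\b` word boundaries, which are not in the expression above, so that something like `103 Mar 2009` is not picked up.

```python
import re

# Month alternation spelled out so only real English abbreviations match.
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
DATE_RE = re.compile(r"\b\d{2} (?:%s) \d{4}\b" % MONTHS)

def find_dates(text):
    """Return every 'dd mmm yyyy' substring found in text (hypothetical helper)."""
    return DATE_RE.findall(text)

# Hypothetical sample input, not taken from the question.
sample = "Released 03 Mar 2009, patched 17 Dec 2009, not 5 Abc 2009."
print(find_dates(sample))  # ['03 Mar 2009', '17 Dec 2009']
```

Note that this still accepts impossible days such as `99 Jan 2009`; the pattern only checks the shape of the text.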

David Zaslavsky
I want the actual dates. thanks!
ohho
A: 

Try this:

import re

allmatches = re.findall(r'\d\d \w\w\w \d\d\d\d', "string to match")
xyld
Seriously? -1? Any reason other than '\w\w\w' probably isn't that great of a way to match a month? It **IS** what the guy asked for in his 'dd mmm yyyy' syntax. While it's not ideal, I don't understand the downvote.
xyld
+2  A: 

Here's a slightly more complete example. The regexp will match more than just valid date values, so each match is passed to datetime.strptime, which raises a ValueError for anything that is not a valid date. If the date is parsed, you have a full datetime object that gives you access to a lot of functionality.

>>> from datetime import datetime
>>> import re
>>> dates = []
>>> patn = re.compile(r'\d{2} \w{3} \d{4}')
>>> fh = open('inputfile')
>>> for line in fh:
...   for match in patn.findall(line):
...     try:
...       val = datetime.strptime(match, '%d %b %Y')
...       dates.append(val)
...     except ValueError:
...       pass # ignore, this isn't a date
...

I imagine that this can be collapsed into nice tight code with comprehensions if you are so inclined.
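
One possible way to do that collapse, as a rough sketch: `strptime` raises instead of returning a sentinel, so the try/except is moved into a small helper and the comprehension filters out the failures. The names `parse_or_none` and `extract_dates` and the sample lines are assumptions for illustration, and `%b` is assumed to be running under the default English month names.

```python
import re
from datetime import datetime

PATN = re.compile(r"\d{2} \w{3} \d{4}")

def parse_or_none(candidate):
    """Return a datetime for 'dd mmm yyyy' text, or None if it is not a real date."""
    try:
        return datetime.strptime(candidate, "%d %b %Y")
    except ValueError:
        return None

def extract_dates(lines):
    """Collect every valid date from an iterable of lines (e.g. an open file)."""
    parsed = (parse_or_none(m) for line in lines for m in PATN.findall(line))
    return [d for d in parsed if d is not None]

# Hypothetical sample lines standing in for the input file.
sample_lines = ["born 29 Feb 2008, not 30 Feb 2008",
                "see 12 Xyz 1999 and 01 Jan 2000"]
print(extract_dates(sample_lines))
```

With a real file you could call it as `extract_dates(fh)` inside a `with open('inputfile') as fh:` block, which also takes care of closing the file.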

D.Shawley
appreciated! how can I concat 'val's into an array in python?
ohho
Use `list.append()`. I updated the snippet.
D.Shawley
+1  A: 

Use the calendar module to give you a little global awareness:

import calendar
import re

date_expr = r"\d{2} (?:%s) \d{4}" % '|'.join(calendar.month_abbr[1:])
print(date_expr)
print(re.findall(date_expr, source_text))

For me, this creates a date_expr like:

"\d{2} (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4}"

But if I change my locale using the locale module:

import locale
locale.setlocale(locale.LC_ALL, "fr")  # the locale name is platform-dependent

I now search for months in French:

"\d{2} (?:janv.|févr.|mars|avr.|mai|juin|juil.|août|sept.|oct.|nov.|déc.) \d{4}"

Hmm, this is the first time I ever tried French month abbreviations, I may need to do some cleanup:

date_expr = r"\d{2} (?:%s) \d{4}" % '|'.join(
    m.title().rstrip('.') for m in calendar.month_abbr[1:])

Now I get:

"\d{2} (?:Janv|Févr|Mars|Avr|Mai|Juin|Juil|Août|Sept|Oct|Nov|Déc) \d{4}"

And now my script will run for my Gallic friends as well, with really very little trouble.
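
If you would rather keep the trailing dots than strip them, another option is to escape each name before joining, so that `.` matches a literal dot instead of any character. This is only a sketch: `build_date_pattern` is a hypothetical helper, and the French abbreviations are hard-coded (copied from the output above) so the example does not depend on which locales are installed.

```python
import re

def build_date_pattern(month_names):
    """Compile a 'dd <month> yyyy' pattern from an iterable of month names."""
    # re.escape keeps characters such as '.' literal; empty names are skipped.
    alternation = "|".join(re.escape(m) for m in month_names if m)
    return re.compile(r"\d{2} (?:%s) \d{4}" % alternation)

# Hard-coded for illustration instead of reading them from the locale.
french_abbr = ["janv.", "févr.", "mars", "avr.", "mai", "juin",
               "juil.", "août", "sept.", "oct.", "nov.", "déc."]
pattern = build_date_pattern(french_abbr)
print(pattern.findall("Le 14 juil. 1789 et le 11 nov. 1918"))
# ['14 juil. 1789', '11 nov. 1918']
```

Because empty names are skipped, passing `calendar.month_abbr` directly should also work without the `[1:]` slice.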

(You may wonder why I had to slice the month_abbr list from [1:] - this list begins with an empty string in position 0, so that if you use index() to look up a particular month abbreviation, you will get back a number from 1-12, instead of from 0-11.)
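
A quick way to see that offset for yourself, assuming the default English month names are in effect; `month_number` is a made-up helper name:

```python
import calendar

# Copy into a plain list so the usual list methods are available.
abbrs = list(calendar.month_abbr)

def month_number(abbr):
    """Map a month abbreviation to 1-12 using its position in month_abbr."""
    return abbrs.index(abbr)

print(repr(abbrs[0]))       # the empty placeholder in position 0
print(month_number("Mar"))  # position 3, i.e. the third month
```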

-- Paul

Paul McGuire
This is why I prefer to use the RE to validate the basic format (_day month-abbrev year_) and then let `strptime` take care of the localization of the month. If you are really interested, you can use some of the locale-aware options to account for differences in M-D-Y ordering as well.
D.Shawley