Hi

Can anyone suggest a way of finding and parsing dates in Python, in any format ("Aug06", "Aug2006", "August 2 2008", "19th August 2006", "08-06", "01-08-06")?

I came across this question, but it is about Perl: http://stackoverflow.com/questions/3445358/extract-inconsistently-formatted-date-from-string-date-parsing-nlp

Any suggestions would be helpful.

+2  A: 
from dateutil import parser


texts = ["Aug06", "Aug2006", "August 2 2008", "19th August 2006", "08-06", "01-08-06"]
for text in texts:
    print(text, parser.parse(text))


Aug06            2010-08-06 00:00:00
Aug2006          2006-08-28 00:00:00
August 2 2008    2008-08-02 00:00:00
19th August 2006 2006-08-19 00:00:00
08-06            2010-08-06 00:00:00
01-08-06         2006-01-08 00:00:00
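
Two things in this output are easy to miss: fields absent from the input (the year in "Aug06", the day in "Aug2006") are filled in from a default date, which is today unless you pass `default`, and "01-08-06" is read month first unless you pass `dayfirst=True`. The sketch below is one possible workaround, not a dateutil feature: it parses the text twice with two different defaults and reports the fields that changed. The helper name `parse_with_missing_fields` and the two default dates are made up for illustration.

```python
from datetime import datetime
from dateutil import parser

# Two arbitrary defaults that differ in year, month and day.
# Both months have 31 days, so any day the input supplies is valid in either.
DEFAULT_A = datetime(2000, 1, 1)
DEFAULT_B = datetime(2001, 3, 2)

def parse_with_missing_fields(text, **kwargs):
    """Return (parsed datetime, list of fields that came from the default)."""
    a = parser.parse(text, default=DEFAULT_A, **kwargs)
    b = parser.parse(text, default=DEFAULT_B, **kwargs)
    # A field that follows the default was not present in the input.
    missing = [name for name in ("year", "month", "day")
               if getattr(a, name) != getattr(b, name)]
    return a, missing

parsed, missing = parse_with_missing_fields("August 2006")
print(parsed, missing)

# dayfirst=True reads "01-08-06" as 1 August rather than 8 January.
print(parser.parse("01-08-06", dayfirst=True))
```

The comparison cannot tell a genuinely supplied value from one that happens to equal both defaults, which is why the two defaults differ in every date field.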

To find such dates in a longer text, search for groups of numbers and month names and pass each candidate to this parser. It raises an exception (a ValueError) if the text does not look like a date.

months = ['January', 'February',...]
months.extend([mon[:3] for mon in months])

# search for numeric dates (needs "import re"):
numeric = re.findall(r'[\d \-]+', sentence)

# search for dates:
for word in sentence.split():
    if word in months:
        ...
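
A minimal, standard-library-only sketch of that candidate search might look like the following. The function name `find_candidates`, the punctuation set and the rule "keep a neighbouring word only if it starts with a digit" are assumptions made for illustration; the numeric pattern is also narrower than the one above (digits joined by hyphens only).

```python
import re

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
# Full names plus three-letter abbreviations, compared case-insensitively.
MONTH_WORDS = {m.lower() for m in MONTHS} | {m[:3].lower() for m in MONTHS}

# Groups of digits joined by hyphens, e.g. "08-06" or "01-08-06".
NUMERIC_DATE = re.compile(r"\b\d+(?:-\d+)+\b")
PUNCTUATION = ".,;:!?()\"'"

def find_candidates(sentence):
    """Return substrings that might be dates; a parser has to confirm them."""
    words = [w.strip(PUNCTUATION) for w in sentence.split()]
    candidates = []
    for i, word in enumerate(words):
        if word.lower() not in MONTH_WORDS:
            continue
        parts = [word]
        # Keep the neighbouring words only if they start with a digit.
        if i > 0 and words[i - 1][:1].isdigit():
            parts.insert(0, words[i - 1])
        if i + 1 < len(words) and words[i + 1][:1].isdigit():
            parts.append(words[i + 1])
        candidates.append(" ".join(parts))
    candidates.extend(NUMERIC_DATE.findall(sentence))
    return candidates

print(find_candidates("It arrived on 19th August 2006, see order 01-08-06."))
```

This misses glued forms such as "Aug06" and flags the verb "may" as a month, so each candidate still has to go through the parser.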
eumiro
This is not a generic solution.
anand
One hopes that there's an easy way of turning off the "plug the gaps with the current value" caper ... "Aug2006" -> "2006-08-28" just because today is the 28th of the month is a bit of a boggler
John Machin
@anand: But he has answered one part of the question well - how to parse dates.
Tim Pietzcker
A date written as "01-08-06" can be interpreted as the 1st of August or as the 8th of January, depending on the country.
BatchyX
Ugh. The defaults are extracted from a datetime object. No way of telling that the day was missing. Ugh2: the window for 2-digit years relates to the current year -- useless with historical data. Ugh3: the YMD/DMY/MDY "precedence" thingie doesn't allow detection of a mixture of data orders in your data.
John Machin
@BatchyX: ... or the 6th August 2010. Depending on the country? What country? It has a "dayfirst" boolean arg.
John Machin
+1  A: 

This finds all the dates in your example sentence:

import re

for match in re.finditer(
    r"""(?ix)             # case-insensitive, verbose regex
    \b                    # match a word boundary
    (?:                   # match the following three times:
     (?:                  # either
      \d+                 # a number,
      (?:\.|st|nd|rd|th)* # followed by a dot, st, nd, rd, or th (optional)
      |                   # or a month name
      (?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*)
     )
     [\s./-]*             # followed by a date separator or whitespace (optional)
    ){3}                  # do this three times
    \b                    # and end at a word boundary.""", 
    subject):
    # match.start() is the match start, match.end() the (exclusive) end,
    # and match.group() the matched text
    print(match.start(), match.end(), match.group())

It's definitely not perfect. It is liable to miss some dates (especially if they are not in English: "21. Mai 2006" would fail, as would "4ème décembre 1999") and to match nonsense like "August Augst Aug". But since nearly everything is optional in your examples, there is not much you can do at the regex level.

The next step would be to feed all the matches into a parser and see if it can parse them into a sensible date.

The regex can't interpret context correctly. Imagine a (stupid) text like "You'll find it in box 21. August 3rd will be the shipping date." It will match "21. August 3rd", which of course can't be parsed.
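
One way to do that filtering step is sketched below, assuming dateutil as the parser (any parser that raises on bad input would do). The name `extract_dates` is made up, and stripping trailing separators is an extra step added here because the regex can swallow a space after the last item.

```python
import re
from dateutil import parser

DATE_CANDIDATE = re.compile(
    r"""(?ix)             # case-insensitive, verbose regex
    \b
    (?:                   # three items, each a number or a month name
     (?:
      \d+(?:\.|st|nd|rd|th)*
      |
      (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*
     )
     [\s./-]*             # optional separators
    ){3}
    \b
    """)

def extract_dates(subject):
    """Return (matched text, datetime) for every match the parser accepts."""
    found = []
    for match in DATE_CANDIDATE.finditer(subject):
        # Drop separators the regex may have picked up at the edges.
        text = match.group().strip(" \t\r\n./-")
        try:
            found.append((text, parser.parse(text)))
        except (ValueError, OverflowError):
            continue  # the regex matched, but it is not a usable date
    return found

sample = "The invoice is dated 19th August 2006 and was paid on 01-08-06."
for text, when in extract_dates(sample):
    print(text, when)
```

A match the parser accepts is still only as good as the parser's guess about field order and missing fields.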

Tim Pietzcker