I have a large list of files, some of which have dates embedded in the filename. The format of the dates is inconsistent and often incomplete, e.g. "Aug06", "Aug2006", "August 2006", "08-06", "01-08-06", "2006", "011004" etc. In addition to that, some filenames have unrelated numbers that look somewhat like dates, e.g. "20202010".

In short, the dates are usually incomplete, sometimes absent, inconsistently formatted, and embedded in a string with other information, e.g. "Report Aug06.xls".

Are there any Perl modules available which will do a decent job of guessing the date from such a string? It doesn't have to be 100% correct, as it will be verified by a human manually, but I'm trying to make things as easy as possible for that person and there are thousands of entries to check :)

A: 

Date::Parse does what you want.

Cfreak
Date::Parse doesn't handle all of the other junk in the string nicely, so I have a 100% undefined rate using it; I need something clever enough to ignore the crud and find a date. It's as much natural language processing as date parsing, I suppose.
El Yobo
+2  A: 

Date::Parse is definitely going to be part of your answer: it's the bit that takes an arbitrarily formatted date-like string and makes an actual usable date out of it.

The other part of your problem, the rest of the characters in your filenames, is unusual enough that you're unlikely to find that someone else has packaged up a module for you.

Without seeing more of your sample data, it's really only possible to guess, but I'd start by identifying possible or likely "date section" candidates.

Here's a nasty brute-force example using Date::Parse. A smarter approach would use a list of regexes to try to identify the date-like bits, but I'm happy to burn CPU cycles to avoid thinking quite so hard!

#!/usr/bin/perl
use strict;
use warnings;
use Date::Parse;

my @files=("Report Aug06.xls", "ReportAug2006", "Report 11th September 2006.xls", 
           "Annual Report-08-06", "End-of-month Report01-08-06.xls", "Report2006");

# assumption - longest likely date string is something like '11th September 2006' - 19 chars
# shortest is "2006" - 4 chars.
# brute force all strings from 19-4 chars long at the end of the filename (less extension)
# return the longest thing that Date::Parse recognises as a date



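foreach my $file (@files){
  # strip the last extension, if there is one (on a copy, so @files is untouched)
  (my $name = $file) =~ s/\.[^.]*$//;
  # try the longest tail first, so the longest recognised date wins
  for my $len (-19..-4){
    my $string = substr($name, $len);
    my $time = str2time($string);
    next unless defined $time;   # str2time returns undef if it can't parse
    print "$string is a date: $time = ",scalar(localtime($time)),"\n";
    last;
    }
  }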
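
For comparison, the "list of regexes" idea mentioned above could be sketched roughly like this. It is not the approach used in the code above, just an illustration under some loud assumptions: the pattern list, the helper names (full_year, guess_date), the 70-year pivot for two-digit years, and the day-month-year reading of "01-08-06" are all invented for this example. It uses only core Perl (no Date::Parse) and returns a guess for a human to check, not a verified date.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Month abbreviation -> month number.
my %MONTH = (jan=>1, feb=>2, mar=>3, apr=>4, may=>5, jun=>6,
             jul=>7, aug=>8, sep=>9, oct=>10, nov=>11, dec=>12);

# Captures the three-letter abbreviation; the rest of the name is optional.
my $MON = qr/(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)(?:uary|ruary|ch|il|e|y|ust|t|tember|ober|ember)?/i;

# Assumption: two-digit years below 70 mean 20xx, otherwise 19xx.
sub full_year { my $y = shift; return $y > 99 ? $y : $y < 70 ? 2000 + $y : 1900 + $y }

# Patterns are tried in order, most specific first. Each handler turns the
# captures into (year, month, day), using undef for parts that are missing.
my @PATTERNS = (
    # "11th September 2006"
    [ qr/(?<!\d)(\d{1,2})(?:st|nd|rd|th)?\s*$MON\s*(\d{4}|\d{2})(?!\d)/i,
      sub { (full_year($_[2]), $MONTH{lc $_[1]}, $_[0] + 0) } ],
    # "Aug06", "Aug2006", "August 2006"
    [ qr/$MON\s*(\d{4}|\d{2})(?!\d)/i,
      sub { (full_year($_[1]), $MONTH{lc $_[0]}, undef) } ],
    # "01-08-06" (assumed to be day-month-year)
    [ qr/(?<!\d)(\d{2})-(\d{2})-(\d{4}|\d{2})(?!\d)/,
      sub { (full_year($_[2]), $_[1] + 0, $_[0] + 0) } ],
    # "08-06" (assumed to be month-year)
    [ qr/(?<!\d)(\d{2})-(\d{2})(?!\d)/,
      sub { (full_year($_[1]), $_[0] + 0, undef) } ],
    # bare four-digit year such as "2006", not part of a longer digit run
    [ qr/(?<!\d)((?:19|20)\d{2})(?!\d)/,
      sub { ($_[0] + 0, undef, undef) } ],
);

# Return "YYYY", "YYYY-MM" or "YYYY-MM-DD" for the first pattern that
# matches and passes a basic range check, or nothing if none does.
sub guess_date {
    my ($name) = @_;
    (my $stem = $name) =~ s/\.[^.]+$//;    # drop a trailing extension, if any
    for my $p (@PATTERNS) {
        my ($re, $to_ymd) = @$p;
        my @caps = $stem =~ $re or next;
        my ($y, $m, $d) = $to_ymd->(@caps);
        next if defined $m && ($m < 1 || $m > 12);
        next if defined $d && ($d < 1 || $d > 31);
        return join '-', sprintf('%04d', $y),
               map { sprintf('%02d', $_) } grep { defined $_ } ($m, $d);
    }
    return;
}

for my $file ("Report Aug06.xls", "ReportAug2006", "Report 11th September 2006.xls",
              "Annual Report-08-06", "End-of-month Report01-08-06.xls",
              "Report2006", "20202010") {
    my $guess = guess_date($file);
    printf "%-32s => %s\n", $file, defined $guess ? $guess : 'no date found';
}
```

The digit lookarounds are what keep a string like "20202010" from being read as a year. Formats with no pattern of their own (e.g. "011004") simply come back as "no date found", which seems preferable to a confident wrong guess when a human is checking the results anyway.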
bigiain
This is somewhat similar to how I did it in the end, but mine is much longer, uglier and scary :) I'll leave the question open for now, in case someone out there has come across the problem before, but it seems like a bit of a roll your own solution thing...
El Yobo
Your answer is essentially correct; there don't appear to be any libraries for doing this, so you have to do it yourself :)
El Yobo
A: 

DateTime::Format::Natural looks like a candidate for this job. I can't vouch for it personally but it has good reviews.

Kinopiko
I did come across it, but like Date::Parse, Date::Manip et al., it seems to require that all the data in the string is relevant to the date, whereas most of the content of my strings is just noise (other parts of the file name).
El Yobo