I need to be able to recognise date strings. It doesn't matter if I cannot distinguish between month and day (e.g. 12/12/10); I just need to classify the string as being a date, rather than convert it to a Date object. So this is really a classification problem rather than a parsing problem.

I will have pieces of text such as:

"bla bla bla bla 12 Jan 09 bla bla bla 01/04/10 bla bla bla"

and I need to be able to recognise the start and end boundary for each date string within.

I was wondering if anyone knew of any Java libraries that can do this. My google-fu hasn't come up with anything so far.

UPDATE: I need to be able to recognise the widest possible set of ways of representing dates. Of course the naive solution might be to write an if statement for every conceivable format, but a pattern recognition approach, with a trained model, is ideally what I'm after.

A: 

Usually dates are characters separated by a back/forward slash or a dash. Did you consider a regular expression?

I am assuming you are not looking to classify dates of the type "Sunday, October 3rd 2010" and so on.

npinti
Yes, I am. ANY date format.
Joel
You are unusually wrong. There is a whole world outside, and I am afraid that most countries do not use a slash as the date separator.
Paweł Dyda
A: 

I don't know of any library that can do this, but writing your own wouldn't be incredibly hard. Assuming your dates are all formatted with slashes, like 12/12/12, you could verify that the string contains two '/' characters. You could get even more technical and check the values between the slashes. For instance, if you have:

30/12/10

Then you know that 30 is the day and 12 is the month. However, if you get 30/30/10, you know that even though it has the correct format, it cannot be a date because there is no month 30.
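The range check described above could be sketched in Java roughly as follows. The class and method names are hypothetical, and accepting either day-first or month-first order is an assumption, based on the question saying the two need not be distinguished.

```java
class SlashDateCheck {
    /** Returns true if s has three slash-separated numeric fields with a plausible day and month. */
    static boolean looksLikeDate(String s) {
        String[] parts = s.split("/", -1); // -1 keeps empty trailing fields
        if (parts.length != 3) {
            return false;
        }
        int[] v = new int[3];
        for (int i = 0; i < 3; i++) {
            if (!parts[i].matches("\\d{1,4}")) {
                return false; // each field must be 1 to 4 digits
            }
            v[i] = Integer.parseInt(parts[i]);
        }
        // Accept either ordering of the first two fields; the year is not range-checked.
        boolean dayFirst = v[0] >= 1 && v[0] <= 31 && v[1] >= 1 && v[1] <= 12;
        boolean monthFirst = v[0] >= 1 && v[0] <= 12 && v[1] >= 1 && v[1] <= 31;
        return dayFirst || monthFirst;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeDate("30/12/10")); // true
        System.out.println(looksLikeDate("30/30/10")); // false
    }
}
```

This only covers slash-separated numeric dates; it does not check the number of days per month.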

Glenn Nelson
A: 

Maybe you should use regular expressions?

Hopefully this one would work for mm-dd-yyyy format:

^(0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])[- /.](19|20)\d\d$

Here (0[1-9]|1[012]) matches the month 01..12, (0[1-9]|[12][0-9]|3[01]) matches a day 01..31 and (19|20)\d\d matches a year from 1900 to 2099.

Fields can be delimited by a dash, space, slash or dot.
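Because the pattern is anchored with ^ and $, it only matches a string that consists of nothing but the date. To get the start and end boundaries inside a longer text, as the question asks, one option is to swap the anchors for \b and use Matcher.find(). The following is a minimal sketch; the class name DateSpanFinder and the method findSpans are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DateSpanFinder {
    // Same mm-dd-yyyy pattern, with ^ and $ replaced by \b so it can match mid-string.
    private static final Pattern MDY = Pattern.compile(
        "\\b(0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])[- /.](19|20)\\d\\d\\b");

    /** Returns {start, end} offsets (end exclusive) of every match in text. */
    static List<int[]> findSpans(String text) {
        List<int[]> spans = new ArrayList<>();
        Matcher m = MDY.matcher(text);
        while (m.find()) {
            spans.add(new int[] {m.start(), m.end()});
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "bla bla 01/04/2010 bla bla 12-25-2009 bla";
        for (int[] s : findSpans(text)) {
            System.out.println(s[0] + ".." + s[1] + " -> " + text.substring(s[0], s[1]));
        }
    }
}
```

It still recognises only this one format, so further patterns would have to be added for anything else.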

Regards, Serge

zserge
There are loads of ways to represent a date. Although I could use simple heuristics, a classifier might be more robust. I need to recognise ANY date format.
Joel
@Joel then maybe you can split the string using the [- /.] regex and then make sure it has 3 fields and each of them matches one of the expressions for day (from 1 to 31), month (from 1 to 12) and year (19xx/20xx or just xx)?
zserge
Yes, seems like a good approach: split on any non-alphanumeric character, then test each field independently and make sure that you have at least one candidate for each of month, day and year.
Joel
A: 

I don't know of any library that does this either. I would suggest a mix of nested recursive functions and regular expressions (a lot) to match strings and try to come up with a best guess to see if it can be a date. Dates can be written in a lot of different ways, some people might write them out as "Sunday, October 3 2010" or "Sunday, October 3rd 2010" or "10/03/2010" or "10/3/2010" and a whole bunch of different ways (even more if you are considering dates in other languages/cultures).

prototypef
A: 

You could always check to see if there are two '/' characters in a string.

public static boolean isDate(String date) {
     int counter = 0;
     for (int i = 0; i < date.length(); i++) {
          // Count separators; '/', '-' and '.' are all accepted.
          if ("/-.".indexOf(date.charAt(i)) != -1)
               counter++;
     }
     // Exactly two separators means three fields, e.g. "12/25/2010".
     return counter == 2;
}

You can do something similar to check to see if everything else is an integer.

Salem
A: 

It is virtually impossible to recognize all possible date formats as dates using "standard" algorithms. That's just because there are so many of them.

We humans are capable of doing that just because we have learned that something like 2010-03-31 resembles a date. In other words, I would suggest using Machine Learning algorithms and teaching your program to recognize valid date sequences. With the Google Prediction API that should be feasible.

Or you can use Regular Expressions as suggested above, to detect some but not all date formats.

Paweł Dyda
+1  A: 

I am sure researchers in information extraction have looked at this problem, but I couldn't find a paper.

One thing you can try is to do it as a three-step process. (1) After collecting as much data as you can, extract features. Some features that come to mind: the number of numbers that appear in the string, the number of numbers from 1-31, the number of numbers from 1-12, the number of month names, and so on. (2) Learn from the features using some type of binary classification method (an SVM, for example). (3) When a new string comes by, extract the features and query the SVM for a prediction.
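As a rough illustration of the feature-extraction step only, here is a sketch in Java that computes the four features listed above. The class name, the method name and the English-only month list are assumptions for the example. Training and querying the classifier are left out, since they depend on whichever learning library is chosen.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DateFeatures {
    private static final Pattern NUMBER = Pattern.compile("\\d+");
    private static final Pattern WORD = Pattern.compile("[A-Za-z]+");
    // English month names only; an assumption made for this illustration.
    private static final List<String> MONTHS = Arrays.asList(
        "january", "february", "march", "april", "may", "june", "july",
        "august", "september", "october", "november", "december");

    /** Returns {numbers, numbers in 1..31, numbers in 1..12, month names}. */
    static int[] extract(String s) {
        int numbers = 0, dayLike = 0, monthLike = 0, monthNames = 0;
        Matcher n = NUMBER.matcher(s);
        while (n.find()) {
            numbers++;
            // Runs longer than two digits cannot be a day or month, so skip parsing them.
            if (n.group().length() <= 2) {
                int v = Integer.parseInt(n.group());
                if (v >= 1 && v <= 31) dayLike++;
                if (v >= 1 && v <= 12) monthLike++;
            }
        }
        Matcher w = WORD.matcher(s.toLowerCase(Locale.ROOT));
        while (w.find()) {
            if (isMonthName(w.group())) monthNames++;
        }
        return new int[] {numbers, dayLike, monthLike, monthNames};
    }

    // A word counts if it is a prefix (3+ letters) of a month name, e.g. "jan" or "sept".
    private static boolean isMonthName(String word) {
        if (word.length() < 3) return false;
        for (String month : MONTHS) {
            if (month.startsWith(word)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(extract("12 Jan 09"))); // [2, 2, 2, 1]
    }
}
```

The resulting vectors would then be fed, together with date / not-date labels, to the binary classifier of step (2).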

carlosdc
+1, an SVM might be a reasonable learning tool.
Joel
+1  A: 

You can loop all available date formats in Java:

for (Locale locale : DateFormat.getAvailableLocales()) {
    for (int style = DateFormat.FULL; style <= DateFormat.SHORT; style++) {
        DateFormat df = DateFormat.getDateInstance(style, locale);
        try {
            df.parse(dateString);
            // Parsed successfully: either return true or return the resulting Date object.
        } catch (ParseException ex) {
            continue; // unparseable, try the next one
        }
    }
}

This, however, won't account for any custom date formats.
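One caveat with the loop above: DateFormat.parse(String) may stop before the end of the string, so "12/25/10 and more text" can still parse. For classification it may be safer to use the ParsePosition overload and require that the whole candidate was consumed. The following is a sketch under that assumption; the class and method names are hypothetical, and turning off lenient parsing is my own addition.

```java
import java.text.DateFormat;
import java.text.ParsePosition;
import java.util.Locale;

class LocaleDateCheck {
    private static final int[] STYLES = {
        DateFormat.FULL, DateFormat.LONG, DateFormat.MEDIUM, DateFormat.SHORT};

    /** Returns true if some built-in locale format parses the entire candidate. */
    static boolean matchesBuiltInFormat(String candidate) {
        for (Locale locale : DateFormat.getAvailableLocales()) {
            for (int style : STYLES) {
                DateFormat df = DateFormat.getDateInstance(style, locale);
                df.setLenient(false); // reject out-of-range values such as month 13
                ParsePosition pos = new ParsePosition(0);
                // This overload returns null on failure instead of throwing.
                if (df.parse(candidate, pos) != null
                        && pos.getIndex() == candidate.length()) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(matchesBuiltInFormat("January 12, 2009"));
        System.out.println(matchesBuiltInFormat("not a date"));
    }
}
```

The result depends on the locale data shipped with the JDK in use, so the same string may be accepted on one runtime and rejected on another.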

Bozho
Yes, had considered this, but it is ultimately a finite list.
Joel