views:

85

answers:

3

Hello, I want to determine if a string is the name of a month and I want to do it relatively quickly. The function that is currently stuck in my brain is something like:

boolean isaMonth( String str ) {
    String[] months = DateFormatSymbols.getInstance().getMonths();
    String[] shortMonths = DateFormatSymbols.getInstance().getShortMonths();
    int i;

    for( i = 0; i<months.length(); ++i;) {
        if( months[i].equals(str) ) return true;
        if( shortMonths[i].equals(str ) return true;
    }
    return false;
}

However, I will be processing lots of text, passed one string at a time to this function, and most of the time I will be getting the worst case of going through the entire loop and returning false.

I saw another question that talked about a Regex to match a month name and a year which could be adapted for this situation. Would the Regex be faster? Is there any other solution that might be faster?

+3  A: 

Why not store the month names in a HashSet? This will give you constant time lookup instead of the linear time lookup you are getting from your loop.

import java.util.HashSet;
import java.util.Collections;
import java.text.DateFormatSymbols;

class Test {
  public static void main(String[] args) {

    HashSet<String> months = new HashSet<String>(24);  

    Collections.addAll(months, DateFormatSymbols.getInstance().getMonths());
    Collections.addAll(months, DateFormatSymbols.getInstance().getShortMonths());

    System.out.println(months.contains(args[0]));

  }
}
dbyrne
How about `Collections.addAll(months, DateFormatSymbols.getInstance().getMonths())`?
spong
Good call, changed the code
dbyrne
This answer got me going in the right direction. I also like some of what Kevin Day says a couple of answers down, and I will try to incorporate some of that a little later.
jonc
+1  A: 

Merge months and shortMonths into a single sorted array and do a binary search on the array. Or merge them both into a Set (HashSet) and use contains. Change all the month names to lowercase and do the same with the search value, if you want to be case insensitive.

If you want to be able to retrieve the number of the month, merge them all into a Map (HashMap) with the value being the month number.

drawnonward
+1  A: 

HashSet is a good general purpose solution - but I think you can do better. Take a look at the first letter of the months - jfmasond - if you pre-filter on those, and only do the HashSet check if it passes, it will take care of a huge number of your 'return false' scenarios.

You can set this up a couple of ways - one super easy way to do it is to use a switch statement, although a lookup table would be faster. Note also that you only need to do the check if the first character is between a and s, so a lookup table doesn't have to have the full unicode (or UTF-8 depending on the requirements) code space.

To make this even more effective, you can construct your lookup table so it contains the first 2 characters of each month - the resulting lookup table isn't too big, and this would drastically reduce the number of words that need to be checked against the hashset.

PS - before you do any of this, you should do some profiling and make sure that this is the area of your code that is actually the bottleneck.

Kevin Day
Nice thought about the lookup table. The whole project hasn't quite come together yet though, so I will hold off on the lookup table until I can do some better profiling
jonc
@Kevin Day: basically your describing creating your own very-specific state machine (which I commented to the OP before noticing your answer). In Java a carefully constructed *switch* statement shall be faster than a table lookup: basically you want your switch statement to become the *tableswitch* JVM opcode (not the *lookupswitch* one, which is slower). Doing this, you dodge the lookup table and hence you'll be faster (less often invalidating caches etc.). Now of course I doubt the OP really need these kind of optimizations in his case.
Webinator
@webinator - good info - I haven't done enough spelunking in the Java bytecode, but your description makes sense. I'm interested in what causes the tableswitch vs lookupswitch opcode to be interpreted - I'll look that up.
Kevin Day