Strictly speaking this is a machine learning classification problem, but I imagine there's a perfectly good quick-and-dirty way to do it. I want to map a string describing an NFL team, like "San Francisco" or "49ers" or "San Francisco 49ers" or "SF forty-niners", to a canonical name for the team. (There are 32 NFL teams, so it really just means finding the nearest of 32 bins to put a given string in.)
The incoming strings are not totally arbitrary (they come from structured data sources like this one: http://www.repole.com/sun4cast/stats/nfl2008lines.csv), so it's not really necessary to handle every crazy corner case like the "SF forty-niners" example above.
I should also add that if anyone knows of a data source containing both moneyline Vegas odds and actual game outcomes for the past few years of NFL games, that would obviate the need for this. The reason I need the canonicalization is to match up two disparate data sets, one with odds and one with outcomes.
Ideas for better, more parsable, sources of data are very welcome!
Added: The substring matching idea might well suffice for this data; thanks! Could it be made a little more robust by picking the team name with the smallest Levenshtein distance to the input?
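To make the question concrete, here is a rough sketch of what I have in mind: try a case-insensitive substring match first, and fall back to edit distance only when nothing matches. The `TEAMS` list is just an illustrative subset (the real one would hold all 32 names), and `levenshtein` and `canonicalize` are names I made up for this sketch, not part of any library.

```python
# Illustrative subset only; the real list would contain all 32 canonical names.
TEAMS = [
    "Arizona Cardinals",
    "San Francisco 49ers",
    "Green Bay Packers",
    "New York Giants",
    "New York Jets",
    "Seattle Seahawks",
]


def levenshtein(a, b):
    """Edit distance (insert, delete, substitute) via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def canonicalize(raw, teams=TEAMS):
    """Return the canonical team name for raw, or None if empty or ambiguous."""
    s = raw.strip().lower()
    if not s:
        return None

    # Step 1: case-insensitive substring match against the canonical names.
    hits = [t for t in teams if s in t.lower()]
    if len(hits) == 1:
        return hits[0]
    if len(hits) > 1:
        return None  # e.g. "New York" matches two teams; don't guess

    # Step 2: no substring hit, so pick the team whose full name or any
    # single word of it is closest to the input by edit distance.
    def score(team):
        name = team.lower()
        return min(levenshtein(s, part) for part in [name] + name.split())

    return min(teams, key=score)
```

With this subset, `canonicalize("49ers")` and `canonicalize("San Francisco")` both resolve through the substring step, while a misspelling like `"Seahwaks"` falls through to the edit-distance step. Note that the fallback always returns some team, so a maximum-distance cutoff would probably be needed in practice, and an input like "SF forty-niners" would likely still need an explicit alias table.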