This is actually a machine learning classification problem but I imagine there's a perfectly good quick-and-dirty way to do it. I want to map a string describing an NFL team, like "San Francisco" or "49ers" or "San Francisco 49ers" or "SF forty-niners", to a canonical name for the team. (There are 32 NFL teams so it really just means finding the nearest of 32 bins to put a given string in.)

The incoming strings are not actually totally arbitrary (they're from structured data sources like this: http://www.repole.com/sun4cast/stats/nfl2008lines.csv) so it's not really necessary to handle every crazy corner case like in the 49ers example above.

I should also add that in case anyone knows of a source of data containing both moneyline Vegas odds as well as actual game outcomes for the past few years of NFL games, that would obviate the need for this. The reason I need the canonicalization is to match up these two disparate data sets, one with odds and one with outcomes:

Ideas for better, more parsable, sources of data are very welcome!

Added: The substring matching idea might well suffice for this data; thanks! Could it be made a little more robust by picking the team name with the smallest Levenshtein distance?
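As a rough sketch of that idea (in Python, with a hypothetical four-name candidate list; `levenshtein` and `nearest_team` are illustrative names, not from any library), picking the nearest name by edit distance could look like this:

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        cur = [i]                           # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            cur.append(min(prev[j] + 1,          # delete ch_a
                           cur[j - 1] + 1,       # insert ch_b
                           prev[j - 1] + cost))  # substitute or keep
        prev = cur
    return prev[-1]

# Hypothetical sample; the real list would hold one entry per team.
CANDIDATES = ["Denver", "Minnesota", "Arizona", "Jacksonville"]

def nearest_team(name, candidates=CANDIDATES):
    """Return the candidate with the smallest case-insensitive edit distance."""
    return min(candidates, key=lambda c: levenshtein(name.lower(), c.lower()))
```

Edit distance counts every missing character, so "Denver" is 8 edits away from "Denver Broncos". It probably works best when both sides use the same form of the name, for example city against city.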

A: 

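If you know both the source and destination names, then you just need to map them. In PHP, you would use an array whose keys are the names from the data source and whose values are the canonical names, then look up each incoming name:

$map = array('49ers'   => 'San Francisco 49ers',
             'packers' => 'Green Bay Packers');

// $incoming_names is the list of raw names read from the data source.
foreach ($incoming_names as $name) {
    $key = strtolower(trim($name));   // normalize before the lookup
    echo isset($map[$key]) ? $map[$key] : "unknown: $name", "\n";
}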
willoller
Oh, but the question is how to avoid doing that manual mapping. But I suppose it really isn't that big of a deal... :) In any case, I thought it might be useful to have as general as possible a canonicalization function.
dreeves
You could use SOUNDEX or something, but the diversity of the data and the small size of the total dataset make a general solution less likely.
willoller
+1  A: 

A quick look shows that both data sets contain the teams' locations (e.g. "Minnesota"). Only one of them has the teams' names. That is, one list looks like:

Denver
Minnesota
Arizona
Jacksonville

and the other looks like

Denver Broncos
Minnesota Vikings
Arizona Cardinals
Jacksonville Jaguars

Seems like, in this case, some pretty simple substring matching would do it.
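A minimal sketch of that substring matching, in Python. `CANONICAL` is a hypothetical four-team sample standing in for the full list of 32, and `match_by_substring` is an illustrative name:

```python
# Hypothetical sample of canonical names; the real list would have all 32 teams.
CANONICAL = [
    "Denver Broncos",
    "Minnesota Vikings",
    "Arizona Cardinals",
    "Jacksonville Jaguars",
]

def match_by_substring(name, canonical=CANONICAL):
    """Return the one canonical name that contains `name`, else None."""
    needle = name.strip().lower()
    if not needle:
        return None
    hits = [full for full in canonical if needle in full.lower()]
    # Refuse to guess when nothing matches or the input is ambiguous.
    return hits[0] if len(hits) == 1 else None
```

Returning None on zero or multiple hits is a design choice for this sketch: a bare "New York" would match two teams in the full list, and it seems safer to flag that than to pick one silently.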

Jim Mischel
+2  A: 

Here's something plenty robust even for arbitrary user input, I think. First, map each team (I'm using a 3-letter code as the canonical name for each team) to a fully spelled-out version with city and team name, plus any nicknames or abbreviations appended after the team name.

Scan[(fullname[First@#] = #[[2]])&, {
  {"ari", "Arizona Cardinals"},                 {"atl", "Atlanta Falcons"}, 
  {"bal", "Baltimore Ravens"},                  {"buf", "Buffalo Bills"}, 
  {"car", "Carolina Panthers"},                 {"chi", "Chicago Bears"}, 
  {"cin", "Cincinnati Bengals"},                {"clv", "Cleveland Browns"}, 
  {"dal", "Dallas Cowboys"},                    {"den", "Denver Broncos"}, 
  {"det", "Detroit Lions"},                     {"gbp", "Green Bay Packers"}, 
  {"hou", "Houston Texans"},                    {"ind", "Indianapolis Colts"}, 
  {"jac", "Jacksonville Jaguars"},              {"kan", "Kansas City Chiefs"}, 
  {"mia", "Miami Dolphins"},                    {"min", "Minnesota Vikings"}, 
  {"nep", "New England Patriots"},              {"nos", "New Orleans Saints"}, 
  {"nyg", "New York Giants NYG"},               {"nyj", "New York Jets NYJ"}, 
  {"oak", "Oakland Raiders"},                   {"phl", "Philadelphia Eagles"}, 
  {"pit", "Pittsburgh Steelers"},               {"sdc", "San Diego Chargers"}, 
  {"sff", "San Francisco 49ers forty-niners"},  {"sea", "Seattle Seahawks"}, 
  {"stl", "St Louis Rams"},                     {"tam", "Tampa Bay Buccaneers"}, 
  {"ten", "Tennessee Titans"},                  {"wsh", "Washington Redskins"}}]

Then, for any given string, find the longest common subsequence it shares with each team's full name. (Mathematica's LongestCommonSubsequence returns the longest contiguous run of shared characters, i.e. the longest common substring.) To give preference to strings matching at the beginning or the end (e.g., "car" should match "carolina panthers" rather than "arizona cardinals"), sandwich both the input string and the full names between spaces. Whichever team's full name shares the longest such subsequence with the input string is the team we return. Here's a Mathematica implementation of the algorithm:

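(* The 3-letter codes, read back from the definitions attached to fullname. *)
teams = DownValues[fullname][[All, 1, 1, 1]];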

(* argMax[f, domain] returns the element of domain for which f of that element is
   maximal -- breaks ties in favor of first occurrence. *)
SetAttributes[argMax, HoldFirst];
argMax[f_, dom_List] := Fold[If[f[#1] >= f[#2], #1, #2] &, First@dom, Rest@dom]

canonicalize[s_] := argMax[StringLength@LongestCommonSubsequence[" "<>s<>" ", 
                                 " "<>fullname@#<>" ", IgnoreCase->True]&, teams]
dreeves
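The algorithm in this answer can also be sketched in Python. This is a rough port under stated assumptions, not the code above: `FULLNAME` is a hypothetical four-team subset of the table, and `lcs_length` and `canonicalize` are illustrative names.

```python
# Hypothetical subset of the code -> full name table from the answer.
FULLNAME = {
    "gbp": "Green Bay Packers",
    "nos": "New Orleans Saints",
    "sff": "San Francisco 49ers forty-niners",
    "stl": "St Louis Rams",
}

def lcs_length(a, b):
    """Length of the longest contiguous substring shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)   # prev[j]: shared run ending at previous char of a and b[j-1]
    for ch_a in a:
        cur = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

def canonicalize(s, fullname=FULLNAME):
    """Return the code whose padded full name shares the longest substring with s."""
    padded = " " + s.lower() + " "   # spaces reward matches at word boundaries
    # max() keeps the first of several equally good codes, like argMax above.
    return max(fullname,
               key=lambda code: lcs_length(padded, " " + fullname[code].lower() + " "))
```

With this subset, `canonicalize("49ers")` and `canonicalize("forty-niners")` both give "sff", and `canonicalize("Saint Louis")` gives "stl".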
Neat idea - so to increase this to a general solution you could add additional items to the long names like "Saint Louis St Louis Rams", or how you already did with SF.
willoller
Exactly! With "St Louis" vs "Saint Louis" though, the matching of the "Louis" will always be plenty. It's just for alternatives like "49ers" vs "forty-niners" with few letters in common that you'd need to add the additional items like you say.
dreeves
I almost recanted that, since you might think "Saint Louis" could match "New Orleans Saints" as well as "St Louis Rams". But in fact the longest shared substring in the first case is "saint" and in the second case is "t louis", so it really is robust enough to such variations without adding anything.
dreeves