ansaurus

Question

Answer 1

A:

If you know both the source and destination names, then you just need to map them. In PHP, you would use an array whose keys are the names from the data source and whose values are the destination names. Then you would look them up like this:

$map = array('49ers' => 'San Francisco 49ers',
             'packers' => 'Green Bay Packers');

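foreach ($incoming_names as $name) {
   // Fall back to the original name if it has no entry in the map.
   echo isset($map[$name]) ? $map[$name] : $name, "\n";
}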

willoller 2009-01-22 22:19:41

Oh, but the question is how to avoid doing that manual mapping. But I suppose it really isn't that big of a deal... :) In any case, I thought it might be useful to have a canonicalization function that is as general as possible.

dreeves 2009-01-23 00:19:47

You could use SOUNDEX or something similar, but the diversity of the data and the small size of the total dataset make a general solution less likely to pay off.

willoller 2009-01-24 00:34:44

Answer 2

+1 A:

A quick visual inspection shows that both data sets contain the teams' locations (e.g., "Minnesota"). Only one of them has the teams' names. That is, one list looks like:

Denver
Minnesota
Arizona
Jacksonville

and the other looks like

Denver Broncos
Minnesota Vikings
Arizona Cardinals
Jacksonville Jaguars

Seems like, in this case, some pretty simple substring matching would do it.
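
A minimal Python sketch of that substring matching, assuming one list holds only locations and the other holds full names. The function name match_by_location is hypothetical. It returns every full name that contains the location, so an ambiguous location (for instance "New York", shared by two teams) would surface both candidates rather than silently picking one.

```python
def match_by_location(short_names, full_names):
    """Map each short name to the full names that contain it (case-insensitive)."""
    result = {}
    for short in short_names:
        needle = short.strip().lower()
        result[short] = [full for full in full_names if needle in full.lower()]
    return result


locations = ["Denver", "Minnesota", "Arizona", "Jacksonville"]
teams = ["Denver Broncos", "Minnesota Vikings",
         "Arizona Cardinals", "Jacksonville Jaguars"]

for short, candidates in match_by_location(locations, teams).items():
    print(short, "->", candidates)
```

A location with exactly one candidate is a safe match; anything with zero or several candidates still needs a manual entry.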

Jim Mischel 2009-01-23 00:11:44

Answer 3

+2 A:

Here's something plenty robust even for arbitrary user input, I think. First, map each team (I'm using a 3-letter code as the canonical name for each team) to a fully spelled-out version with city and team name, followed by any nicknames or alternate spellings (e.g., "NYG" or "forty-niners").

Scan[(fullname[First@#] = #[[2]])&, {
  {"ari", "Arizona Cardinals"},                 {"atl", "Atlanta Falcons"}, 
  {"bal", "Baltimore Ravens"},                  {"buf", "Buffalo Bills"}, 
  {"car", "Carolina Panthers"},                 {"chi", "Chicago Bears"}, 
  {"cin", "Cincinnati Bengals"},                {"clv", "Cleveland Browns"}, 
  {"dal", "Dallas Cowboys"},                    {"den", "Denver Broncos"}, 
  {"det", "Detroit Lions"},                     {"gbp", "Green Bay Packers"}, 
  {"hou", "Houston Texans"},                    {"ind", "Indianapolis Colts"}, 
  {"jac", "Jacksonville Jaguars"},              {"kan", "Kansas City Chiefs"}, 
  {"mia", "Miami Dolphins"},                    {"min", "Minnesota Vikings"}, 
  {"nep", "New England Patriots"},              {"nos", "New Orleans Saints"}, 
  {"nyg", "New York Giants NYG"},               {"nyj", "New York Jets NYJ"}, 
  {"oak", "Oakland Raiders"},                   {"phl", "Philadelphia Eagles"}, 
  {"pit", "Pittsburgh Steelers"},               {"sdc", "San Diego Chargers"}, 
  {"sff", "San Francisco 49ers forty-niners"},  {"sea", "Seattle Seahawks"}, 
  {"stl", "St Louis Rams"},                     {"tam", "Tampa Bay Buccaneers"}, 
  {"ten", "Tennessee Titans"},                  {"wsh", "Washington Redskins"}}]

Then, for any given string, find the longest common subsequence between it and each team's full name. (Mathematica's LongestCommonSubsequence returns the longest contiguous run, i.e., the longest common substring.) To give preference to strings matching at the beginning or end of a word (e.g., "phi" should match "philadelphia eagles" rather than "miami dolphins"), sandwich both the input string and the full names between spaces. Whichever team's full name shares the longest such subsequence with the input string is the team we return. Here's a Mathematica implementation of the algorithm:

(* The team codes are the arguments for which fullname has been defined. *)
teams = DownValues[fullname][[All, 1, 1, 1]];

(* argMax[f, domain] returns the element of domain for which f of that element is
   maximal -- breaks ties in favor of first occurrence. *)
SetAttributes[argMax, HoldFirst];
argMax[f_, dom_List] := Fold[If[f[#1] >= f[#2], #1, #2] &, First@dom, Rest@dom]

canonicalize[s_] := argMax[StringLength@LongestCommonSubsequence[" "<>s<>" ", 
                                 " "<>fullname@#<>" ", IgnoreCase->True]&, teams]
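
For readers without Mathematica, here is a rough Python sketch of the same idea. It is a port under assumptions, not the code above: FULL_NAMES holds only a small subset of the table, and the names longest_common_substring_length and canonicalize are chosen for illustration. Like argMax, Python's max keeps the first entry when scores tie.

```python
def longest_common_substring_length(a, b):
    """Length of the longest contiguous run of characters shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the run that ended at the previous character of both strings.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


# Illustrative subset of the full table (assumption: the rest is filled in the same way).
FULL_NAMES = {
    "ari": "Arizona Cardinals",
    "car": "Carolina Panthers",
    "mia": "Miami Dolphins",
    "nos": "New Orleans Saints",
    "phl": "Philadelphia Eagles",
    "sff": "San Francisco 49ers forty-niners",
    "stl": "St Louis Rams",
}


def canonicalize(s, full_names=FULL_NAMES):
    """Return the code whose padded full name shares the longest substring with s."""
    padded = " " + s.lower() + " "

    def score(code):
        return longest_common_substring_length(
            padded, " " + full_names[code].lower() + " ")

    return max(full_names, key=score)


print(canonicalize("forty-niners"))
print(canonicalize("Saint Louis"))
```

The space padding is what lets a match at a word boundary count one character more than the same letters found in the middle of a word.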

dreeves 2009-01-23 07:49:04

Neat idea. So to extend this to a more general solution, you could add additional items to the long names, like "Saint Louis St Louis Rams", or as you already did with SF.

willoller 2009-01-24 00:38:51

Exactly! With "St Louis" vs "Saint Louis" though, the matching of the "Louis" will always be plenty. It's just for alternatives like "49ers" vs "forty-niners" with few letters in common that you'd need to add the additional items like you say.

dreeves 2009-01-26 03:10:35

I almost recanted that, since you might think "Saint Louis" could match "New Orleans Saints" as well as "St Louis Rams". But in fact the longest shared substring in the first case is "saint" and in the second case is "t louis", so it really is robust to such variations without adding anything.
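
That comparison can be checked with a few lines of Python. This is a sketch using the standard library's difflib rather than Mathematica, and the helper name longest_shared_substring is made up for the example. Because both strings are padded with spaces as in the answer, the reported substrings include one padding space each.

```python
from difflib import SequenceMatcher


def longest_shared_substring(a, b):
    """Longest contiguous substring shared by the space-padded, lowercased inputs."""
    a = " " + a.lower() + " "
    b = " " + b.lower() + " "
    m = SequenceMatcher(None, a, b, autojunk=False).find_longest_match(
        0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]


for full_name in ("New Orleans Saints", "St Louis Rams"):
    shared = longest_shared_substring("Saint Louis", full_name)
    print(full_name, repr(shared), len(shared))
# New Orleans Saints ' saint' 6
# St Louis Rams 't louis ' 8
```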

dreeves 2009-01-26 03:19:46


Canonicalize NFL team names.
