I have an application which scrapes soccer results from different sources on the web. Team names are not consistent across websites, e.g. Manchester United might be called 'Man Utd' on one site, 'Man United' on a second, 'Manchester United FC' on a third. I need to map all possible variants back to a single name ('Manchester United'), and repeat the process for each of the 20 teams in the league (Arsenal, Liverpool, Man City etc). Obviously I don't want any bad matches [e.g. 'Man City' being mapped to 'Manchester United'].

Right now I specify regexes for all the possible combinations - eg 'Manchester United' would be 'man(chester)?(u|(utd)|(united))(fc)?'; this is fine for a couple of sites but is getting increasingly unwieldy. I'm looking for a solution which would avoid having to specify these regexes. Eg there must be a way to 'score' Man Utd so it gets a high score against 'Manchester United', but a low / zero score against 'Liverpool' [for example]; I'd test the sample text against all possible solutions and pick the one with the highest score.

My sense is that the solution may be similar to the classic example of a neural net being trained to recognise handwriting [ie there is a fixed set of possible outcomes, and a degree of noise in the input samples]

Anyone have any ideas ?

Thanks.

+1  A: 

You could use a similarity metric on the strings involved, together with a hand-tuned threshold. Alternatively, the threshold could be learned with a machine learning approach. Which similarity metric works best depends on the kind of strings you want to match. You might also need to pre-process the strings before applying a metric to them (e.g. remove noise characters such as spaces, normalize capitalization, resolve common, previously known abbreviations, ...)

For a quite comprehensive overview of different string similarity metrics and a Java library see http://www.dcs.shef.ac.uk/~sam/stringmetrics.html
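
As a rough illustration of the idea above (pre-process, score against every candidate, keep the best score if it clears a threshold), here is a minimal Python sketch. It uses the standard library's difflib.SequenceMatcher rather than the Java library linked above. The team list, the normalize and best_match helpers, and the 0.5 threshold are all assumptions made for this example, not something from the question; the threshold in particular would need tuning against real data.

```python
import re
from difflib import SequenceMatcher

# Assumed subset of canonical names, for illustration only.
TEAMS = ["Manchester United", "Manchester City", "Liverpool", "Arsenal"]


def normalize(name):
    """Lowercase, drop a trailing 'fc', and strip everything but letters/digits."""
    name = name.lower().strip()
    name = re.sub(r"\bfc$", "", name)
    return re.sub(r"[^a-z0-9]", "", name)


def score(raw, team):
    """Similarity in [0, 1] between two names after normalization."""
    return SequenceMatcher(None, normalize(raw), normalize(team)).ratio()


def best_match(raw, teams=TEAMS, threshold=0.5):
    """Return the highest-scoring team, or None if nothing reaches the threshold.

    The threshold is a hand-tuned assumption, not a recommended value.
    """
    best = max(teams, key=lambda team: score(raw, team))
    return best if score(raw, best) >= threshold else None


for sample in ["Man Utd", "Man City", "Manchester United FC"]:
    print(sample, "->", best_match(sample))
```

Short abbreviations such as 'Man Utd' score well below 1.0 with this metric, so the gap between a correct match and a wrong one can be small; it is worth logging the scores for a while before trusting any particular threshold.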

michid
Thanks. This is exactly what I was looking for.
Justin
A: 

You might also want to do some structural analysis of the text. A part-of-speech parser might hint at which words are being used as proper nouns, giving you additional clues that "mn au" was "Man U" typed by someone with dyslexic fingers in a hurry--something no regex is going to figure out.

Being able to "train" the software is probably best, too--adding specific spellings as you find them.

Parsing natural language is hard! Good luck!

Alex Feinman
+1  A: 

It appears that you're screen scraping the same sources.

Assuming your sources are consistent in naming the teams, a string conversion would be the most effective solution.

Man Utd -> Manchester United

Manchester United FC -> Manchester United

Gilbert Le Blanc
+2  A: 

I've solved this exact problem in Python but without any sophisticated AI. I just have a text file that maps the different variations to the canonical form of the name. There aren't that many variations and once you've enumerated them all they will rarely change.

My file looks something like this:

man city=Manchester City
man united=Manchester United
man utd=Manchester United
manchester c=Manchester City
manchester utd=Manchester United

I load these aliases into a dictionary object and then when I have a name to map, I convert it to lowercase (to avoid any problems with differing capitalisation) and then look it up in the dictionary.

If you know how many teams there are supposed to be, you can also add a check to warn you if you find more distinct names than you are expecting.
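
A minimal sketch of the approach described above might look like the following. The function names (load_aliases, canonical_name, too_many_teams) are made up for this example, and the alias data is inlined as a string so the snippet is self-contained; in practice the lines would come from the text file.

```python
# Alias data in the "alias=Canonical Name" format shown above.
# Inlined here for illustration; normally read from a file.
ALIASES_TEXT = """\
man city=Manchester City
man united=Manchester United
man utd=Manchester United
manchester c=Manchester City
manchester utd=Manchester United
"""


def load_aliases(lines):
    """Build a dict mapping lowercase alias -> canonical team name."""
    aliases = {}
    for line in lines:
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        alias, canonical = line.split("=", 1)
        aliases[alias.strip().lower()] = canonical.strip()
    return aliases


def canonical_name(raw, aliases):
    """Look up a scraped name case-insensitively; None if it is unknown."""
    return aliases.get(raw.strip().lower())


def too_many_teams(canonical_names, expected_count):
    """Return True (and print a warning) if more distinct teams appear than expected."""
    distinct = set(canonical_names)
    if len(distinct) > expected_count:
        print(f"Warning: {len(distinct)} distinct teams, expected {expected_count}")
        return True
    return False


aliases = load_aliases(ALIASES_TEXT.splitlines())
print(canonical_name("Man Utd", aliases))
```

Returning None for an unknown name makes it easy to log new spellings and add them to the file by hand.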

Dan Dyer