ansaurus

Question

Answer 1

+1 A:

You could use a similarity metric on the strings involved, together with a hand-tuned threshold. Alternatively, the threshold could be learned with a machine learning approach. Which similarity metric works best depends on the kind of strings you want to match. You might also need to pre-process the strings before applying a metric to them (e.g. remove noise characters such as spaces, normalize capitalization, expand known abbreviations).
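A minimal sketch of the metric-plus-threshold idea in Python. It assumes the similarity ratio from the standard library's difflib.SequenceMatcher as the metric; the function names, the 0.6 threshold and the team list are only illustrative, and a different metric may suit your data better.

```python
import difflib
import re


def normalize(name):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    name = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    return " ".join(name.split())


def best_match(name, canonical_names, threshold=0.6):
    """Return the most similar canonical name, or None if nothing reaches the threshold.

    The threshold is a hand-tuned assumption, not a recommended value.
    """
    target = normalize(name)
    best_name, best_score = None, 0.0
    for candidate in canonical_names:
        score = difflib.SequenceMatcher(None, target, normalize(candidate)).ratio()
        if score > best_score:
            best_name, best_score = candidate, score
    return best_name if best_score >= threshold else None


# Hypothetical list of canonical names.
TEAMS = ["Manchester United", "Manchester City", "Liverpool"]
print(best_match("Manchester Utd", TEAMS))
```

Names that score below the threshold come back as None, so they can be set aside for manual review instead of being matched to the wrong team.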

For a quite comprehensive overview of different string similarity metrics and a Java library see http://www.dcs.shef.ac.uk/~sam/stringmetrics.html

michid 2010-08-02 11:46:00

Thanks. This is exactly what I was looking for.

Justin 2010-08-02 12:37:11

Answer 2

A:

You might also want to do some structural analysis of the text. A part-of-speech tagger might hint at which words are being used as proper nouns, giving you additional clues that "mn au" was "Man U" typed by someone with dyslexic fingers in a hurry. That is something no regex is going to figure out.

Being able to "train" the software, by adding specific spellings as you find them, is probably best too.

Parsing natural language is hard! Good luck!

Alex Feinman 2010-08-02 12:46:49

Answer 3

+1 A:

It appears that you're screen scraping the same sources.

Assuming your sources are consistent in naming the teams, a string conversion would be the most effective solution.

Man Utd -> Manchester United

Manchester United FC -> Manchester United

Gilbert Le Blanc 2010-08-02 13:19:04

Answer 4

+2 A:

I've solved this exact problem in Python but without any sophisticated AI. I just have a text file that maps the different variations to the canonical form of the name. There aren't that many variations and once you've enumerated them all they will rarely change.

My file looks something like this:

man city=Manchester City
man united=Manchester United
man utd=Manchester United
manchester c=Manchester City
manchester utd=Manchester United

I load these aliases into a dictionary object and then when I have a name to map, I convert it to lowercase (to avoid any problems with differing capitalisation) and then look it up in the dictionary.
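Roughly, the loading and lookup could be sketched like this. This is a reconstruction for illustration, not the original code: the function names are made up, and the sample text stands in for the alias file.

```python
def load_aliases(lines):
    """Build a lowercase-alias -> canonical-name dict from 'alias=Canonical' lines."""
    aliases = {}
    for line in lines:
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        alias, canonical = line.split("=", 1)
        aliases[alias.strip().lower()] = canonical.strip()
    return aliases


def canonical_name(name, aliases):
    """Look a name up case-insensitively; return None if it is not a known alias."""
    return aliases.get(name.strip().lower())


# Stand-in for the contents of the alias file shown above.
SAMPLE = """\
man city=Manchester City
man united=Manchester United
man utd=Manchester United
manchester c=Manchester City
manchester utd=Manchester United
"""

ALIASES = load_aliases(SAMPLE.splitlines())
print(canonical_name("Man Utd", ALIASES))
```

With a real file you would pass the open file object to load_aliases instead of SAMPLE.splitlines(). A None result means the variation has not been enumerated yet and needs a new line in the file.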

If you know how many teams there are supposed to be, you can also add a check to warn you if you find more distinct names than you are expecting.
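A sketch of such a check, assuming the names have already been mapped to their canonical forms. The function name and the use of the warnings module are illustrative choices, not part of the original solution.

```python
import warnings


def check_team_count(names, expected_count):
    """Warn if more distinct names were seen than there are teams.

    Returns the set of distinct names so the caller can inspect it.
    """
    distinct = set(names)
    if len(distinct) > expected_count:
        warnings.warn(
            f"Found {len(distinct)} distinct team names, "
            f"expected at most {expected_count}"
        )
    return distinct
```

If the warning fires, an unmapped variation has probably slipped through and is being counted as an extra team.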

Dan Dyer 2010-08-02 22:23:58

Algorithm for matching 'noisy' names
