I have an application which scrapes soccer results from different sources on the web. Team names are not consistent across websites: e.g. Manchester United might be called 'Man Utd' on one site, 'Man United' on a second, and 'Manchester United FC' on a third. I need to map all possible variants back to a single canonical name ('Manchester United'), and repeat the process for each of the 20 teams in the league (Arsenal, Liverpool, Man City etc). Obviously I don't want any bad matches [e.g. 'Man City' being mapped to 'Manchester United'].
Right now I specify a regex per team covering all the possible combinations, e.g. 'Manchester United' would be 'man(chester)?(u|(utd)|(united))(fc)?'. This is fine for a couple of sites but is getting increasingly unwieldy as I add more. I'm looking for a solution which avoids having to specify these regexes by hand. There must be a way to 'score' a string such as 'Man Utd' so that it gets a high score against 'Manchester United' but a low or zero score against 'Liverpool' [for example]; I'd then test the sample text against every canonical team name and pick the one with the highest score.
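To make the scoring idea concrete, here is a rough sketch in Python of what I have in mind. It uses `difflib.SequenceMatcher` from the standard library purely as a stand-in for whatever similarity measure turns out to be appropriate; the shortened `TEAMS` list and the `similarity` / `best_match` helpers are names I made up for illustration, not part of my actual application.

```python
from difflib import SequenceMatcher

# Hypothetical, shortened list of canonical names (the real league has 20 teams).
TEAMS = ["Manchester United", "Manchester City", "Liverpool", "Arsenal"]


def similarity(a: str, b: str) -> float:
    """Return a score in [0, 1]; higher means the two strings are more alike.

    Assumption: a case-insensitive SequenceMatcher ratio is a good enough
    proxy for "same team". A different metric could be swapped in here.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(scraped_name: str, candidates=TEAMS):
    """Score the scraped name against every candidate and keep the best one."""
    scored = [(similarity(scraped_name, team), team) for team in candidates]
    score, team = max(scored)
    return team, score


if __name__ == "__main__":
    for raw in ["Man Utd", "Man United", "Manchester United FC", "Man City"]:
        team, score = best_match(raw)
        print(f"{raw!r} -> {team!r} (score {score:.2f})")
```

In practice I would presumably also need a minimum score below which a name is rejected rather than mapped, so that an unknown team is not silently assigned to its nearest neighbour; I have left that out because I don't know what a sensible cut-off would be.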
My sense is that the solution may be similar to the classic example of a neural net being trained to recognise handwriting [i.e. there is a fixed set of possible outcomes, and a degree of noise in the input samples].
Does anyone have any ideas?
Thanks.