I've got a somewhat ugly field in a database that holds the names of locations. For instance, "Madison Square Gardens" has also been entered as "The Madison Square Gardens", and so on.
I'm trying to extract the data so that I can get an accurate list of all the locations. To do this, I've written an SQL query that joins the events for each location, groups by the location name, and keeps only the location groups with more than 10 entries (which filters out the less reliable entries). I still end up with several different spellings of the same name, resulting in duplicate locations.
My SQL query looks like this:
SELECT location, COUNT(*) FROM locations JOIN events ON locations.lid = events.lid WHERE `long` BETWEEN -74.419382608696 AND -73.549817391304 AND lat BETWEEN 40.314017391304 AND 41.183582608696 GROUP BY location HAVING COUNT(*) > 10
Running this query returns three different entries: "Madison Square Garden", "Madison Square Gardens", and "The Madison Square Garden". Of course, this is only the Madison Square Garden case. Most locations have multiple slightly different spellings.
I restrict my searches by lat/long so I don't get locations with the same name in different cities grouped together.
Is there a way, with regular expressions or something similar in the GROUP BY clause, to have these grouped consistently? Even just removing a trailing 's' and a leading 'the' before grouping would probably be a big benefit.
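To make the idea concrete, here is a rough sketch of the normalization I have in mind, done in Python on the query results rather than inside the GROUP BY. The function names (normalize, merge_counts) and the counts in the usage note are made up for illustration, and the rules are deliberately crude: stripping any trailing 's' will also mangle names that legitimately end in 's'.

```python
import re
from collections import defaultdict


def normalize(name):
    """Build a grouping key: lowercase, collapse whitespace,
    drop a leading 'the' and a single trailing 's'."""
    key = re.sub(r"\s+", " ", name.strip().lower())
    key = re.sub(r"^the\s+", "", key)
    key = re.sub(r"s$", "", key)
    return key


def merge_counts(rows):
    """rows: iterable of (location, count) pairs, e.g. the query output.

    Returns {display_name: total_count}, where display_name is the
    most frequent original spelling within each normalized group.
    """
    groups = defaultdict(dict)
    for location, count in rows:
        variants = groups[normalize(location)]
        variants[location] = variants.get(location, 0) + count

    merged = {}
    for variants in groups.values():
        display_name = max(variants, key=variants.get)
        merged[display_name] = sum(variants.values())
    return merged
```

With hypothetical counts such as [("Madison Square Garden", 120), ("Madison Square Gardens", 15), ("The Madison Square Garden", 12)], the three rows collapse into a single "Madison Square Garden" entry with a total of 147. If I went this way, the "more than 10 entries" threshold would probably be better applied after merging, so that rare spellings still count toward their group.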
Alternatively, I was going to take each result and then do a regular expression match against all the locations within the lat/long range.
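A variation on that second idea, sketched below, swaps the regular expression for a plain string-similarity score from Python's standard difflib module. This is not what I currently do; the helper names and the 0.85 threshold are arbitrary assumptions that would need tuning against the real data.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def variants_of(canonical, candidates, threshold=0.85):
    """Return the candidate names that look like spellings of `canonical`.

    `candidates` would be all location names inside the same
    lat/long range; `threshold` is a guess, not a tested value.
    """
    return [
        name
        for name in candidates
        if name != canonical and similarity(canonical, name) >= threshold
    ]
```

For example, variants_of("Madison Square Garden", ["Madison Square Gardens", "The Madison Square Garden", "Radio City Music Hall"]) keeps the two Madison Square Garden spellings and drops Radio City Music Hall.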
Fortunately, I have enough events linked to locations that I can more or less recognize the major locations.
Any other suggestions for extracting locations from semi-structured data? The data is scraped from a variety of sources, so I don't have control over the input.