views:

36

answers:

2

I've got a 'somewhat' ugly field in a database which holds the names of locations. For instance, Madison Square Gardens which has also been entered as "The Madison Square Gardens", etc. etc.

I'm trying to extract the data so that I can get an accurate list of all the locations. In order to accomplish this, what I've done is created a sql query where I join the events for each location, and then group by the location name and only use location groups having more than 10 entries (that filters out the somewhat non-reliable entries), but I still end up with Some very different spellings and entries, resulting in duplicate properties/locations.

My SQL query looks like this

"SELECT location, COUNT(*) FROM locations 
JOIN event ON locations.lid=events.lid
WHERE `long`
BETWEEN - 74.419382608696
AND - 73.549817391304
AND lat
BETWEEN 40.314017391304
AND 41.183582608696
GROUP BY location 
HAVING COUNT(*)>10

Running this query provides 3 different entries "Madison Square Garden", "Madison Square Gardens", "The Madison Square Garden". Of course, this is only for the Madison Square Garden entry. Most entries have multiple slightly different spellings.

I restrict my searches by lat/long so I don't get locations with the same name in different cities grouped together.

Is there a way with Regular expressions or something in the GROUP clause to have these grouped consistently? Even just removing the trailing 's', and 'the' before the grouping would probably be a big benefit.

I was going to take each result and then do a regular expression match against all the locations in within the lat/long range?

Fortunately I have enough linked events to locations, that I am somewhat able to recognize the major locations.

Any other suggestions for extracting locations from semi-structured data? The data is scrapped from a variety of sources, so I don't have control over the input.

A: 

If your issue is treating "similar" strings the same, you may want to check out the SOUNDEX algorithm. I'm not sure if it will work for all of your different scenarios, but it's a start.

It's discussed in this thread: http://stackoverflow.com/questions/369755/how-do-i-do-a-fuzzy-match-of-company-names-in-mysql-with-php-for-auto-complete

dcp
Not a bad idea with Soundex, but the grouping by soundex(location) returns the same result as the non-soundex query.
pedalpete
Maybe you could try a more advanced algo then, like the one here: http://codejanitor.com/wp/2007/02/10/levenshtein-distance-as-a-mysql-stored-function/
dcp
i JUST read the exact same thing!
pedalpete
+1  A: 

Here are some suggestions for you.

Create a normalized venue-name column in your data base: (1) Run each name through some simple transformations ... Turn "The Madison Square Garden" and "The Washington Monument" into "Madison Square Garden" and "Washington Monument" Turn plural nouns into singular the easy way ... strip "es", then "s" from each word in your name. Downcase everything. Eliminate any remaining short words "a" "it" "the" "and" "&" you get the idea. Sort your words into alphabetical order, getting you "garden madison square" Store that resulting string into a new column in your table. Match on it, while still displaying your original string.

(2) Create a lookup table with variant spellings of venues. This works well for venue names like "Boston Garden" / "Fleet Center" / "TD Banknorth Garden" / "North Station" and junk like that. Same place, different spelling. ("Penn Station" for your example).

(3) You could use the Yahoo or Google Maps geocoding services, which will take incomplete names and addresses and standardize them.

Soundex is going to get you quite a few false positive matches. It's designed as a fallback and requires human disambiguation.

Ollie Jones
I was thinking something like that Ollie. Is there a way to do these transformations in mysql? or do I need to make the request into php, then transform the text and update? Any suggestions?
pedalpete
Pete, I think you're going to need to use php. To do it with a stored mySql function, for example, "NORMALIZE_NAME", you'd need a REGEXP_REPLACE primitive function, and that isn't ready yet.If this is a high-volume application, you might consider creating a mySql user-defined function. But you'd have to create that in C, which would be a PITA to maintain.
Ollie Jones
Thanks Ollie, I suspected as much, but thought it might be quicker in sql alone.
pedalpete
Pete, on further thought, you could probably do 90% of your normalization with a stored mySQL function. Certainly downcasing, eliminating leading "the" words and trailing "s" letters can be done with the string-manipulation stuff available to mySQL procedures. It would obviously be a lot more convenient to have this as a stored function than an external function.
Ollie Jones