
Hello community,

I'm learning to program with Ruby on Rails and I've reached my hardest task yet. I have two tables, each containing a list of business categories (e.g. Thai Food Restaurant, Plumber, Office Supply Store). These two tables come from two different APIs, with my app essentially acting as the middleman between them. They list more or less the same types of categories, but they often phrase them differently (e.g. "auto body repair and painting" vs. "automobile body repair and painting").

My first goal was to design the model for this task. I decided on a many-to-many relationship and tested it by manually mapping two rows. Done. Tested. Awesome.

My second goal is to write an algorithm to match rows from the two tables to each other, giving precedence based on similarity. I'm guessing a lot of the work can be done in MySQL. Here is my pseudo-code so far:

for each row in table1
    split row.name into words on spaces

    for each word
        SELECT name, id FROM table2 WHERE name LIKE '%word%'

    rank the table2 rows by how many of the words they matched
    insert the top-ranked relationships into the map table
end
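For what it's worth, here is one way the pseudo-code above could be sketched in plain Ruby rather than SQL, matching rows by counting shared words. The `table1`/`table2` shapes (arrays of hashes with `:id` and `:name`) are stand-ins for whatever your models actually return, and this is a rough sketch, not a tuned implementation:

```ruby
# Split a phrase into lowercase words.
def words(phrase)
  phrase.downcase.split(/\W+/).reject(&:empty?)
end

# For each row in table1, find the table2 row sharing the most words
# with it, and record the pairing if at least one word matched.
def best_matches(table1, table2)
  table1.map do |row|
    row_words = words(row[:name])
    best = table2.max_by { |cand| (row_words & words(cand[:name])).size }
    score = (row_words & words(best[:name])).size
    { from: row[:id], to: best[:id], shared_words: score } if score > 0
  end.compact
end

t1 = [{ id: 1, name: "Auto Body Repair and Painting" }]
t2 = [{ id: 9,  name: "Automobile Body Repair and Painting" },
      { id: 10, name: "Thai Food Restaurant" }]
best_matches(t1, t2)
# => [{ from: 1, to: 9, shared_words: 4 }]
```

Note that "auto" and "automobile" still don't count as a shared word here, which is exactly the phrasing problem described above.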

I've never designed a search algorithm before, so any help you could lend is much appreciated. I'm having fun figuring this out, but I thought I'd reach out and get some advice from the pros out there.

Cheers!

Update: A co-worker recommended I check out a website called Mechanical Turk, which turned out to be the most cost-effective way of mapping the categories. All I had to do was build a simple form, and it wound up costing approximately $3.00 per thousand matches.

A: 

You shouldn't do it this way.

A better way is to normalize your data, and then simply match it.

If "auto" == "automobile", then change every "automobile" to "auto", much as you would convert everything from mixed case to all upper- or lowercase.

Store the actual value, "normalize" it and store that too; then you can match on the normalized data.

This also allows you to have arbitrarily sophisticated normalization rules, far beyond what is available in SQL, and you run them only once rather than for every query.

Edit -- for comment.

How about this then.

I still think you can normalize it without a lot of work. Some work, yes, but not a huge amount.

Run through all of the data and do some basic free-form text processing: normalize case, apply stemming, and throw out "uninteresting" words -- "stop" words, as they're called. You can find the basics for this kind of processing on the web. It's not crushingly difficult.

In the end, you will have a list of unique words for your text. You should scan this list, no doubt you will find other words you may want to strip out as "stop" words. If you find some "obvious" synonyms, you can make a synonym map as part of your normalization process. Up to you. There will likely not be an insane amount of "interesting" words, in the end.
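A minimal sketch of that normalization step in Ruby. The stop-word list and synonym map here are tiny illustrative stand-ins -- you'd grow both as you scan your own word list:

```ruby
STOP_WORDS = %w[and the of a].freeze            # tiny illustrative list
SYNONYMS   = { "automobile" => "auto" }.freeze  # grow as you find "obvious" synonyms

# Turn a raw category name into its list of normalized "interesting" words:
# lowercase, split on non-word characters, drop stop words, map synonyms.
def normalize(name)
  name.downcase
      .split(/\W+/)
      .reject { |w| w.empty? || STOP_WORDS.include?(w) }
      .map { |w| SYNONYMS.fetch(w, w) }
end

normalize("Automobile Body Repair and Painting")
# => ["auto", "body", "repair", "painting"]
```

With this in place, "Auto Body Repair & Painting" and "Automobile Body Repair and Painting" normalize to the same word list, so they match exactly.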

A nice thing to do at this point is to create an inverted index table. For example, if you have 3 entries with the word "auto", you have 3 rows in your inverted index table:

"auto", 123
"auto", 456
"auto", 789

If you ever want to find what rows have "auto", you join to that index table.
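An in-memory Ruby version of that inverted index is just a hash from word to row ids. The throwaway `normalize` here is a placeholder for the fuller normalization described above:

```ruby
# A throwaway normalize for this sketch; swap in your real one.
def normalize(name)
  name.downcase.split(/\W+/).reject(&:empty?)
end

# Build word => [row ids] from rows shaped like { id:, name: }.
def build_inverted_index(rows)
  index = Hash.new { |h, k| h[k] = [] }
  rows.each do |row|
    normalize(row[:name]).uniq.each { |word| index[word] << row[:id] }
  end
  index
end

# Candidate rows: any row sharing at least one interesting word.
def candidates(index, name)
  normalize(name).flat_map { |word| index[word] }.uniq
end
```

The SQL equivalent is a two-column table (`word`, `row_id`) that you join against, as described above.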

Now, you're iterating across your dataset.

You convert your text, using the above technique, i.e. normalize the text, and now have a list of "interesting" words.

You use your inverted index to find all of the qualifying rows that are even potential candidates for a match, i.e. all rows that have at least one of the words in your "interesting" list.

Finally, you then normalize the text of these rows and intersect your set of interesting words with theirs. That gives you the number of matching words. You can then do something as simple as "number of matching words / number of words in candidate" to get a basic percentage rating.

You can see that you can do some of this processing on the fly, store a lot of the intermediate results (like the normalized text), or simply take the rows that score above a threshold (say, 75%) and stick them in your joining table.
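The scoring and threshold step might look like this sketch, where the 0.75 cutoff is just the illustrative figure from above and `candidates_by_id` maps row id to that row's normalized word list:

```ruby
# Fraction of the candidate's words that are shared with ours.
def score(words, candidate_words)
  return 0.0 if candidate_words.empty?
  (words & candidate_words).size.to_f / candidate_words.size
end

THRESHOLD = 0.75  # illustrative cutoff; tune to taste

# Keep only candidate ids whose score clears the threshold.
def matches_above_threshold(words, candidates_by_id)
  candidates_by_id.select { |_id, cand| score(words, cand) >= THRESHOLD }.keys
end

score(%w[auto body repair painting], %w[auto body repair painting]) # => 1.0
score(%w[auto body repair painting], %w[thai food restaurant])      # => 0.0
```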

Also, this work can be done iteratively. You don't have to reprocess the entire data set again.

When you get a new row, find those rows that have shared words, then rescore THEM as well as the new one. When a row is deleted, simply remove the inverted index entries along with your joiner entry. When a row is updated, simply "delete" it first, then "add" it back. (i.e. remove all of the old relationships, then rescore the updated row and those with shared words).

Should be pretty quick in the end.

Will Hartung
Will, that's an interesting point and I'm sure a very good method for this type of task. I'm not sure it applies, though... For one, there are 15k+ rows in one of the tables, so normalizing all of those words would take a lot of man-hours. Secondly, I only need to run this once (or once every time the table is updated) to populate my join model table; after that it's all just a matter of Rails doing the cat1.cat2's, which is a simple query. Thanks for your input. If you think I'm wrong, by all means let me know. Thanks for your time : )
Bloudermilk
A: 

I don't have a full answer, but I do have some suggestions.

If I understand what you're trying to do I think it might be hard to fully automate.

If the datasets each have "1% of weirdness", you can expect that a few hundred manual mappings will be needed.

From the outset, assume some matches will be easy, some hard, and that several rules will be needed.

Your first rule might be "exact phrase matches match", and the "maximize matching word count" rule you describe might be good as a second rule. But you may get some false matches. In the example you quote, the word that doesn't match - auto versus automobile - is pretty key to realising that there is a match.
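The cascade of rules could be sketched like this in Ruby, with each rule tried in order of confidence and `nil` meaning "no rule matched, flag for manual review". The cutoff of two shared words is purely illustrative:

```ruby
# Try matching rules in order of confidence; nil means "needs manual review".
def match_category(name, others)
  return nil if others.empty?

  # Rule 1: exact phrase match (case-insensitive).
  exact = others.find { |o| o.casecmp?(name) }
  return exact if exact

  # Rule 2: maximize shared word count, but only accept a clear winner.
  words = name.downcase.split(/\W+/)
  scored = others.map { |o| [o, (words & o.downcase.split(/\W+/)).size] }
  best, count = scored.max_by { |_, c| c }
  count >= 2 ? best : nil   # illustrative cutoff; tune or review manually
end
```

Returning `nil` rather than forcing a match is one way to handle the categories that have no equivalent on the other side.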

You'll need to have some way of handling the categories that don't have an equivalent on the other side, which in a set of 15K there surely will be.

When you get data updates, compare the new and old and only process the changes - minimise the work.

martin clayton