
Hello community,

I'm learning to program with Ruby on Rails and I've reached my hardest task yet. I have two tables, each containing a list of business categories (e.g. Thai Food Restaurant, Plumber, Office Supply Store). These two tables come from two different APIs, with my app essentially acting as the middleman between them. They list more or less the same types of categories, but they often phrase them differently (e.g. "auto body repair and painting" vs. "automobile body repair and painting").

My first goal was to design the model for this task. I decided on a many-to-many relationship and tested it by manually mapping two rows. Done. Tested. Awesome.

My second goal is to write an algorithm to match rows from the two tables to each other, giving precedence based on similarity. I'm guessing a lot of the work can be done in MySQL. Here is my pseudo-code so far:

for each row in table1
    split row.name into words on spaces

    for each word
        SELECT name, id FROM table2 WHERE name LIKE '%word%'

    rank the table2 rows by how many of the words they matched
    insert the top-ranked relationships into the map table
end
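For what it's worth, here is one way the pseudo-code above could be sketched in plain Ruby rather than SQL, matching rows by counting shared words. The `table1`/`table2` shapes (arrays of hashes with `:id` and `:name`) are stand-ins for whatever your models actually return, and this is a rough sketch, not a tuned implementation:

```ruby
# Split a phrase into lowercase words.
def words(phrase)
  phrase.downcase.split(/\W+/).reject(&:empty?)
end

# For each row in table1, find the table2 row sharing the most words
# with it, and record the pairing if at least one word matched.
def best_matches(table1, table2)
  table1.map do |row|
    row_words = words(row[:name])
    best = table2.max_by { |cand| (row_words & words(cand[:name])).size }
    score = (row_words & words(best[:name])).size
    { from: row[:id], to: best[:id], shared_words: score } if score > 0
  end.compact
end

t1 = [{ id: 1, name: "Auto Body Repair and Painting" }]
t2 = [{ id: 9,  name: "Automobile Body Repair and Painting" },
      { id: 10, name: "Thai Food Restaurant" }]
best_matches(t1, t2)
# => [{ from: 1, to: 9, shared_words: 4 }]
```

Note that "auto" and "automobile" still don't count as a shared word here, which is exactly the phrasing problem described above.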

I've never designed a search algorithm before, so any help you could lend is much appreciated. I'm having fun figuring this out, but I thought I'd reach out and get some advice from the pros out there.

Cheers!

Update: A co-worker recommended I check out a website called Mechanical Turk, which turned out to be the most cost-effective way of mapping the categories. All I had to do was build a simple form, and it wound up costing approximately $3.00 per thousand matches.

A: 

You shouldn't do it this way.

A better way is to normalize your data, and then simply match it.

If "auto" == "automobile", then change every "automobile" to "auto", much as you would convert everything from mixed case to all upper- or lowercase.

Store the actual value, "normalize" it and store that too; then you can match on the normalized data.

This also allows you to have arbitrarily sophisticated normalization rules, far beyond what is available in SQL, and you run them only once rather than for every query.

Edit -- for comment.

How about this then.

I still think you can normalize it without a lot of work. Some work, yes, but not a huge amount.

Run through all of the data and do some basic free-form text processing: normalize case, apply stemming, and throw out "uninteresting" words -- "stop" words, as they're called. You can find the basics for this kind of processing on the web. It's not crushingly difficult.

In the end, you will have a list of unique words for your text. You should scan this list, no doubt you will find other words you may want to strip out as "stop" words. If you find some "obvious" synonyms, you can make a synonym map as part of your normalization process. Up to you. There will likely not be an insane amount of "interesting" words, in the end.
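A minimal sketch of that normalization step in Ruby. The stop-word list and synonym map here are tiny illustrative stand-ins -- you'd grow both as you scan your own word list:

```ruby
STOP_WORDS = %w[and the of a].freeze            # tiny illustrative list
SYNONYMS   = { "automobile" => "auto" }.freeze  # grow as you find "obvious" synonyms

# Turn a raw category name into its list of normalized "interesting" words:
# lowercase, split on non-word characters, drop stop words, map synonyms.
def normalize(name)
  name.downcase
      .split(/\W+/)
      .reject { |w| w.empty? || STOP_WORDS.include?(w) }
      .map { |w| SYNONYMS.fetch(w, w) }
end

normalize("Automobile Body Repair and Painting")
# => ["auto", "body", "repair", "painting"]
```

With this in place, "Auto Body Repair & Painting" and "Automobile Body Repair and Painting" normalize to the same word list, so they match exactly.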

A nice thing to do at this point is to create an inverted index table. For example, if you have 3 entries with the word "auto", you have 3 rows in your inverted index table:

"auto", 123
"auto", 456
"auto", 789

If you ever want to find what rows have "auto", you join to that index table.
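An in-memory Ruby version of that inverted index is just a hash from word to row ids. The throwaway `normalize` here is a placeholder for the fuller normalization described above:

```ruby
# A throwaway normalize for this sketch; swap in your real one.
def normalize(name)
  name.downcase.split(/\W+/).reject(&:empty?)
end

# Build word => [row ids] from rows shaped like { id:, name: }.
def build_inverted_index(rows)
  index = Hash.new { |h, k| h[k] = [] }
  rows.each do |row|
    normalize(row[:name]).uniq.each { |word| index[word] << row[:id] }
  end
  index
end

# Candidate rows: any row sharing at least one interesting word.
def candidates(index, name)
  normalize(name).flat_map { |word| index[word] }.uniq
end
```

The SQL equivalent is a two-column table (`word`, `row_id`) that you join against, as described above.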

Now, you're iterating across your dataset.

You convert your text, using the above technique, i.e. normalize the text, and now have a list of "interesting" words.

You use your inverted index to find all of the qualifying rows that are even potential candidates for a match, i.e. all rows that have at least one of the words in your "interesting" list.

Finally, you then normalize the text of these rows and intersect your set of interesting words with theirs. That gives you the number of matching words. You can then do something as simple as "number of matching words / number of words in candidate" to get a basic percentage rating.

You can see that you can do some of this processing on the fly, store a lot of the intermediate results (like the normalized text), or simply take the rows that score above a threshold (say, 75%) and stick them in your joining table.
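The scoring and threshold step might look like this sketch, where the 0.75 cutoff is just the illustrative figure from above and `candidates_by_id` maps row id to that row's normalized word list:

```ruby
# Fraction of the candidate's words that are shared with ours.
def score(words, candidate_words)
  return 0.0 if candidate_words.empty?
  (words & candidate_words).size.to_f / candidate_words.size
end

THRESHOLD = 0.75  # illustrative cutoff; tune to taste

# Keep only candidate ids whose score clears the threshold.
def matches_above_threshold(words, candidates_by_id)
  candidates_by_id.select { |_id, cand| score(words, cand) >= THRESHOLD }.keys
end

score(%w[auto body repair painting], %w[auto body repair painting]) # => 1.0
score(%w[auto body repair painting], %w[thai food restaurant])      # => 0.0
```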

Also, this work can be done iteratively. You don't have to reprocess the entire data set again.

When you get a new row, find those rows that have shared words, then rescore THEM as well as the new one. When a row is deleted, simply remove the inverted index entries along with your joiner entry. When a row is updated, simply "delete" it first, then "add" it back. (i.e. remove all of the old relationships, then rescore the updated row and those with shared words).

Should be pretty quick in the end.

Will Hartung
Will, that's an interesting point and I'm sure a very good method for this type of task. I'm not sure it applies, though... For one, there are 15k+ rows in one of the tables, so normalizing all of those words would take a lot of man-hours. Secondly, I only need to run this once (or once every time the table is updated) to populate my join model table; after that it's all just a matter of Rails doing the cat1.cat2's, which is a simple query. Thanks for your input. If you think I'm wrong, by all means let me know. Thanks for your time : )
Bloudermilk
A: 

I don't have a full answer, but I do have some suggestions.

If I understand what you're trying to do I think it might be hard to fully automate.

If the datasets each have "1% of weirdness", you can expect that a few hundred manual mappings will be needed.

From the outset, assume some matches will be easy, some hard, and that several rules will be needed.

Your first rule might be "exact phrase matches match", and the "maximize matching word count" rule you describe might be good as a second rule. But you may get some false matches. In the example you quote, the word that doesn't match - auto versus automobile - is pretty key to realising that there is a match.
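The cascade of rules could be sketched like this in Ruby, with each rule tried in order of confidence and `nil` meaning "no rule matched, flag for manual review". The cutoff of two shared words is purely illustrative:

```ruby
# Try matching rules in order of confidence; nil means "needs manual review".
def match_category(name, others)
  return nil if others.empty?

  # Rule 1: exact phrase match (case-insensitive).
  exact = others.find { |o| o.casecmp?(name) }
  return exact if exact

  # Rule 2: maximize shared word count, but only accept a clear winner.
  words = name.downcase.split(/\W+/)
  scored = others.map { |o| [o, (words & o.downcase.split(/\W+/)).size] }
  best, count = scored.max_by { |_, c| c }
  count >= 2 ? best : nil   # illustrative cutoff; tune or review manually
end
```

Returning `nil` rather than forcing a match is one way to handle the categories that have no equivalent on the other side.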

You'll need to have some way of handling the categories that don't have an equivalent on the other side, which in a set of 15K there surely will be.

When you get data updates, compare the new and old and only process the changes - minimise the work.

martin clayton