Everyone knows the "=" sign.

SELECT * FROM mytable WHERE column1 = column2;

However, what if column1 and column2 have different contents, but they are VERY similar (maybe off by a space, or with one word that's different)?

Is it possible to:

SELECT * FROM mytable WHERE ....column matches column2 with .4523423 "Score"...

I believe the technical term for this is fuzzy matching, or perhaps pattern matching?

EDIT: I know about Soundex and Levenshtein distance. Is that what you recommend?

A: 

You retrieve all the data, then process it with whatever you are using...

Luiscencio
+4  A: 

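What you are looking for is called Levenshtein distance. It gives you a number that describes the difference between two strings: the minimum count of single-character insertions, deletions, and substitutions needed to turn one string into the other.

In MySQL you have to write a stored procedure for that. Here is the article that may help.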
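
To make the metric concrete outside of MySQL, here is a minimal Python sketch of the usual dynamic-programming way to compute Levenshtein distance. It is not the stored procedure from the article. The function names `levenshtein` and `similarity` are made up for illustration, and dividing by the longer string's length is just one assumed way to get a 0..1 "score" like the one in the question.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical normalization: 1.0 means identical, 0.0 means nothing in common."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
```

For example, `levenshtein("kitten", "sitting")` is 3 (two substitutions and one insertion). The cost grows with the product of the two string lengths, so this is fine for short column values but is not a fix for the 40 million row problem by itself.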

Lukasz Lysik
I understand that I need to use Levenshtein distance. But how do I scale this? I have 40 million rows. How can I do this efficiently, or distribute it so it doesn't crash the server?
TIMEX
who the (foo) is Levenshtein?
Luiscencio
A: 

Lukasz Lysik posted a reference to a stored procedure that can do the fuzzy match from inside the database. If you want to do this as an ongoing task, that is your best bet.

But if you want to do this as a one-off task, and if you might want to do complicated checks, or if you want to do something complicated to clean up the fuzzy matches, you might want to do the fuzzy matching from within Python. (One of your tags is "python" so I assume you are open to a Python solution...)

Using a Python ORM, you can get a Python list with one object per database row, and then use the full power of Python to analyze your data. You could use regular expressions, Python Levenshtein functions, or anything else.
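
As a rough sketch of that idea, the snippet below pulls rows into Python and scores each pair of columns. To stay self-contained it makes several assumptions: it uses the standard library's `sqlite3` with an in-memory database as a stand-in for MySQL and an ORM, it scores with `difflib.SequenceMatcher` (a similarity ratio, not Levenshtein distance), and the `id` column, the sample rows, the `fuzzy_rows` helper, and the 0.8 threshold are all invented for illustration.

```python
import sqlite3
from difflib import SequenceMatcher


def fuzzy_rows(conn, threshold=0.8):
    """Yield (id, column1, column2, score) for rows whose two columns are similar."""
    cursor = conn.execute("SELECT id, column1, column2 FROM mytable")
    # Iterating the cursor streams rows instead of loading them all at once.
    for row_id, left, right in cursor:
        score = SequenceMatcher(None, left, right).ratio()  # 0.0 .. 1.0
        if score >= threshold:
            yield row_id, left, right, score


# Hypothetical sample data standing in for the real table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (id INTEGER PRIMARY KEY, column1 TEXT, column2 TEXT)"
)
conn.executemany(
    "INSERT INTO mytable (column1, column2) VALUES (?, ?)",
    [("John Smith", "John  Smith"), ("apple", "orange"), ("same", "same")],
)

matches = list(fuzzy_rows(conn))
for row_id, left, right, score in matches:
    print(row_id, repr(left), repr(right), round(score, 3))
```

With this sample data, the first row (off by one space) and the third row (identical) pass the threshold, while "apple" and "orange" do not. Against a real MySQL server you would swap the connection for your driver or ORM session and keep the scoring loop the same.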

The all-around best ORM for Python is probably SQLAlchemy. I actually like the ORM from Django a little better; it's a little simpler, and I value simplicity. If your ORM needs are not complicated, the Django ORM may be a good choice. If in doubt, just go to SQLAlchemy.

Good luck!

steveha