views:

61

answers:

1

I have a large number (about 40 million) of VARCHAR entries in a MySQL table. The length of the string can be anywhere between 5-80 characters. I am trying to group similar text together and thought of a possible approach:

Take a row and calculate a similarity measure (like Edit Distance) with every other row and decide (I am not sure how to decide though) whether each belongs to the same group. For instance, I have the following entries:

The quick brown fox
The qick brwn fox
This is another sentence
Ths is another sntence

I want to be able to convert this into a form where I assign a group ID and then get the best match (so in this case, it would be 'The quick brown fox' and 'This is another sentence' but assign the group ID of 1 to both the 'The quick brown fox' and 'The qick brwn fox' entries and 2 to the other set).

Is there a better approach for such a problem? Like maybe utilize indexing schemes or other database advantages? Also, just a confirmation, I am not trying to find rows containing similar text, but rather rows that are similar to each other. Perhaps, a good reasoning I can give is that some rows are different due to typo errors and I want to correct them.

EDIT 2: Open to other ways not using MySQL that could be reasonably comparable to a DB's performance

So after a little research and the answer given below, this will not be so easy and I might have to look into fuzzy matching. Are there any good approaches for this considering that my data is now stored in a database?

EDIT 1: Attempt using MySQL's FULLTEXT

mysql> create table fulltextsim(id INT PRIMARY KEY AUTO_INCREMENT, text TEXT, FULLTEXT(text));
Query OK, 0 rows affected (0.44 sec)

mysql> insert into fulltextsim(text) VALUES("The quick brown fox");
Query OK, 1 row affected (0.02 sec)

mysql> insert into fulltextsim(text) VALUES("The qick brwn fox");
Query OK, 1 row affected (0.00 sec)

mysql> insert into fulltextsim(text) VALUES("This is another sentence");
Query OK, 1 row affected (0.00 sec)

mysql> insert into fulltextsim(text) VALUES("Ths is anther sntence");
Query OK, 1 row affected (0.00 sec)

mysql> SELECT * FROM fulltextsim WHERE MATCH(text) AGAINST ('The qick brwn');
+----+-------------------+
| id | text              |
+----+-------------------+
|  2 | The qick brwn fox |
+----+-------------------+
1 row in set (0.02 sec)

mysql> SELECT * FROM fulltextsim WHERE MATCH(text) AGAINST ('The qick fox');
+----+-------------------+
| id | text              |
+----+-------------------+
|  2 | The qick brwn fox |
+----+-------------------+
1 row in set (0.00 sec)

I wanted the 'The quick brown fox' row as well.

A: 

Have you looked at MySQL's FULLTEXT functiality?

UPDATE -- MySQL's FULLTEXT doesn't seem to support fuzzy searching, which is what you are looking for here. Check out http://stackoverflow.com/questions/3047641/mysql-full-text-search-boolean-mode-partial-match

MySQL does support the SOUNDEX() function, which will match words that sound similar to what is entered, but this does not work for phrases.

So, I think you may be out of luck.

Matthew Flynn
@Matthew: Yes. I did that before posting this question. I updated my post with the results I got. Please correct me if I'm using it incorrectly.
Legend
@Matthew: Thank you for the pointers. From what you mentioned, I guess this cannot be done just by using MySQL. I revised my question to a broader range - I am open to 'any' approach that can be reasonably fast on millions of rows.
Legend