I have a large number (about 40 million) of VARCHAR entries in a MySQL table. The length of the string can be anywhere between 5-80 characters. I am trying to group similar text together and thought of a possible approach:
Take a row and calculate a similarity measure (like Edit Distance) with every other row and decide (I am not sure how to decide though) whether each belongs to the same group. For instance, I have the following entries:
The quick brown fox
The qick brwn fox
This is another sentence
Ths is another sntence
I want to be able to convert this into a form where I assign a group ID and then get the best match (so in this case, it would be 'The quick brown fox' and 'This is another sentence' but assign the group ID of 1 to both the 'The quick brown fox' and 'The qick brwn fox' entries and 2 to the other set).
Is there a better approach for such a problem? Like maybe utilize indexing schemes or other database advantages? Also, just a confirmation, I am not trying to find rows containing similar text, but rather rows that are similar to each other. Perhaps, a good reasoning I can give is that some rows are different due to typo errors and I want to correct them.
EDIT 2: Open to other ways not using MySQL that could be reasonably comparable to a DB's performance
So after a little research and the answer given below, this will not be so easy and I might have to look into fuzzy matching. Are there any good approaches for this considering that my data is now stored in a database?
EDIT 1: Attempt using MySQL's FULLTEXT
mysql> create table fulltextsim(id INT PRIMARY KEY AUTO_INCREMENT, text TEXT, FULLTEXT(text));
Query OK, 0 rows affected (0.44 sec)
mysql> insert into fulltextsim(text) VALUES("The quick brown fox");
Query OK, 1 row affected (0.02 sec)
mysql> insert into fulltextsim(text) VALUES("The qick brwn fox");
Query OK, 1 row affected (0.00 sec)
mysql> insert into fulltextsim(text) VALUES("This is another sentence");
Query OK, 1 row affected (0.00 sec)
mysql> insert into fulltextsim(text) VALUES("Ths is anther sntence");
Query OK, 1 row affected (0.00 sec)
mysql> SELECT * FROM fulltextsim WHERE MATCH(text) AGAINST ('The qick brwn');
+----+-------------------+
| id | text |
+----+-------------------+
| 2 | The qick brwn fox |
+----+-------------------+
1 row in set (0.02 sec)
mysql> SELECT * FROM fulltextsim WHERE MATCH(text) AGAINST ('The qick fox');
+----+-------------------+
| id | text |
+----+-------------------+
| 2 | The qick brwn fox |
+----+-------------------+
1 row in set (0.00 sec)
I wanted the 'The quick brown fox' row as well.