views:

68

answers:

3

We have a simple SQL problem here. In a varchar column, we wanted to search for a string anywhere in the field. What is the best way to implement this for performance? Obviously an index is not going to help here, any other tricks?

We are using MySQL and have about 3 million records. We need to execute many of these queries per second so really trying to implement these with the best performance.

The most simple way to do this is so far is:

Select * from table where column like '%search%'

I should further specify that the column is actually a long string like "sadfasdfwerwe" and I have to search for "asdf" in this column. So they are not sentences and trying to match a word in them. Would full text search still help here?

A: 

I you want to match whole words, look at a FULLTEXT index & MATCH() AGAINST(). And of course, take a load of your database server: cache results for a appropriate amount of time for you specific needs.

Wrikken
A: 

First, maybe this is an issue with a badly designed table that stores a delimited string in one field instead of correctly designing to make a related table. If this is the case, you should fix your design.

If you have a field with long descriptive text (saya a notes field) and the search is always by whole word, you can do a full-text search.

Consider if you can require your users to at least give you the first character of what they are searching for if it is an ordinary field like Last_name.

Consider doing an exact match search first and only performing the wildcard match if no results are returned. This will work if you have users who can provide exact matches. We did this once with airport name searches, it came back really fast if they put inthe exact name and slower if they did not.

If you want to search just for strings that are not words that may be somewhere in the text, you are pretty much stuck with bad performance.

HLGEM
A: 

Check out my presentation Practical Fulltext Search in MySQL.

I compared:

Today what I would use is Apache Solr, which puts Lucene into a service with a bunch of extra features and tools.


Re your comment: Aha, okay, no. None of the fulltext search capabilities I mentioned are going to help, since they all assume some kind of word boundaries

The other way to efficiently find arbitrary substrings is the N-gram approach. Basically, create an index of all possible sequences of N letters and point to the strings where each respective sequence occurs. Typically this is done with N=3, or a trigram, because it's a point of compromise between matching longer substrings and keeping the index to a manageable size.

I don't know of any SQL database that supports N-gram indexing transparently, but you could set it up yourself using an inverted index:

create table trigrams (
  trigram char(3) primary key
);

create table trigram_matches (
  trigram char(3),
  document_id int,
  primary key (trigram, document_id),
  foreign key (trigram) references trigrams(trigram),
  foreign key (document_id) references mytable(document_id)
);

Now populate it the hard way:

insert into trigram_matches
  select t.trigram, d.document_id
  from trigrams t join mytable d
    on d.textcolumn like concat('%', t.trigram, '%');

Of course this will take quite a while! But once it's done, you can search much more quickly:

select d.*
from mytable d join trigram_matches t
  on t.document_id = d.document_id
where t.trigram = 'abc'

Of course you could be searching for patterns longer than three characters, but the inverted index still helps to narrow your search a lot:

select d.*
from mytable d join trigram_matches t
  on t.document_id = d.document_id
where t.trigram = 'abc'
  and d.textcolumn like '%abcdef%';
Bill Karwin
I've reedited the question a little bit, does this still apply?
erotsppa