tags:

views:

45

answers:

2

I'm trying to search through phone numbers for any phone number containing some series of digits.

Obviously the following is going to be slow:

 Select * from customer where phone like '%1234%'

I need the wildcards, because the users were allowed to enter any data in the database, so it might have country codes, a leading 1 (as in 1-800) or a trailing extension (which was sometimes just separated by a space.

Note: I've already created 'cleaned' phone numbers by removing all non-digit characters, so I don't have to worry about dashes, spaces, and the like.

Is there any magic to get a search like this to run in reasonable time?

+2  A: 

If you're using MySQL, you're looking for the Full Text Search functionality http://dev.mysql.com/doc/refman/5.1/en/fulltext-search.html

It specifically optimizes queries like the one you have listed, and is pretty darn fast once set up. You'll need your data in MySQL, and it must be in a MyISAM table (not InnoDB or other.)

I've used it in production and it works pretty well.

Joshua
Full text search is definitely the way to go.
James
Does full text search work within the individual words?
Will Hartung
In a limited fashion... somewhat of a caveat. See the * operator here http://dev.mysql.com/doc/refman/5.1/en/fulltext-boolean.html. Basically, you can use wildcards, but only at the end of a search term, not in front. I can see that being a barrier to the OPs search problem if the spaces are all taken out
Joshua
+1  A: 

Nope.

You could make an index table if you like. It'll be a bit expensive, but perhaps it's worthwhile.

So you could turn a phone number: 2125551212 in to a zillion references based on unique substrings and build an inverted index from that:

1
2
5
12
21
25
51
55
121
125
212
255
512
551
555
1255
2125
2555
5121
5512
5551
12555
21255
25551
55121
55512
125551
212555
255512
555121
1255512
2125551
2555121
12555121
21255512
212555121
2125551212

So, for example:

create table myindex (
    key varchar(10) not null,
    datarowid integer not null references datarows(id)
);
create index i1myindex(key);
insert into myindex values('1255', datarow.id);

Depending on how deep you want to go.

For example, you could go only 4 deep, and then scan the results of those with the 4 numbers.

So, for example, if you have "%123456%", you can ask for keys with "1234" and then apply the full expression on the result set.

Like:

select d.* from datarows d, myindex i where i.datarowid = d.id and i.key = '1234' and d.phone like "%123456%";

The index should help you narrow down the lot quite quickly and the db will scan the remainder.

Obviously, you'll be generating some data here, but if you query a lot, you could some performance here.

Will Hartung