views:

436

answers:

3

I have a database of names of people that has (currently) 35 million rows. I need to know what is the best method for quickly searching these names. The current system (not designed by me), simply has the first and last name columns indexed and uses "LIKE" queries with the additional option of using SOUNDEX (though I'm not sure this is actually used much). Performance has always been a problem with this system, and so currently the searches are limited to 200 results (which still takes too long to run). So, I have a few questions:

  1. Does full text index work well for proper names?
  2. If so, what is the best way to query proper names? (CONTAINS, FREETEXT, etc)
  3. Is there some other system (like Lucene.net) that would be better?

Just for reference, I'm using Fluent NHibernate for data access, so methods that work will with that will be preferred. I'm using SQL Server 2008 currently.

EDIT I want to add that I'm very interested in solutions that will deal with things like commonly misspelled names, eg 'smythe', 'smith', as well as first names, eg 'tomas', 'thomas'.

Query Plan

  |--Parallelism(Gather Streams)
       |--Nested Loops(Inner Join, OUTER REFERENCES:([testdb].[dbo].[Test].[Id], [Expr1004]) OPTIMIZED WITH UNORDERED PREFETCH)
            |--Hash Match(Inner Join, HASH:([testdb].[dbo].[Test].[Id])=([testdb].[dbo].[Test].[Id]))
            |    |--Bitmap(HASH:([testdb].[dbo].[Test].[Id]), DEFINE:([Bitmap1003]))
            |    |    |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([testdb].[dbo].[Test].[Id]))
            |    |         |--Index Seek(OBJECT:([testdb].[dbo].[Test].[IX_Test_LastName]), SEEK:([testdb].[dbo].[Test].[LastName] >= 'WHITDþ' AND [testdb].[dbo].[Test].[LastName] < 'WHITF'),  WHERE:([testdb].[dbo].[Test].[LastName] like 'WHITE%') ORDERED FORWARD)
            |    |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([testdb].[dbo].[Test].[Id]))
            |         |--Index Seek(OBJECT:([testdb].[dbo].[Test].[IX_Test_FirstName]), SEEK:([testdb].[dbo].[Test].[FirstName] >= 'THOMARþ' AND [testdb].[dbo].[Test].[FirstName] < 'THOMAT'),  WHERE:([testdb].[dbo].[Test].[FirstName] like 'THOMAS%' AND PROBE([Bitmap1003],[testdb].[dbo].[Test].[Id],N'[IN ROW]')) ORDERED FORWARD)
            |--Clustered Index Seek(OBJECT:([testdb].[dbo].[Test].[PK__TEST__3214EC073B95D2F1]), SEEK:([testdb].[dbo].[Test].[Id]=[testdb].[dbo].[Test].[Id]) LOOKUP ORDERED FORWARD)

SQL for above:

SELECT * FROM testdb.dbo.Test WHERE LastName LIKE 'WHITE%' AND FirstName LIKE 'THOMAS%'

Based on advice from Mitch, I created an index like this:

CREATE INDEX IX_Test_Name_DOB
ON Test (LastName ASC, FirstName ASC, BirthDate ASC)
INCLUDE (and here I list the other columns)

My searches are now incredibly fast for my typical search (last, first, and birth date).

A: 

If you create an index on the first name and last name columns, then exact match searches and prefix searches using LIKE will become blazingly fast.

(In MySQL, "The index also can be used for LIKE comparisons if the argument to LIKE is a constant string that does not start with a wildcard character." I think MS SQL has a similar rule, but check the MS SQL documentation to be sure.)

To speed up SoundEx searches, store the SoundEx version of the first name and last name new columns, and create indices on those columns.

Ken Bloom
+3  A: 

Depends what your LIKE queries look like.

If you are searching for LIKE '%abc%' then no index can be utilised, whereas when searching for LIKE 'abc%' an index can be used. Also, if the index(es) on First and Last name are not 'covering' the emitted query then key lookups (Bookmark Lookups) will be performed and significantly impact performance.

Are your indexes rebuilt regularly?

Do you have an example query plan?

Update: A covering index for a query is one which can be used to perform the WHERE criteria and also has all of the columns required to satisfy the rest of the query such as the SELECT column list.

Using Covering Indexes to Improve Query Performance

Update: Even if you create a composite index on (Lastname, Firstname) (since lastname should be more selective), a lookup for all the other columns (the '*' column list) will still be required into the tables clustered index.

Mitch Wheat
Indexes will be rebuilt regularly, probably weekly. I'm adding records at the rate of roughly 5,000 per day.Ha, looks like the current system isn't using "LIKE" at all, evidently too slow. So, I'd say 'abc%' ought to be an improvement.
Matthew Talbert
What do you mean by 'covering'?
Matthew Talbert
This is really helpful, Mitch. I'm working on getting an example query plan for you. So, should I create a single index that contains all of the columns I'm interested in?
Matthew Talbert
I've added the query execution plan. Hopefully it's what you need.
Matthew Talbert
@Matthew Talbert: well, it's a trade-off, and depends on several factors. Wide indexes are generally not a good idea. You can use SQL Server 2005 onwards INCLUDE part of a CREATE INDEX definition to create a covering index.
Mitch Wheat
+1  A: 

I don't like soundex much. I think newer iterations of the algorithm are better, but you are hashing every word in the English language down to a fairly small hash. This tends to generate a ton of false matches over time. I've read that metaphone and it's successor double metaphone are better, but I don't have direct experience with them.

Mitch's coverage of like is pretty thorough, so I'm not going to repeat it.

Donnie
Thanks for the info on soundex.
Matthew Talbert