tags:

views:

81

answers:

4

I have a mysql table containing 40 million records that is being populated by a process over which I have no control. Data is added only once every month. This table needs to be search-able by the Name column. But the name column contains the full name in the format 'Last First Middle'.

In the sphinx.conf, I have

sql_query = SELECT Id, OwnersName,
substring_index(substring_index(OwnersName,' ',2),' ',-1) as firstname, 
substring_index(OwnersName,' ',2) as lastname
FROM table1

How do I use sphinx search to search by firstname and/or lastname? I would like to be able to search for 'Smith' in only the first name?

+4  A: 

Per-row functions in SQL queries are always a bad idea for tables that may grow large. If you want to search on part of a column, it should be extracted out to its own column and indexed.

I would suggest, if you have power over the schema (as opposed to the population process), inserting new columns called OwnersFirstName and OwnersLastName along with an update/insert trigger which extracts the relevant information from OwnersName and populats the new columns appropriately.

This means the expense of figuring out the first name is only done when a row is changed, not every single time you run your query. That is the right time to do it.

Then your queries become blindingly fast. And, yes, this breaks 3NF, but most people don't realize that it's okay to do that for performance reasons, provided you understand the consequences. And, since the new columns are controlled by the triggers, the data duplication that would be cause for concern is "clean".

Most problems people have with databases is the speed of their queries. Wasting a bit of disk space to gain a large amount of performance improvement is usually okay.

If you have absolutely no power over even the schema, another possibility is to create your own database with the "correct" schema and populate it periodically from the real database. Then query yours. That may involve a fair bit of data transfer every month however so the first option is the better one, if allowed.

paxdiablo
@Pax Can you eloborate this further after splitting by firstname and lastname? Do I setup an individual index for firstname and lastname?
Shoan
If you want to search on them individually (and it sounds like you do), then yes - one index each.
paxdiablo
A: 

You could use substring to get the parts of the field that you want to search in, but that will slow down the process. The query can not use any kind of index to do the comparison, so it has to touch each record in the table.

The best would be not to store several values in the same field, but put the name components in three separate fields. When you store more than one value in a fields it's almost always some problems accessing the data. I see this over and over in different forums...

Guffa
A: 

This is an intractable problrm because fulll names can contains prefixes, suffixes, middle names and no middle names, composite first and last names with and without hyphens, etc. There is no reasonable way to do this with 100% reliability

Charles Bretana
Yes, but as mentioned in the question, the format for the field is 'Last First Middle'.
Shoan
+1  A: 

Judging by the other answers, I may have missed something... but to restrict a search in Sphinx to a specific field, make sure you're using the extended (or extended2) match mode, and then use the following query string: @firstname Smith.

pat