views:

267

answers:

4

I need to extract information from a text field which can contain one of many values. The SQL looks like:

SELECT fieldname
FROM table
WHERE bigtextfield LIKE '%val1%'
   OR bigtextfield LIKE '%val2%'
   OR bigtextfield LIKE '%val3%'
   .
   .
   .
   OR bigtextfield LIKE '%valn%'

My question is: how efficient is this when the number of values approaches the hundreds, and possibly thousands? Is there a better way to do this?

One solution would be to create a new table/column with just the values I'm after and doing the following:

SELECT fieldname
FROM othertable
WHERE value IN ('val1', 'val2', 'val3', ... 'valn')

Which I imagine is a lot more efficient as it only has to do exact string matching. The problem with this is that it will be a lot of work keeping this table up to date.

btw I'm using MS SQL Server 2005.

+4  A: 

This functionality is already present in most SQL engines, including MS SQL Server 2005. It's called full-text indexing; here are some resources:

intgr
How will it help to search with leading wildcards?
Quassnoi
It doesn't, but the author already provided a rudimentary "full text index" as a satisfactory solution in his question: `SELECT fieldname FROM othertable WHERE value IN ('val1', 'val2', 'val3', ... 'valn')`
intgr
When I just read it I was left with impression that the `@op` wanted to match something like 'nval12' with '%val1%'. But now I see this is probably not the case and what he needs in fact is denormalizing the table. +1, since in this case your answer is the most efficient way to deal with the current structure.
Quassnoi
A: 

This will inevitable require a fullscan (over the table or over an index) with a filter.

The IN condition won't help here, since it does not work on LIKE

You could do something like this:

SELECT  *
FROM    master
WHERE   EXISTS
        (
        SELECT  NULL
        FROM    values
        WHERE   name LIKE '%' + value + '%'
        )

, but this hardly be more efficient.

All literal conditions will be transformed to a CONSTANT SCAN which is just like selecting from the same table, but built in memory.

Quassnoi
+1  A: 

I don't think the main problem is the number of criteria values - but the sheer fact that a WHERE clause with bigtextfield LIKE '%val1%' can never really be very efficient - even with just a single value.

The trouble is the fact that if you have a placeholder like "%" at the beginning of your search term, all the indices are out the window and cannot be used anymore.

So you're basically just searching each and every entry in your table doing a full table scan in the process. Now your performance basically just depends on the number of rows in your table....

I would support intgr's recommendation - if you need to do this frequently, have a serious look at fulltext indexing.

marc_s
A: 

The best solution for this is to redesign and get rid of that field that is storing multiple values and make it a related table instead. This violates one of the first rules of database design.

You should not be storing multiple values in one field and dead slow queries like this are the reason why. If you can't do that then full-test indexing is your only hope.

HLGEM
I didn't design the database, I can see how the current design is flawed, I was just wondering if I had any other option but to redo it!
Matt Grum