I have a bunch (750K) of records in one table that I have to check for in another table. The second table has millions of records, and the data is something like this:

Source table
9999-A1B-1234X, with the middle part potentially being longer than three digits

Target table
DescriptionPhrase9999-A1B-1234X(9 pages) - yes, the parens and the words are in the field.

Currently I'm running a .NET app that loads the source records, then runs through them and searches with a LIKE (using a T-SQL function) to determine if there are any matching records. If yes, the source table is updated with a positive. If not, the record is left alone.

The app processes about 1,000 records an hour. When I did this as a cursor sproc on SQL Server, I got pretty much the same speed.

Any ideas if regular expressions or any other methodology would make it go faster?

A: 

First thing is to make sure you have an index on that column in the searched table. Second is to do the LIKE without a % sign on the left side. Check the execution plan to make sure you are not doing a table scan for every row.
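
For illustration only (the index name is made up, and this only pays off when the pattern has no leading wildcard):

-- Illustrative sketch: targettable/title come from elsewhere in this thread.
-- An index seek is only possible when the LIKE pattern does not start with %.
create index IX_targettable_title on targettable (title)

select recId from targettable
where title like '1999-A1C-12345%'     -- sargable: can seek the index
-- where title like '%1999-A1C-12345%' -- leading % forces a scan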

As le dorfier correctly pointed out, there is little hope if you are using a UDF.

Otávio Décio
They're using a TSQL function, which alone will make optimization impossible. Even a table scan will outperform a cursor by orders of magnitude.
le dorfier
Yikes. You are correct, never mind.
Otávio Décio
+3  A: 

First, I would redesign if at all possible. Better to add a column that contains the correct value and be able to join on it. If you still need the long one, you can use a trigger to extract the data into the column at the time it is inserted.
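
A rough sketch of that trigger idea (the CaseNo column, the recId key, and the parsing rule are all illustrative; it assumes the case number always starts with four digits and ends before the '('):

-- Illustrative only: adds an assumed CaseNo column to the target table and
-- fills it on insert by grabbing everything from the first "nnnn-" pattern
-- up to the '(' in the title.
alter table targettable add CaseNo varchar(50) null
go
create trigger trg_targettable_ExtractCaseNo on targettable
after insert
as
begin
    update t
    set CaseNo = substring(t.title,
                           patindex('%[0-9][0-9][0-9][0-9]-%', t.title),
                           charindex('(', t.title + '(')
                             - patindex('%[0-9][0-9][0-9][0-9]-%', t.title))
    from targettable t
    join inserted i on i.recId = t.recId
    where patindex('%[0-9][0-9][0-9][0-9]-%', t.title) > 0
end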

If you have data you can match on, you need neither LIKE '%somestuff%' (which can't use indexes) nor a cursor, both of which are performance killers. This should be a set-based task if you have designed properly. If the design is bad and can't be changed to a good design, I see no good way to get good performance using T-SQL, and I would attempt the regular expression route. Not knowing how many different phrases there are and the structure of each, I can't say whether the regular expression route would be easy or even possible. But short of a redesign (which I strongly suggest you do), I don't see another possibility.

BTW, if you are working with tables that large, I would resolve never to write another cursor. They are extremely bad for performance, especially when you start talking about that many records. Learn to think in sets, not record-by-record processing.

HLGEM
Notice he has managed to find another technique to make it as slow as a cursor. That's at least one type of accomplishment. I suspect it's the search involving a UDF.
le dorfier
+5  A: 

What about doing it all in the DB, rather than pulling records into your .Net app:

UPDATE s SET some_field = 1
FROM source_table s
WHERE EXISTS
(
     SELECT target_join_field FROM target_table t
     WHERE t.target_join_field LIKE '%' + s.source_join_field + '%'
)

This will reduce the total number of queries from 750k update queries down to 1 update.

mbeckish
A: 

There are lots of ways to skin this cat. I would think that first it would be important to know whether this is a one-time operation or a task that needs to be repeated regularly.

Not knowing all the details of your problem, if it were me, and this was a one-time (or infrequent) operation, which it sounds like it is, I'd probably extract just the pertinent fields from the two tables, including the primary key from the source table, and export them to a local machine as text files. The file sizes will likely be significantly smaller than the full tables in your database.

I'd run it locally on a fast machine using a routine written in something like C/C++ or another "lightweight" language with raw processing power, and write out a table of primary keys that "match", which I would then load back into SQL Server and use as the basis of an update query (i.e. update the source table where id in (select id from temp table)).
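
For the load-back step, the update might look something like this (the #MatchedIds temp table and the id / ContainsBit column names are assumptions):

-- Assumes the primary keys matched by the external routine have been
-- bulk loaded into #MatchedIds.
create table #MatchedIds (id int primary key)
-- bulk insert #MatchedIds from '<path to matched-keys file>' ...

update s
set ContainsBit = 1
from SourceTable s
join #MatchedIds m on m.id = s.id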

You might spend a few hours writing the routine, but it would run in a fraction of the time you are seeing in sql.

By the sound of your SQL, you may be trying to do 750,000 table scans against a multi-million-record table.

Tell us more about the problem.

EJB
+1  A: 

Try this --

update t1
set ContainsBit = 1
from SourceTable t1
  join dbo.TargetTable t2
    on charindex(t1.SourceField, t2.TargetField) > 0
Dan R.
Still slow, but much less so than a cursor or routing through a client app.
Joel Coehoorn
A: 

Holy smoke, what great responses!

The system is on a disconnected network, so I can't copy/paste, but here's the retyped version.

Current UDF:

Create function CountInTrim
(@caseNo varchar(255))
 returns int
as
Begin
    declare @recCount int
    select @recCount = count(recId) from targettable where title like '%' + @caseNo + '%'
    return @recCount
end

Basically, if there's a record count, then there's a match, and the .net app updates the record. The cursor based sproc had the same logic.

Also, this is a one-time process, determining which entries in a legacy record/case management system migrated successfully into the new system, so I can't redesign anything. Of course, the developers of either system are no longer available, and while I have some SQL experience, I am by no means an expert.

I parsed the case numbers out of the crazy format the old system used to build the source table, and that's the only thing the two systems have in common: the case number format. I COULD attempt to parse out the case number in the new system, then run matches against the two sets, but with a possible set of data like:

DescriptionPhrase1999-A1C-12345(5 pages)
Phrase/Two2000-A1C2F-5432S(27 Pages)
DescPhraseThree2002-B2B-2345R(8 pages)

Parsing that became a bit more complex so I thought I'd keep it simpler.

I'm going to try the single update statement, then fall back to regex in the CLR if needed.

I'll update the results. And, since I've already processed more than half the records, that should help.

kaplooeymom
What I'd do at this point is to continue doing what you've started - iterate through with various approximations, each catching more records.
le dorfier
A: 

One thing to be aware of with a single update (mbeckish's answer) is that the transaction log (which enables a rollback if the query is cancelled) will be huge. This will drastically slow down your query. It is probably better to process the rows in blocks of 1,000 or so.
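
A rough batching sketch along those lines, reusing mbeckish's EXISTS condition (column names such as ContainsBit and SourceField are assumptions, and it presumes the flag starts out NULL):

-- Illustrative only: updates 1,000 unflagged rows per pass so each
-- transaction (and its log usage) stays small.
declare @rows int
set @rows = 1

while @rows > 0
begin
    update top (1000) s
    set ContainsBit = 1
    from SourceTable s
    where s.ContainsBit is null
      and exists (select 1 from TargetTable t
                  where t.title like '%' + s.SourceField + '%')

    set @rows = @@ROWCOUNT
end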

Also, the condition (b.field like '%' + a.field + '%') will need to check every single record in b (millions) for every record in a (750,000). That equates to more than 750 billion string comparisons. Not great.

The gut-feel "index stuff" won't help here either. An index keeps things in order, so the first character(s) dictate the position in the index, and those aren't the characters you're interested in.

First Idea

For this reason I would actually consider creating another table, and parsing the long/messy value into something nicer. An example would be to just strip off any text from the last '(' onwards (this assumes all the values follow that pattern). This would simplify the query condition to (b.field LIKE '%' + a.field).

Still, an index wouldn't help here either, as the important characters are at the end. So, bizarrely, it could well be worthwhile storing the characters of both tables in reverse order. The index on your temporary table would then come into play.

It may seem very wasteful to spend that much time, but in this case a small effort would yield a great reward. (A few hours' work to halve the comparisons from 750 billion to 375 billion, for example. And if you can get the index into play you could reduce this a thousand-fold, since index lookups are tree searches, not just scans of ordered tables...)
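
A sketch of both tricks together (the temp table, column names, and the "strip from the '(' onwards" rule are all assumptions):

-- Illustrative: copy the target into a temp table with the '(n pages)' text
-- stripped and the rest reversed, so the case number ends up at the front
-- where an index can be used.
select recId,
       reverse(left(title, charindex('(', title + '(') - 1)) as RevTitle
into   #Target
from   targettable

create index IX_Target_RevTitle on #Target (RevTitle)

-- the wildcard is now only on the right-hand side
select s.*
from   SourceTable s
where  exists (select 1 from #Target t
               where t.RevTitle like reverse(s.SourceField) + '%')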

Second Idea

Assuming you do copy the target table into a temp table, you may get an extra benefit from processing the source in blocks of 1,000 by also deleting the matching records from the target table. (This is only worthwhile if you delete a meaningful amount from the target table, such that after all 750,000 records have been checked the target table is, for example, half the size it started at.)

EDIT:
Modified Second Idea

  1. Put the whole target table in to a temp table.

  2. Pre-process the values as much as possible to make the string comparison faster, or even bring indexes in to play.

  3. Loop through each record from the source table one at a time. Use the following logic in your loop...

    DELETE target WHERE field LIKE '%' + @source_field + '%'

    IF (@@ROWCOUNT = 0)
        [no matches]
    ELSE
        [matches]

The continuous deleting makes the query faster on each loop, and you're only running one query against the data (instead of one to find the matches and a second to delete them).
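
A skeleton of that loop, without a cursor (the id / SourceField / ContainsBit names and the #target temp table are assumptions):

-- Skeleton only: walks the source table by key, deleting matches from the
-- shrinking #target copy and flagging the source row based on @@ROWCOUNT.
declare @id int, @source_field varchar(255)

select @id = min(id) from SourceTable
while @id is not null
begin
    select @source_field = SourceField from SourceTable where id = @id

    delete #target where field like '%' + @source_field + '%'

    if @@ROWCOUNT = 0
        update SourceTable set ContainsBit = 0 where id = @id   -- no matches
    else
        update SourceTable set ContainsBit = 1 where id = @id   -- matches

    select @id = min(id) from SourceTable where id > @id
end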

Dems
A: 

Try Dan R's update query from above:

update t1
set ContainsBit = 1
from SourceTable t1
  join dbo.TargetTable t2
    on charindex(t1.SourceField, t2.TargetField) > 0

Alternatively, if the timeliness of this is important and this is SQL Server 2005 or later, then this would be a classic use for a calculated column using SQL CLR code with regular expressions - no need for a standalone app.
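
If you go that route, the T-SQL side might look roughly like this (dbo.ExtractCaseNo is a hypothetical CLR scalar function wrapping a .NET Regex; it would have to be written, registered, and marked deterministic first):

-- Hypothetical: dbo.ExtractCaseNo(title) is an assumed CLR UDF that pulls out
-- the case number with a regex such as \d{4}-[A-Z0-9]+-\d+[A-Z]. A computed
-- column built on a CLR function must be PERSISTED before it can be indexed.
alter table TargetTable add CaseNo as dbo.ExtractCaseNo(title) persisted

create index IX_TargetTable_CaseNo on TargetTable (CaseNo)

update s
set ContainsBit = 1
from SourceTable s
join TargetTable t on t.CaseNo = s.SourceField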

dan