I currently have a project where we are dealing with 30 million+ keywords for PPC advertising. We maintain these lists in Oracle. There are times when we need to remove certain keywords from the list. The process uses one of several match-type policies to determine whether a keyword should be removed:

  • EXACT: WHERE keyword = '{term}'
  • CONTAINS: WHERE keyword LIKE '%{term}%'
  • TOKEN: WHERE keyword LIKE '% {term} %' OR keyword LIKE '{term} %' OR keyword LIKE '% {term}'

Now, when a list is processed, it can only use one of the match types listed above, but all 30 million+ keywords must be scanned for matches and the matching rows returned. Currently, this process can take hours or days depending on the number of keywords in the list of keywords to search for.

Do you have any suggestions on how to optimize the process so this will run much faster?

UPDATE: Here is an example query to search for Holiday Inn:

SELECT * FROM keyword_list 
WHERE
(
lower(text) LIKE 'holiday inn' OR
lower(text) LIKE '% holiday inn %' OR
lower(text) LIKE 'holiday inn %'
);

Here is the pastebin for the output of EXPLAIN: http://pastebin.com/tk74uhP4

Some additional information that may be useful. A keyword can consist of multiple words like:

  • this is a sample keyword
  • i like my keywords
  • keywords are great
+5  A: 

Never use a LIKE match starting with "%" on large sets of data - it cannot use the table index on that field and will do a full table scan. This is your source of slowness.

The only matches that can use the index are the ones starting with a hardcoded string (e.g. keyword LIKE '{term} %').

To work around this problem, create a new indexing table (not to be confused with the database's table index) mapping individual terms to the keyword strings containing those terms; then your keyword LIKE '% {term} %' becomes t1.keyword = index_table.keyword AND index_table.term = '{term}'.
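
Something along these lines, as a minimal sketch - the table and column names are made up, and keyword_list.text is taken from the example query above:

-- Hypothetical indexing table: one row per (term, keyword) pair.
CREATE TABLE keyword_term_index (
  term    VARCHAR2(100),
  keyword VARCHAR2(1000),
  PRIMARY KEY (term, keyword)
);

-- A TOKEN-style search then becomes an indexed equality lookup
-- instead of LIKE '% {term} %':
SELECT k.*
  FROM keyword_list k
  JOIN keyword_term_index i ON i.keyword = k.text
 WHERE i.term = 'holiday';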

DVK
I don't follow what you mean. How does the mapping take place? Is it just a duplicate of the keyword table with an index on the text of the keyword?
cdburgess
I think that DVK is assuming that your stored keywords consist of one or more words, e.g. "Sample Keyword One", and that by "term" you mean "word". So he is suggesting creating a table that would contain each word with a link back to the keyword(s) that contain it. This might be appropriate for your EXACT and TOKEN cases, but not for your CONTAINS case, since it looks like in that case you want to match any substring, even if it is not a complete word.
Dave Costa
@Dave: Thanks for the clarification. That makes complete sense. Yes, you are correct that CONTAINS could potentially match a substring of a word. But this is a great idea to resolve the other issues.
cdburgess
Will this work if the term I am looking for is `holiday inn`?
cdburgess
@cdburgess - you might need much more involved logic than the example in my answer, but yes. And for substrings, you would need to store substrings in the indexing table instead of words.
DVK
+2  A: 

The info is insufficient to give any concrete advice. If the expensive LIKE matching is unavoidable then the only thing I see at the moment is this:

Currently, this process can take hours/days to process depending on the number of keywords in the list of keywords to search for.

Have you tried to cache the results of the queries in a table? Keyed by the input keyword?

Because I do not believe that the whole data set, all 30M keywords, can change overnight. And since they do not change very often, it makes sense to simply keep the results precomputed in an extra table, so that future queries for a keyword can be resolved via the cache instead of going over the 30M entries again. Obviously, some sort of periodic maintenance has to be done on the cache table: when keywords are modified/deleted or when the lists are modified, the cache entries have to be updated or recomputed. To simplify the update, one would also keep in the cache table the IDs of the original rows in the keyword_list table which contributed the results.
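
As a minimal sketch of such a cache table - the names are illustrative, and it assumes keyword_list has a numeric id column:

-- Hypothetical cache: precomputed matches keyed by the input search term.
CREATE TABLE keyword_match_cache (
  search_term     VARCHAR2(200),
  keyword_list_id NUMBER,   -- assumed primary key of keyword_list
  PRIMARY KEY (search_term, keyword_list_id)
);

-- Later runs for the same term read the cache instead of scanning 30M+ rows:
SELECT k.*
  FROM keyword_list k
  JOIN keyword_match_cache c ON c.keyword_list_id = k.id
 WHERE c.search_term = 'holiday inn';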


To the UPDATE: Insert data into the keyword_list table already lower-cased. Use an extra column if the original case is needed later.


In the past I have participated in the design of an ad system. I do not remember all the details, but the most striking difference is that we tokenized everything and gave every unique word an id. And keywords were not free form - they were also in a DB table and also tokenized. So we never actually matched the keywords as strings; queries looked like:

select AD.id
from DICT, AD
where 
  DICT.word = :input_word and
  DICT.word_id = AD.word_id

DICT is a table of the words, and AD (the analogue of your keyword_list) holds the words from the ads.
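
Reconstructed from that query, the schema would look roughly like this (column sizes and constraints are assumptions):

-- Dictionary of unique words, each with a numeric id.
CREATE TABLE DICT (
  word_id NUMBER PRIMARY KEY,
  word    VARCHAR2(100) NOT NULL UNIQUE
);

-- One row per (ad, word) pair instead of free-form keyword strings.
CREATE TABLE AD (
  id      NUMBER,
  word_id NUMBER REFERENCES DICT (word_id),
  PRIMARY KEY (id, word_id)
);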

Essentially one can summarize the problem you experience as "full table scan". This is a pretty common issue, often highlighting poor design of the data layout. Search the net for more information on what can be done. SO has many entries on it too.

Dummy00001
+2  A: 

I think the problem is one of how the keywords are stored. If I'm interpreting your code correctly, the KEYWORD column is made up of a string of blank-separated keyword values, such as

KEYWORD1 KEYWORD2 KEYWORD3

Because of this you're forced to use LIKE to do your searches, and that's probably the source of the slowness.

Although I realize this may be somewhat painful, it might be better to create a second table, perhaps called KEYWORDS, which would contain the individual keywords that relate to a given base-table record (I'll refer to the base table as PPC since I don't know what it's really called). Assuming that your current base table looks like this:

CREATE TABLE PPC
 (ID_PPC       NUMBER PRIMARY KEY,
  KEYWORD      VARCHAR2(1000),
  <other fields>...);

What you could do would be to rebuild the tables as follows:

CREATE TABLE NEW_PPC
 (ID_PPC       NUMBER PRIMARY KEY,
  <other fields>...);

CREATE TABLE NEW_PPC_KEYWORD
 (ID_NEW_PPC       NUMBER,
  KEYWORD      VARCHAR2(25),  -- or whatever is appropriate for a single keyword
  PRIMARY KEY (ID_NEW_PPC, KEYWORD));

CREATE INDEX NEW_PPC_KEYWORD_1
  ON NEW_PPC_KEYWORD(KEYWORD);

You'd populate the NEW_PPC_KEYWORD table by pulling the individual keywords out of the old PPC.KEYWORD field and putting them into NEW_PPC_KEYWORD (see the population sketch after the query below). With only one keyword in each record in NEW_PPC_KEYWORD you could now use a simple join to pull all the records in NEW_PPC which had a keyword by doing something like

SELECT P.*
  FROM NEW_PPC P
INNER JOIN NEW_PPC_KEYWORD K
  ON (K.ID_NEW_PPC = P.ID_NEW_PPC)
WHERE K.KEYWORD = '<whatever>';
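
One possible way to do the split entirely in SQL, as a rough sketch - it assumes Oracle 11g for REGEXP_COUNT and caps keywords at 50 words, both of which are assumptions:

-- Split each blank-separated PPC.KEYWORD string into individual words
-- and load them into NEW_PPC_KEYWORD (one row per word).
INSERT INTO NEW_PPC_KEYWORD (ID_NEW_PPC, KEYWORD)
SELECT DISTINCT p.ID_PPC,
       REGEXP_SUBSTR(p.KEYWORD, '[^ ]+', 1, n.pos)
  FROM PPC p
  JOIN (SELECT LEVEL AS pos FROM dual CONNECT BY LEVEL <= 50) n
    ON n.pos <= REGEXP_COUNT(p.KEYWORD, '[^ ]+');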

Share and enjoy.

Bob Jarvis
Note that in the CONTAINS case, he's looking for any substring match
Dave Costa
Yes Dave. However, this could speed up the other queries.
cdburgess
How will this handle a multiple term search: `WHERE k.keyword = 'holiday inn'` for example?
cdburgess
@cdburgess: the response to that would cost you the price of a license for one of the professional ad engines. LOL. In the scheme depicted by Bob above, you would add 'holiday inn' as a row in the `NEW_PPC_KEYWORD` table. High performance + large data set == no trivial solutions fitting the SO response format.
Dummy00001
+1  A: 

You could look into Oracle Text indexing. It is designed to support the kind of in-text search you are talking about.
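
A rough sketch of what that could look like on the example table - the index name is arbitrary, and a CONTEXT index needs periodic synchronization after DML, which matters for a table that changes often:

-- Oracle Text CONTEXT index on the keyword text.
CREATE INDEX keyword_list_text_ctx ON keyword_list (text)
  INDEXTYPE IS CTXSYS.CONTEXT;

-- Phrase search uses the text index instead of a full scan with LIKE.
SELECT *
  FROM keyword_list
 WHERE CONTAINS(text, 'holiday inn') > 0;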

Dave Costa
We have actually considered this as an option. However, with the other tasks we perform on the table, it slows the other processes down considerably. This table is constantly in flux.
cdburgess
+3  A: 

I know that my approach can look like heresy to RDBMS guys, but I have verified it many times in practice and there is no magic. One just needs to know a little bit about feasible IO and processing rates and do some simple calculation. In short, an RDBMS is not the right tool for this sort of processing.

From my experience, Perl can do a regexp scan at roughly millions of rows per second. I don't know how fast you can dump the data from the database (MySQL manages up to 200k rows/s, so you could dump all your keywords in about 2.5 minutes; I know Oracle is much worse here, but I hope it is not more than ten times slower, i.e. about 25 minutes). If your rows average 20 characters your dump will be about 600MB; at 100 characters it is 3GB. That means that even with a slow 100MB/s HD your IO will take from 6s to 30s (all the IO involved is sequential!), which is almost nothing compared with the time of the dump and the processing in Perl. Your scan can slow down to 100k rows/s depending on the number of keywords you want to remove (I have seen a regexp with 500 branching patterns run at this speed), so you can process the resulting data in less than 5 minutes. If the resulting cardinality is not huge (in tens of hundreds), output IO should not be a problem. Either way, your processing should take minutes, not hours.

If you generate whole keyword values for deletion you can use an index in the delete operation: generate a series of DELETE FROM <table> WHERE keyword IN (...) statements stuffed with keywords to remove, up to the maximal length of an SQL statement. You can also try a variant where you upload this data to a temporary table and then use a join. I don't know which would be faster in Oracle; it would take about 10 minutes in MySQL. You are unlucky to have to deal with Oracle, but you should be able to remove hundreds of {term}s in less than an hour.
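
A sketch of the temp-table variant of that final delete step - the table names are illustrative, and keyword_list.text comes from the example query in the question:

-- Stage the generated list of keywords to remove.
CREATE GLOBAL TEMPORARY TABLE keywords_to_remove (
  keyword VARCHAR2(1000)
) ON COMMIT PRESERVE ROWS;

-- Bulk-load keywords_to_remove (SQL*Loader, array inserts, etc.), then
-- delete via a semi-join against the staged list:
DELETE FROM keyword_list
 WHERE text IN (SELECT keyword FROM keywords_to_remove);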

P.S.: I would recommend using something with better regular expressions, like http://code.google.com/p/re2/ (included in V8, a.k.a. node.js) or the new binary module in Erlang R14A, but even Perl's comparatively weak regexp engine would not be the weak point in this task - the RDBMS would be.

Hynek -Pichi- Vychodil
Thanks for the suggestion. We have toyed with this idea as well. While it may not please a DBA, it could prove to be much faster.
cdburgess
SQL lover that I am, I have to agree: sometimes a custom program is the way to go.
Justin K
I agree. Making an optimal design in SQL is quite complicated, while it is very easy in normal programming languages.
Dummy00001
Can't perl connect to Oracle? Why do you have to dump the data to a flat file?
TTT
@TTT: You should try it first. It is much slower than the proposed approach. Try it and then compare. You will learn a lot.
Hynek -Pichi- Vychodil
I can't program in Perl so I can't try it. But I fail to understand why Perl's regexps would work faster on data in a flat file than on the same data read from a database. I think I would solve a problem like this by writing a Java stored proc - there is a Java Virtual Machine inside the Oracle database. Or I would use Oracle Text indexing; you don't have to rebuild your text indexes in sync, you can schedule the rebuilds. I don't see what you gain by exporting the data to a flat file unless Perl can't connect to Oracle in a decent way.
TTT
I don't want to learn and use Perl if it can't connect to Oracle in a fast way. I mostly use C#; C# is fast and there are special drivers for connecting to Oracle.
TTT
@TTT: It's simple. Populate your database with 30M rows of data and try it. The Java Virtual Machine inside the Oracle database sounds promising, but you may still be surprised. Try it and tell us the result; that is the best thing you can do.
Hynek -Pichi- Vychodil
I will not try it because I don't know Perl.
TTT
Try C# or Java or whatever you want and then tell us how long it took and on which hardware. I think we can imagine what that means.
Hynek -Pichi- Vychodil
+2  A: 

Your explain plan says this query should take a minute, but it's actually taking hours? A simple test on my home PC verifies that a minute seems reasonable for this query. And on a server with some decent IO this should probably only take a few seconds.

Is the problem that you're running the same query dozens of times sequentially for different keywords? If so, you need to combine all the searches together so you only scan the table once.
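
For example, instead of one statement per term, a single statement with all the terms OR'd together scans the table only once (the extra terms here are purely illustrative):

-- One full scan of keyword_list covering a whole batch of search terms.
SELECT *
  FROM keyword_list
 WHERE lower(text) LIKE '%holiday inn%'
    OR lower(text) LIKE '%marriott%'
    OR lower(text) LIKE '%hilton%';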

jonearles