I'm writing a program that matches a user-submitted query against a list of keywords. The list has about 2,000 words, and performance is the most important factor.

Original question:

Is it faster to store this list in a SQL table or to hard-code it in the source code? The list does not need to be updated often.

If a SQL table is faster, which data types would be best? (int, nvarchar?)

If a hardcoded list is faster, what data type would be best? (A List?)

Any suggestions?

Edit: What is the best in-memory data structure for fast lookups?

A: 

The list does not need to be updated often

I say that if it ever needs to be updated, it does not belong in the source code.

Marek Karbarz
maybe some minor changes once a month
program10
@program10: Marek K is right. Especially if it's once per month; that is far too frequent for the list to be embedded in the source code.
Jason
That's a lot of source maintenance in the long run, and completely unnecessary downtime whenever the new version is deployed. As codymanix has mentioned, just load it from SQL and cache it; that's a much more maintainable option than having to modify the source code just to update string values.
Marek Karbarz
+4  A: 

Where you store this data has no effect on lookup performance.

When your program starts, load the string array once from whichever data store holds it. You can then use that array the whole time, until the program exits.
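As a sketch of this load-once approach (in Python for brevity, since the asker is new to C#; the file name and the helper names are hypothetical, and the same load-at-startup shape carries over directly to a C# `HashSet<string>`):

```python
# Minimal sketch: load the keyword list once at startup, keep it in memory,
# and reuse it for every query until the program exits.

def load_keywords(path):
    """Read one keyword per line into a frozenset for O(1) average lookups."""
    with open(path, encoding="utf-8") as f:
        return frozenset(line.strip().lower() for line in f if line.strip())

def matches(query, keywords):
    """Case-insensitive membership test against the cached keyword set."""
    return query.strip().lower() in keywords
```

The point is that the expensive step (reading the file, or hitting SQL) happens exactly once; every query after that is an in-memory lookup.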

codymanix
This is the correct answer. Where he persists the data is effectively irrelevant. The correct question is what is the best in-memory data structure for fast lookups? An array is fine if it's sorted so that binary search can be used (so now we're `O(log n)` where `n` is the number of keywords). If this isn't fast enough (after profiling!) a trie could be considered.
Jason
Yes, this is pedantic... but isn't a binary search on a set of strings really O(log(N) * log(K)), where N is the number of words and K is the median word length?
Jason D
Worst case it's `O(m log n)` where `m` is the maximum word length and `n` is the number of keywords.
Jason
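To make the complexity discussion above concrete, here is a minimal sketch of binary search over a sorted keyword list: O(log n) comparisons, each comparison costing up to O(m) for strings of length m, giving the O(m log n) worst case mentioned. Python is used for brevity; `bisect_left` plays the role that `Array.BinarySearch` would in .NET.

```python
import bisect

def contains(sorted_keywords, word):
    """Binary search: O(log n) comparisons, each up to O(m) string compares."""
    i = bisect.bisect_left(sorted_keywords, word)
    return i < len(sorted_keywords) and sorted_keywords[i] == word

# The list must be sorted once up front for binary search to be valid.
keywords = sorted(["match", "index", "query", "route"])
```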
A: 

The hardcoded list is faster. A database hit to retrieve the list will undoubtedly be slower than pulling it from an in-memory object.

As for which data type to store the values in, an array would probably be faster and take up less memory than a List, but only trivially so.

Kevin Pang
Not if it's cached locally after an initial retrieval (on startup) from the database.
Jason
@Jason - True for subsequent hits. However, the first hit will be slower.
Kevin Pang
@Kevin Pang: The first hit is irrelevant.
Jason
@Jason Most likely, but you don't know that for sure.
Kevin Pang
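A hedged sketch of the cache-on-first-hit pattern being debated above (Python for brevity; `fetch_from_database` is a hypothetical stand-in for the real database call):

```python
# Only the first call pays the retrieval cost; every later call is in-memory.

_cache = None

def get_keywords(fetch_from_database):
    """Return the keyword set, hitting the backing store only on first use."""
    global _cache
    if _cache is None:                       # first hit: slow path
        _cache = frozenset(fetch_from_database())
    return _cache                            # subsequent hits: memory only
```

Whether that single first hit matters is exactly the disagreement in the comments; for a long-running program it is usually amortized away.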
A: 

If the list is largely static and you can afford to spend some time on preparation (i.e. at application start), you'd probably be best off storing the list of keywords in a text file, then using, say, a B* tree to store the keywords internally (assuming you only care about exact matches and not partial matching or Levenshtein distance).

Simon Righarts
A B*-tree is good if he's doing on-the-fly insertions and deletions, but for a static list a sorted array is fine. The complexity of a B*-tree provides no advantage over a sorted array.
Jason
A B* tree isn't necessarily the right answer, but the reason I suggested a tree structure over a sorted array is speed of searching. Searching a tree is O(log N), whereas searching an array (even a sorted one) is O(N).
Simon Righarts
Where can I read more about partial matching or Levenshtein distance? I might need to use those later.
program10
@Simon Righarts: You might want to rethink that last statement about searching a sorted array.
Jason
@Jason: Sigh. I should make thinking a prerequisite to posting. (Somehow I confused searching an unsorted array, which is O(N), with searching a sorted array, which is O(log N) with binary search.) @program10: http://en.wikipedia.org/wiki/Levenshtein_distance (and the "See also" links at the bottom) should give you a reasonable starting point.
Simon Righarts
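Since program10 asked about it above: Levenshtein distance (the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another) has a short dynamic-programming implementation. A sketch of the classic two-row algorithm, in Python for brevity:

```python
def levenshtein(a, b):
    """Edit distance via dynamic programming: O(len(a) * len(b)) time."""
    prev = list(range(len(b) + 1))           # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]                            # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution (free if equal)
        prev = cur
    return prev[-1]
```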
+4  A: 

IMO, if the list doesn't get updated often, store it in a file (text/XML), then cache it in your application so that subsequent requests are faster.

jerjer
+1 - This is the ideal situation for a cache. Have a look at http://stackoverflow.com/questions/1308354/asp-net-caching-vs-static-variable-for-storing-a-dictionary
Sandy
+2  A: 

Okay, to respond to your edit (and basically lifting my comment into an answer):

  1. Specify in advance the performance that you are expecting.

  2. Code your application against a sorted array, using a binary search to look up keywords in the array. This is very simple to implement and gives decent performance. Then profile to see if it meets the performance that you demand. If it does, move on. The worst-case performance here is O(m log n), where n is the number of keywords and m is the maximum length of your keywords.

  3. If the performance in step two is not acceptable, use a trie (also known as a prefix tree). The expected performance here is O(m), where m is the maximum length of your keywords. Profile to see if this meets your expected performance. If it does not, revisit your performance criteria; they might have been unreasonable.

  4. If you are still not meeting your performance specifications, consider using a hashtable (in .NET, you would use a HashSet<string>). While a hashtable has worse worst-case performance, it could have better average-case performance: if there are no collisions, a hashtable lookup is O(1) once the hash is computed, and computing the hash is O(m), where m is the maximum length of your keywords. This might be faster on average, but probably not noticeably so.

You might even consider skipping directly to the last step (a HashSet is less complex to use than a trie). It all depends on your needs. Tries have the advantage that you can easily spit out the closest matching keyword, for example.

The important thing here is to have a specification of your performance requirements, and to profile! Use the simplest implementation that meets those requirements, for maintainability, readability and implementability (if that's not a word, it is now!).
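As a sketch of step 3 above: a minimal trie with O(m) lookup, where m is the length of the word being looked up. The answer targets C#, but Python keeps the sketch compact; the same node-with-children shape translates directly to a C# class with a `Dictionary<char, Node>`.

```python
# Minimal trie (prefix tree): lookup cost depends only on the word's length,
# not on the number of stored keywords.

class Trie:
    def __init__(self):
        self.children = {}     # char -> child Trie node
        self.terminal = False  # True if a keyword ends exactly at this node

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.terminal = True

    def contains(self, word):
        node = self
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.terminal
```

The `terminal` flag is what distinguishes a stored keyword from a mere prefix of one.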

Jason
Thanks for your detailed response. I'm new to C#, and I was just going to write a regex like `"wordA|wordB|word\sC"` — is that going to be fast enough with 2,000 words? Would you mind providing some sample code for the methods you mentioned? I'm still learning. Thank you.
program10
Do not use a regex like that. It's not the right tool for the job, and maintaining a hard-coded regex string like that would be a nightmare.
Jason