I'm writing a program that matches a user-submitted query against a list of keywords. The list has about 2,000 words, and performance is the most important factor.

Original question:

Is it faster to store this list in a SQL table or to hard-code it in the source code? The list does not need to be updated often.

If a SQL table is faster, which data types would be best? (int, nvarchar?)

If a hardcoded list is faster, what data type would be best? (A List?)

Any suggestions?

Edit: What is the best in-memory data structure for fast lookups?

A: 

The list does not need to be updated often

I say that if it ever needs to be updated, it does not belong in the source code.

Marek Karbarz
maybe some minor changes once a month
program10
@program10: Marek K is right. Especially if it's once per month; that is far too frequent for the list to be embedded in the source code.
Jason
That's a lot of source maintenance in the long run, and completely unnecessary downtime whenever the new version is deployed. As codymanix has mentioned, just load it from SQL and cache it; that's a much more maintainable option than having to modify the source code just to update string values.
Marek Karbarz
+4  A: 

Where you store this data has no effect on lookup performance.

When your program starts, load the string array once from whichever data store holds it. You can then use that array the whole time, until the program exits.
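As a sketch of this load-once approach (in Python for brevity, since the asker is new to C#; the file name and the helper names are hypothetical, and the same load-at-startup shape carries over directly to a C# `HashSet<string>`):

```python
# Minimal sketch: load the keyword list once at startup, keep it in memory,
# and reuse it for every query until the program exits.

def load_keywords(path):
    """Read one keyword per line into a frozenset for O(1) average lookups."""
    with open(path, encoding="utf-8") as f:
        return frozenset(line.strip().lower() for line in f if line.strip())

def matches(query, keywords):
    """Case-insensitive membership test against the cached keyword set."""
    return query.strip().lower() in keywords
```

The point is that the expensive step (reading the file, or hitting SQL) happens exactly once; every query after that is an in-memory lookup.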

codymanix
This is the correct answer. Where he persists the data is effectively irrelevant. The correct question is what is the best in-memory data structure for fast lookups? An array is fine if it's sorted so that binary search can be used (so now we're `O(log n)` where `n` is the number of keywords). If this isn't fast enough (after profiling!) a trie could be considered.
Jason
Yes, this is pedantic... but isn't a binary search on a set of strings really O(log(N) * log(K)), where N is the number of words and K is the median word length?
Jason D
Worst case it's `O(m log n)` where `m` is the maximum word length and `n` is the number of keywords.
Jason
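To make the complexity discussion above concrete, here is a minimal sketch of binary search over a sorted keyword list: O(log n) comparisons, each comparison costing up to O(m) for strings of length m, giving the O(m log n) worst case mentioned. Python is used for brevity; `bisect_left` plays the role that `Array.BinarySearch` would in .NET.

```python
import bisect

def contains(sorted_keywords, word):
    """Binary search: O(log n) comparisons, each up to O(m) string compares."""
    i = bisect.bisect_left(sorted_keywords, word)
    return i < len(sorted_keywords) and sorted_keywords[i] == word

# The list must be sorted once up front for binary search to be valid.
keywords = sorted(["match", "index", "query", "route"])
```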
A: 

The hardcoded list is faster. A database hit to retrieve the list will undoubtedly be slower than pulling it from an in-memory object.

As for which data type to store the values in, an array would probably be faster and take up less memory than a List, but only trivially so.

Kevin Pang
Not if it's cached locally after an initial retrieval (on startup) from the database.
Jason
@Jason - True for subsequent hits. However, the first hit will be slower.
Kevin Pang
@Kevin Pang: The first hit is irrelevant.
Jason
@Jason Most likely, but you don't know that for sure.
Kevin Pang
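A hedged sketch of the cache-on-first-hit pattern being debated above (Python for brevity; `fetch_from_database` is a hypothetical stand-in for the real database call):

```python
# Only the first call pays the retrieval cost; every later call is in-memory.

_cache = None

def get_keywords(fetch_from_database):
    """Return the keyword set, hitting the backing store only on first use."""
    global _cache
    if _cache is None:                       # first hit: slow path
        _cache = frozenset(fetch_from_database())
    return _cache                            # subsequent hits: memory only
```

Whether that single first hit matters is exactly the disagreement in the comments; for a long-running program it is usually amortized away.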
A: 

If the list is largely static and you can afford to spend some time on preparation (i.e. at application start), you'd probably be best off storing the list of keywords in a text file, then using, say, a B* tree to store the keywords internally (assuming you only care about exact matches and not partial matching or Levenshtein distance).

Simon Righarts
A B*-tree is good if he's doing on-the-fly insertions and deletions, but for a static list a sorted array is fine. The complexity of a B*-tree provides no advantage over a sorted array.
Jason
A B* tree isn't necessarily the right answer, but the reason I suggested a tree structure over a sorted array is speed of searching. Searching a tree is O(log N), whereas searching an array (even a sorted one) is O(N).
Simon Righarts
Where can I read more about partial matching or Levenshtein distance? I might need to use those later.
program10
@Simon Righarts: You might want to rethink that last statement about searching a sorted array.
Jason
@Jason: Sigh. I should make thinking a prerequisite to posting. (Somehow I confused searching an unsorted array, which is O(N), with searching a sorted array, which is O(log N) with binary search.) @program10: http://en.wikipedia.org/wiki/Levenshtein_distance (and the "See also" links at the bottom) should give you a reasonable starting point.
Simon Righarts
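Since program10 asked about it above: Levenshtein distance (the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another) has a short dynamic-programming implementation. A sketch of the classic two-row algorithm, in Python for brevity:

```python
def levenshtein(a, b):
    """Edit distance via dynamic programming: O(len(a) * len(b)) time."""
    prev = list(range(len(b) + 1))           # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]                            # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution (free if equal)
        prev = cur
    return prev[-1]
```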
+4  A: 

IMO, if the list doesn't get updated often, store it in a file (text/XML), then cache it in your application so that subsequent requests are faster.

jerjer
+1 - This is the ideal situation for a cache. Have a look at http://stackoverflow.com/questions/1308354/asp-net-caching-vs-static-variable-for-storing-a-dictionary
Sandy
+2  A: 

Okay, to respond to your edit (and basically lifting my comment into an answer):

  1. Specify in advance the performance that you are expecting.

  2. Code your application against a sorted array, using a binary search to look up keywords in the array. This is very simple to implement and gives decent performance. Then profile to see if it meets the performance that you demand. If it does, move on. The worst-case performance here is O(m log n), where n is the number of keywords and m is the maximum length of your keywords.

  3. If the performance in step two is not acceptable, use a trie (also known as a prefix tree). The expected performance here is O(m), where m is the maximum length of your keywords. Profile to see if this meets your expected performance. If it does not, revisit your performance criteria; they might have been unreasonable.

  4. If you are still not meeting your performance specifications, consider using a hashtable (in .NET, you would use a HashSet<string>). While a hashtable has worse worst-case performance, it could have better average-case performance: if there are no collisions, a hashtable lookup is O(1) once the hash is computed, and computing the hash is O(m), where m is the maximum length of your keywords. This might be faster on average, but probably not noticeably so.

You might even consider skipping directly to the last step (a HashSet is less complex to use than a trie). It all depends on your needs. Tries have the advantage that you can easily spit out the closest matching keyword, for example.

The important thing here is to have a specification of your performance requirements, and to profile! Use the simplest implementation that meets those requirements, for maintainability, readability and implementability (if that's not a word, it is now!).
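As a sketch of step 3 above: a minimal trie with O(m) lookup, where m is the length of the word being looked up. The answer targets C#, but Python keeps the sketch compact; the same node-with-children shape translates directly to a C# class with a `Dictionary<char, Node>`.

```python
# Minimal trie (prefix tree): lookup cost depends only on the word's length,
# not on the number of stored keywords.

class Trie:
    def __init__(self):
        self.children = {}     # char -> child Trie node
        self.terminal = False  # True if a keyword ends exactly at this node

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.terminal = True

    def contains(self, word):
        node = self
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.terminal
```

The `terminal` flag is what distinguishes a stored keyword from a mere prefix of one.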

Jason
Thanks for your detailed response. I'm new to C#, and I was just going to write a regex like `"wordA|wordB|word\sC"` — is that going to be fast enough with 2,000 words? Would you mind providing some sample code for the methods you mentioned? I'm still learning. Thank you.
program10
Do not use a regex like that. It's not the right tool for the job, and maintaining a hard-coded regex string like that would be a nightmare.
Jason