In our desktop application, we have implemented a simple search engine using an inverted index.

Unfortunately, some of our users' datasets can get very large, e.g. taking up ~1GB of memory before the inverted index has been created. The inverted index itself takes up a lot of memory, almost as much as the data being indexed (another 1GB of RAM).

Obviously this creates problems with out-of-memory errors, as the 32-bit Windows limit of 2GB of memory per application is hit, or users with lower-spec computers struggle to cope with the memory demand.

Our inverted index is stored as a:

Dictionary<string, List<ApplicationObject>>

This is populated during the data load: as each object is processed, the ApplicationObject's key string and description words are stored in the inverted index.

So, my question is: is it possible to store the search index more efficiently space-wise? Perhaps a different structure or strategy needs to be used? Alternatively, is it possible to create a kind of CompressedDictionary? As it stores lots of strings, I would expect it to be highly compressible.
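For concreteness, the construction described above can be sketched roughly as follows. The field names on ApplicationObject are assumptions, not the real ones:

```csharp
using System;
using System.Collections.Generic;

class ApplicationObject
{
    public string Key;          // assumed field name
    public string Description;  // assumed field name
}

class InvertedIndex
{
    private readonly Dictionary<string, List<ApplicationObject>> index =
        new Dictionary<string, List<ApplicationObject>>(StringComparer.OrdinalIgnoreCase);

    public void Add(ApplicationObject obj)
    {
        // Index the key string plus each word of the description.
        foreach (string word in (obj.Key + " " + obj.Description)
                     .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
        {
            List<ApplicationObject> postings;
            if (!index.TryGetValue(word, out postings))
                index[word] = postings = new List<ApplicationObject>();
            postings.Add(obj);
        }
    }

    public IEnumerable<ApplicationObject> Search(string word)
    {
        List<ApplicationObject> postings;
        return index.TryGetValue(word, out postings)
            ? postings
            : (IEnumerable<ApplicationObject>)new ApplicationObject[0];
    }
}
```

Each distinct word ends up as a dictionary key, and each posting list holds a reference per occurrence, which is why the index grows to roughly the same order as the data itself.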

+2  A: 

If it's going to be 1GB... put it on disk. Use something like Berkeley DB. It will still be very fast.

Here is a project that provides a .net interface to it:

http://sourceforge.net/projects/libdb-dotnet

bobwienholt
I'd like to avoid this if possible, as it'll be simpler to have the search index in memory. Perhaps it isn't possible, but it just seems to me like it *should* be.
RickL
+1  A: 

I agree with bobwienholt, but if you are indexing datasets I assume these came from a database somewhere. Would it make sense to just search that with a search engine like DTSearch or Lucene.net?

Andrew Cowenhoven
Perhaps, but I guess that would be more complicated? i.e. the ApplicationObjects are stored across many different tables which map to different specific application objects. Ah, also our application is buffered, so the in-memory dataset could be out of sync with the database.
RickL
+2  A: 

I see a few solutions:

  1. If you have the ApplicationObjects in an array, store just the array index; it might be smaller.
  2. You could use a bit of C++/CLI to store the dictionary, using UTF-8.
  3. Don't bother storing all the different strings; use a Trie.
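Suggestions 1 and 3 can be sketched together: a trie keyed character by character, with terminal nodes holding int indices into a shared ApplicationObject array. This is my own illustrative sketch, not the answerer's code; shared prefixes are stored once, and a 4-byte index replaces each object reference:

```csharp
using System;
using System.Collections.Generic;

class TrieIndex
{
    private class Node
    {
        public Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public List<int> Postings;  // indices into the object array; null until needed
    }

    private readonly Node root = new Node();

    public void Add(string word, int objectIndex)
    {
        Node node = root;
        foreach (char c in word)
        {
            Node child;
            if (!node.Children.TryGetValue(c, out child))
                node.Children[c] = child = new Node();
            node = child;
        }
        if (node.Postings == null) node.Postings = new List<int>();
        node.Postings.Add(objectIndex);
    }

    public IList<int> Find(string word)
    {
        Node node = root;
        foreach (char c in word)
            if (!node.Children.TryGetValue(c, out node))
                return new int[0];
        return (IList<int>)node.Postings ?? new int[0];
    }
}
```

Whether this actually saves memory depends on how much prefix sharing the vocabulary has; per-node dictionaries carry their own overhead, so it's worth measuring before committing.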
MSalters
For point 1, they're not stored in an array, but did you mean store an index instead of the string key? Then how would you search by the strings? Or did you mean having a List<int> instead of the List<ApplicationObject>? I guess it might be smaller, but probably not by a huge amount.
RickL
+1  A: 

I suspect you may find you've got a lot of very small lists.

I suggest you find out roughly what the frequency distribution is like: how many of your dictionary entries have single-element lists, how many have two-element lists, etc.

You could potentially store several separate dictionaries: one for "I've only got one element" (a direct mapping), then "I've got two elements" (mapping to a Pair struct with the two references in), and so on until it becomes silly (quite possibly at about 3 entries), at which point you go back to normal lists. Encapsulate the whole lot behind a simple interface (add entry / retrieve entries). That way you'll have a lot less wasted space (mostly empty buffers, counts, etc.).

If none of this makes much sense, let me know and I'll try to come up with some code.
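One way to sketch the idea (my interpretation, not the answerer's code) is a single dictionary whose value is either the lone object itself, with no list overhead at all, or a real list once a key collects more than one entry:

```csharp
using System;
using System.Collections.Generic;

// Size-specialized storage: the common single-entry case stores the item
// directly; only keys with multiple entries pay for a List allocation.
class CompactIndex<T> where T : class
{
    private readonly Dictionary<string, object> map = new Dictionary<string, object>();

    public void Add(string key, T item)
    {
        object existing;
        if (!map.TryGetValue(key, out existing))
            map[key] = item;                              // single entry, stored directly
        else if (existing is T)
            map[key] = new List<T> { (T)existing, item }; // promote to a list
        else
            ((List<T>)existing).Add(item);
    }

    public IEnumerable<T> Get(string key)
    {
        object existing;
        if (!map.TryGetValue(key, out existing)) yield break;
        T single = existing as T;
        if (single != null) { yield return single; yield break; }
        foreach (T item in (List<T>)existing) yield return item;
    }
}
```

The trade-off is a type check on every lookup and boxing-free storage only for reference types, but for a vocabulary dominated by rare words the saved List allocations can add up.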

Jon Skeet
That's an interesting observation... yes, I would think that most of the lists would be very small. With your suggestion, I guess that the inverted index creation would take longer, as you would have to move items between the 1-item, 2-item, etc. dictionaries, but it could potentially save space.
RickL
I suspect the difference in performance would be pretty small, to be honest - but yes, there would be some overhead. Definitely worth checking the distribution before coding it up though :)
Jon Skeet
One thought to make it potentially cheaper to start with: just have a single Dictionary<string, IEnumerable<ApplicationObject>>. It means having an object per value, rather than using structs to avoid that, but you'd only ever have to replace the entry in the dictionary instead of doing a remove/add
Jon Skeet
True, I will try this out tomorrow and report what difference it made.
RickL
I investigated, and think this change would have some impact although not a huge impact. e.g. about 50% of the lists are 1, 2 or 3 elements long. But overall those lists account for about 5% of the total number of elements.
RickL
A: 

Is the index only added to or do you remove keys from it as well?

Lasse V. Karlsen
Keys should be removed from the index if and when the referenced ApplicationObject is deleted.
RickL
A: 

You could take the approach Lucene did. First, you create a random-access in-memory stream (System.IO.MemoryStream); this stream mirrors an on-disk one, but only a portion of it (if you have the wrong portion, load another one off the disk). This does cause one headache: you need a file-mappable format for your dictionary. Wikipedia has a description of the paging technique.

On the file-mappable scenario: if you open up Reflector and reflect the Dictionary class, you will see that it comprises buckets. You could probably use each of these buckets as a page in a physical file (that way inserts are faster). You can then also loosely delete values by simply appending an "item x deleted" record to the file, and clean the file up every so often.

By the way, buckets hold values with identical hashes. It is very important that the values you store override the GetHashCode() method (and the compiler will warn you about Equals(), so override that as well). You will get a significant speed increase in lookups if you do this.
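A minimal sketch of the override pair, assuming the ApplicationObject's identity is its key string (a hypothetical field, as the real class isn't shown). Equal objects must produce equal hash codes, so both members are based on the same field:

```csharp
using System;

class ApplicationObject
{
    public readonly string Key;  // hypothetical identity field
    public ApplicationObject(string key) { Key = key; }

    // Base the hash on the same field Equals compares, so equal
    // objects always land in the same dictionary bucket.
    public override int GetHashCode()
    {
        return Key == null ? 0 : Key.GetHashCode();
    }

    public override bool Equals(object obj)
    {
        var other = obj as ApplicationObject;
        return other != null && string.Equals(Key, other.Key);
    }
}
```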

Jonathan C Dickinson
A: 

How about using the Win32 memory-mapped file API to transparently back your memory structure?

http://www.eggheadcafe.com/articles/20050116.asp has the PInvokes necessary to enable it.
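The linked article uses P/Invoke; from .NET 4 onwards the System.IO.MemoryMappedFiles namespace wraps the same Win32 API, so a managed sketch (file path here is hypothetical) avoids the interop declarations entirely:

```csharp
using System.IO;
using System.IO.MemoryMappedFiles;

static class MappedIndexDemo
{
    // Maps an on-disk file into the address space, writes an int at
    // offset 0, and reads it back; the OS pages regions in and out
    // on demand instead of the app holding the whole file in RAM.
    public static int RoundTrip(string path)
    {
        File.WriteAllBytes(path, new byte[1024]);  // placeholder index file

        using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
        using (MemoryMappedViewAccessor view = mmf.CreateViewAccessor())
        {
            view.Write(0, 12345);
            return view.ReadInt32(0);
        }
    }
}
```

The appeal over an explicit Berkeley DB-style store is that reads and writes look like ordinary memory access while the OS handles the paging.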

stephbu