In our desktop application, we have implemented a simple search engine using an inverted index.

Unfortunately, some of our users' datasets can get very large, e.g. taking up ~1GB of memory before the inverted index has been created. The inverted index itself takes up a lot of memory, almost as much as the data being indexed (another 1GB of RAM).

Obviously this creates problems with out-of-memory errors, as the 32-bit Windows limit of 2GB of memory per application is hit, or users with lower-spec computers struggle to cope with the memory demand.

Our inverted index is stored as a:

Dictionary<string, List<ApplicationObject>>

This is populated during the data load: as each object is processed, the ApplicationObject's key string and description words are stored in the inverted index.

So, my question is: is it possible to store the search index more efficiently space-wise? Perhaps a different structure or strategy needs to be used? Alternatively, is it possible to create a kind of CompressedDictionary? As it stores lots of strings, I would expect it to be highly compressible.
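For concreteness, the construction described above can be sketched roughly as follows. The field names on ApplicationObject are assumptions, not the real ones:

```csharp
using System;
using System.Collections.Generic;

class ApplicationObject
{
    public string Key;          // assumed field name
    public string Description;  // assumed field name
}

class InvertedIndex
{
    private readonly Dictionary<string, List<ApplicationObject>> index =
        new Dictionary<string, List<ApplicationObject>>(StringComparer.OrdinalIgnoreCase);

    public void Add(ApplicationObject obj)
    {
        // Index the key string plus each word of the description.
        foreach (string word in (obj.Key + " " + obj.Description)
                     .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
        {
            List<ApplicationObject> postings;
            if (!index.TryGetValue(word, out postings))
                index[word] = postings = new List<ApplicationObject>();
            postings.Add(obj);
        }
    }

    public IEnumerable<ApplicationObject> Search(string word)
    {
        List<ApplicationObject> postings;
        return index.TryGetValue(word, out postings)
            ? postings
            : (IEnumerable<ApplicationObject>)new ApplicationObject[0];
    }
}
```

Each distinct word ends up as a dictionary key, and each posting list holds a reference per occurrence, which is why the index grows to roughly the same order as the data itself.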

+2  A: 

If it's going to be 1GB... put it on disk. Use something like Berkeley DB. It will still be very fast.

Here is a project that provides a .net interface to it:

http://sourceforge.net/projects/libdb-dotnet

bobwienholt
I'd like to avoid this if possible, as it'll be simpler to have the search index in memory. Perhaps it isn't possible, but it just seems to me like it *should* be.
RickL
+1  A: 

I agree with bobwienholt, but if you are indexing datasets I assume these came from a database somewhere. Would it make sense to just search that with a search engine like DTSearch or Lucene.net?

Andrew Cowenhoven
Perhaps, but I guess that would be more complicated? i.e. the ApplicationObjects are stored across many different tables which map to different specific application objects. Ah, also our application is buffered, so the in-memory dataset could be out of sync with the database.
RickL
+2  A: 

I see a few solutions:

  1. If you have the ApplicationObjects in an array, store just the array index; it might be smaller.
  2. You could use a bit of C++/CLI to store the dictionary, using UTF-8.
  3. Don't bother storing all the different strings; use a Trie.
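Suggestions 1 and 3 can be sketched together: a trie keyed character by character, with terminal nodes holding int indices into a shared ApplicationObject array. This is my own illustrative sketch, not the answerer's code; shared prefixes are stored once, and a 4-byte index replaces each object reference:

```csharp
using System;
using System.Collections.Generic;

class TrieIndex
{
    private class Node
    {
        public Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public List<int> Postings;  // indices into the object array; null until needed
    }

    private readonly Node root = new Node();

    public void Add(string word, int objectIndex)
    {
        Node node = root;
        foreach (char c in word)
        {
            Node child;
            if (!node.Children.TryGetValue(c, out child))
                node.Children[c] = child = new Node();
            node = child;
        }
        if (node.Postings == null) node.Postings = new List<int>();
        node.Postings.Add(objectIndex);
    }

    public IList<int> Find(string word)
    {
        Node node = root;
        foreach (char c in word)
            if (!node.Children.TryGetValue(c, out node))
                return new int[0];
        return (IList<int>)node.Postings ?? new int[0];
    }
}
```

Whether this actually saves memory depends on how much prefix sharing the vocabulary has; per-node dictionaries carry their own overhead, so it's worth measuring before committing.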
MSalters
For point 1, they're not stored in an array, but did you mean store an index instead of the string key? Then how would you search by the strings? Or did you mean having a List<int> instead of the List<ApplicationObject>? I guess it might be smaller, but probably not by a huge amount.
RickL
+1  A: 

I suspect you may find you've got a lot of very small lists.

I suggest you find out roughly what the frequency distribution is like: how many of your dictionary entries have single-element lists, how many have two-element lists, etc.

You could potentially store several separate dictionaries: one for "I've only got one element" (a direct mapping), then "I've got two elements" (mapping to a Pair struct with the two references in), and so on until it becomes silly (quite possibly at about 3 entries), at which point you go back to normal lists. Encapsulate the whole lot behind a simple interface (add entry / retrieve entries). That way you'll have a lot less wasted space (mostly empty buffers, counts, etc.).

If none of this makes much sense, let me know and I'll try to come up with some code.
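One way to sketch the idea (my interpretation, not the answerer's code) is a single dictionary whose value is either the lone object itself, with no list overhead at all, or a real list once a key collects more than one entry:

```csharp
using System;
using System.Collections.Generic;

// Size-specialized storage: the common single-entry case stores the item
// directly; only keys with multiple entries pay for a List allocation.
class CompactIndex<T> where T : class
{
    private readonly Dictionary<string, object> map = new Dictionary<string, object>();

    public void Add(string key, T item)
    {
        object existing;
        if (!map.TryGetValue(key, out existing))
            map[key] = item;                              // single entry, stored directly
        else if (existing is T)
            map[key] = new List<T> { (T)existing, item }; // promote to a list
        else
            ((List<T>)existing).Add(item);
    }

    public IEnumerable<T> Get(string key)
    {
        object existing;
        if (!map.TryGetValue(key, out existing)) yield break;
        T single = existing as T;
        if (single != null) { yield return single; yield break; }
        foreach (T item in (List<T>)existing) yield return item;
    }
}
```

The trade-off is a type check on every lookup and boxing-free storage only for reference types, but for a vocabulary dominated by rare words the saved List allocations can add up.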

Jon Skeet
That's an interesting observation... yes, I would think that most of the lists would be very small. With your suggestion, I guess that the inverted index creation would take longer, as you would have to move items between the 1-item, 2-item, etc. dictionaries, but it could potentially save space.
RickL
I suspect the difference in performance would be pretty small, to be honest - but yes, there would be some overhead. Definitely worth checking the distribution before coding it up though :)
Jon Skeet
One thought to make it potentially cheaper to start with: just have a single Dictionary<string, IEnumerable<ApplicationObject>>. It means having an object per value, rather than using structs to avoid that, but you'd only ever have to replace the entry in the dictionary instead of doing a remove/add
Jon Skeet
True, I will try this out tomorrow and report what difference it made.
RickL
I investigated, and think this change would have some impact although not a huge impact. e.g. about 50% of the lists are 1, 2 or 3 elements long. But overall those lists account for about 5% of the total number of elements.
RickL
A: 

Is the index only added to or do you remove keys from it as well?

Lasse V. Karlsen
Keys should be removed from the index if and when the referenced ApplicationObject is deleted.
RickL
A: 

You could take the approach Lucene did. First, you create a random-access in-memory stream (System.IO.MemoryStream); this stream mirrors an on-disk one, but only a portion of it (if you have the wrong portion, load another one off the disk). This does cause one headache: you need a file-mappable format for your dictionary. Wikipedia has a description of the paging technique.

On the file-mappable scenario: if you open up Reflector and reflect the Dictionary class, you will see that it comprises buckets. You could probably use each of these buckets as a page in a physical file (that way inserts are faster). You can then also loosely delete values by simply appending an "item x deleted" record to the file, and clean the file up every so often.

By the way, buckets hold values with identical hashes. It is very important that the values you store override the GetHashCode() method (and the compiler will warn you about Equals(), so override that as well). You will get a significant speed increase in lookups if you do this.
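A minimal sketch of the override pair, assuming the ApplicationObject's identity is its key string (a hypothetical field, as the real class isn't shown). Equal objects must produce equal hash codes, so both members are based on the same field:

```csharp
using System;

class ApplicationObject
{
    public readonly string Key;  // hypothetical identity field
    public ApplicationObject(string key) { Key = key; }

    // Base the hash on the same field Equals compares, so equal
    // objects always land in the same dictionary bucket.
    public override int GetHashCode()
    {
        return Key == null ? 0 : Key.GetHashCode();
    }

    public override bool Equals(object obj)
    {
        var other = obj as ApplicationObject;
        return other != null && string.Equals(Key, other.Key);
    }
}
```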

Jonathan C Dickinson
A: 

How about using the Win32 memory-mapped file API to transparently back your memory structure?

http://www.eggheadcafe.com/articles/20050116.asp has the PInvokes necessary to enable it.
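The linked article uses P/Invoke; from .NET 4 onwards the System.IO.MemoryMappedFiles namespace wraps the same Win32 API, so a managed sketch (file path here is hypothetical) avoids the interop declarations entirely:

```csharp
using System.IO;
using System.IO.MemoryMappedFiles;

static class MappedIndexDemo
{
    // Maps an on-disk file into the address space, writes an int at
    // offset 0, and reads it back; the OS pages regions in and out
    // on demand instead of the app holding the whole file in RAM.
    public static int RoundTrip(string path)
    {
        File.WriteAllBytes(path, new byte[1024]);  // placeholder index file

        using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
        using (MemoryMappedViewAccessor view = mmf.CreateViewAccessor())
        {
            view.Write(0, 12345);
            return view.ReadInt32(0);
        }
    }
}
```

The appeal over an explicit Berkeley DB-style store is that reads and writes look like ordinary memory access while the OS handles the paging.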

stephbu