Dear fellas,

I'm trying to perform dictionary-based NER on some documents. My dictionary, regardless of the underlying data type, consists of key-value pairs of strings. I want to search for all the keys in each document and return the corresponding value whenever a key matches.

The problem is that my dictionary is fairly large: about 7 million key-value pairs, with an average key length of 8 characters and an average value length of 20 characters.

I've tried LingPipe with MapDictionary, but in my target environment it runs out of memory after about 200,000 entries are inserted. I don't understand why LingPipe uses a map rather than a hash map in its algorithm.

I have no previous experience with Lucene, and I'd like to know whether it would make a lookup at this scale feasible in an easier way.

P.S. I've already tried splitting the data into several dictionaries and writing them to disk, but that is relatively slow.

Thanks for any help.

Cheers, Parsa

+1  A: 

I suppose if you wanted to reuse LingPipe's ExactDictionaryChunker to do the NER, you could override their MapDictionary to store and retrieve entries from a key/value database of your choice instead of their ObjectToSet (which does extend HashMap, by the way).

Lucene/Solr can be used as a key/value store, but if you don't need the extra search capabilities and only want a pure lookup, other options might be better for what you're doing.

msbmsb
Can you give me some advice on overriding MapDictionary? I'm not familiar with their code structure and I'm confused.
parsa28
Extend the MapDictionary class and override the addEntry, iterator and phraseEntryIt methods to persist to and retrieve from an external data store. Currently, MapDictionary uses an ObjectToSet, which is a kind of HashMap, to hold the entries in memory. It sounds like you want to keep them in some kind of key/value store instead, so the new class and its overridden methods would talk to the external DB rather than to the ObjectToSet.
msbmsb
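To make the shape of that suggestion concrete, here is a minimal sketch that mirrors the three methods named above without depending on LingPipe. Everything in it is an assumption made for illustration: `KeyValueStore`, `InMemoryStore`, `PhraseEntry` and `StoreBackedDictionary` are hypothetical names, and the method signatures are simplified, so check the real `MapDictionary` Javadoc before subclassing it.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical client interface for an external key/value database. */
interface KeyValueStore {
    void put(String key, String value);
    List<String> get(String key);                // all values stored under key
    Iterator<Map.Entry<String, String>> scan();  // every (key, value) pair
}

/** In-memory stand-in so the sketch runs; a real store would live on disk. */
class InMemoryStore implements KeyValueStore {
    private final Map<String, List<String>> data = new TreeMap<>();

    @Override public void put(String key, String value) {
        data.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    @Override public List<String> get(String key) {
        return data.getOrDefault(key, Collections.emptyList());
    }

    @Override public Iterator<Map.Entry<String, String>> scan() {
        // The stand-in flattens eagerly; a real store would stream from a cursor.
        List<Map.Entry<String, String>> flat = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : data.entrySet()) {
            for (String value : e.getValue()) {
                flat.add(new AbstractMap.SimpleImmutableEntry<>(e.getKey(), value));
            }
        }
        return flat.iterator();
    }
}

/** Simplified (phrase, category) pair; not LingPipe's DictionaryEntry. */
final class PhraseEntry {
    final String phrase;
    final String category;

    PhraseEntry(String phrase, String category) {
        this.phrase = phrase;
        this.category = category;
    }
}

/** Mirrors addEntry, iterator and phraseEntryIt, delegating to the store. */
class StoreBackedDictionary implements Iterable<PhraseEntry> {
    private final KeyValueStore store;

    StoreBackedDictionary(KeyValueStore store) {
        this.store = store;
    }

    /** Persists the entry in the store instead of an in-memory map. */
    void addEntry(PhraseEntry entry) {
        store.put(entry.phrase, entry.category);
    }

    /** Looks up only the entries for one phrase. */
    Iterator<PhraseEntry> phraseEntryIt(String phrase) {
        List<PhraseEntry> matches = new ArrayList<>();
        for (String category : store.get(phrase)) {
            matches.add(new PhraseEntry(phrase, category));
        }
        return matches.iterator();
    }

    /** Walks the whole store, building each entry only when it is requested. */
    @Override public Iterator<PhraseEntry> iterator() {
        final Iterator<Map.Entry<String, String>> rows = store.scan();
        return new Iterator<PhraseEntry>() {
            @Override public boolean hasNext() { return rows.hasNext(); }
            @Override public PhraseEntry next() {
                Map.Entry<String, String> row = rows.next();
                return new PhraseEntry(row.getKey(), row.getValue());
            }
        };
    }
}
```

In a real setup, `InMemoryStore` would be replaced by a thin wrapper around whichever on-disk key/value database is chosen. The dictionary class itself holds no entries, so its memory footprint stays small regardless of the number of rows.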
Kudos, so much appreciated.
parsa28
I did as you said, but the only problem is that I'm obviously getting an Iterator<MyType> from my key-value DB, which can't be cast to Iterator<DictionaryEntry<C>>. So how should I wrap my DB's iterator to get an Iterator<DictionaryEntry<C>>? Sorry for my Java noobness.
parsa28
I'm now overriding the next method of Iterator<DictionaryEntry<C>> and creating a new DictionaryEntry instance from each key-value pair in the DB. But those instances are never released, and I run out of heap space after some time. Am I doing something crazy here? Thanks.
parsa28
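For reference, the wrapping approach described in the last two comments can be sketched as a small adapter that converts one row at a time. The types here are assumptions for illustration only: `DbRow` stands in for the database's own row type (MyType above), `CategoryEntry` stands in for DictionaryEntry<C>, and `MappingIterator` is a hypothetical helper, not part of LingPipe.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.function.Function;

/** Hypothetical row type returned by the key/value database. */
final class DbRow {
    final String key;
    final String value;

    DbRow(String key, String value) {
        this.key = key;
        this.value = value;
    }
}

/** Simplified stand-in for DictionaryEntry<C>; not the LingPipe class. */
final class CategoryEntry {
    final String phrase;
    final String category;

    CategoryEntry(String phrase, String category) {
        this.phrase = phrase;
        this.category = category;
    }
}

/** Converts elements one at a time and keeps no reference to past results. */
final class MappingIterator<S, T> implements Iterator<T> {
    private final Iterator<S> source;
    private final Function<? super S, ? extends T> convert;

    MappingIterator(Iterator<S> source, Function<? super S, ? extends T> convert) {
        this.source = source;
        this.convert = convert;
    }

    @Override public boolean hasNext() {
        return source.hasNext();
    }

    @Override public T next() {
        // Builds the converted element only when the caller asks for it.
        return convert.apply(source.next());
    }
}

class MappingIteratorDemo {
    public static void main(String[] args) {
        Iterator<DbRow> rows = Arrays.asList(
                new DbRow("paris", "CITY"),
                new DbRow("aspirin", "DRUG")).iterator();
        Iterator<CategoryEntry> entries = new MappingIterator<DbRow, CategoryEntry>(
                rows, row -> new CategoryEntry(row.key, row.value));
        while (entries.hasNext()) {
            CategoryEntry e = entries.next();
            System.out.println(e.phrase + " -> " + e.category);
        }
    }
}
```

An adapter like this does not retain the entries it hands out, so each one becomes eligible for garbage collection once the caller drops it. If the heap still fills up, the retention is probably elsewhere. It may be worth checking whether the database cursor buffers rows, or whether the consumer (for example the chunker, when it is constructed) copies every entry into its own in-memory structure.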