Dear fellas,
I'm trying to perform dictionary-based NER on a set of documents. My dictionary, whatever datatype ends up holding it, consists of key-value pairs of strings. I want to search the documents for every key and, whenever a match occurs, return that key's corresponding value.
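To illustrate with made-up entries, the mapping I'm after is essentially this:

```java
import java.util.HashMap;
import java.util.Map;

public class Example {
    public static void main(String[] args) {
        // Hypothetical entries; the real dictionary has ~7 million of these.
        Map<String, String> dict = new HashMap<>();
        dict.put("aspirin", "DRUG");
        dict.put("New York City", "LOCATION");

        // Wherever a document contains the key "aspirin",
        // I want the value "DRUG" reported at the match position.
        System.out.println(dict.get("aspirin")); // DRUG
    }
}
```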
The problem is that my dictionary is fairly large: ~7 million key-value pairs, with keys averaging 8 characters and values averaging 20 characters.
I've tried LingPipe with MapDictionary, but in my target environment it runs out of memory after about 200,000 entries are inserted. I also don't really understand why LingPipe uses a map rather than a hash map in its implementation.
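In case it helps, my attempt looks roughly like this (the sample entry is made up; the real pairs are read from a file):

```java
import com.aliasi.chunk.ExactDictionaryChunker;
import com.aliasi.dict.DictionaryEntry;
import com.aliasi.dict.MapDictionary;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;

public class DictNer {
    public static void main(String[] args) {
        MapDictionary<String> dictionary = new MapDictionary<String>();
        // One sample entry; in reality this loop runs ~7 million times
        // and dies with an OutOfMemoryError after ~200,000 insertions.
        dictionary.addEntry(new DictionaryEntry<String>("aspirin", "DRUG", 1.0));

        ExactDictionaryChunker chunker = new ExactDictionaryChunker(
                dictionary,
                IndoEuropeanTokenizerFactory.INSTANCE,
                false,   // returnAllMatches
                false);  // caseSensitive
        System.out.println(chunker.chunk("patient took aspirin daily"));
    }
}
```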
The thing is, I have no prior experience with Lucene, and I'd like to know whether it can handle a dictionary of this size in an easier way.
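For context, this is a rough sketch of what I have in mind, assuming a recent Lucene version (untested; the field names and the sample pair are my own guesses): store each pair as a tiny document in an on-disk index, then look keys up with exact term queries instead of holding everything in memory.

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LuceneDictLookup {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("dict-index"));

        // Indexing: one Lucene document per key-value pair (one sample shown).
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new KeywordAnalyzer()));
        Document doc = new Document();
        doc.add(new StringField("key", "aspirin", Field.Store.NO)); // exact, untokenized
        doc.add(new StoredField("value", "DRUG"));                  // returned on match
        writer.addDocument(doc);
        writer.close();

        // Lookup: exact term query against the on-disk index.
        DirectoryReader reader = DirectoryReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);
        TopDocs hits = searcher.search(new TermQuery(new Term("key", "aspirin")), 1);
        if (hits.scoreDocs.length > 0) {
            System.out.println(searcher.doc(hits.scoreDocs[0].doc).get("value"));
        }
        reader.close();
    }
}
```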
P.S. I've already tried chunking the data into several dictionaries and writing them to disk, but that approach is relatively slow.
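Roughly, that chunked attempt serializes HashMap shards to disk and reloads them one at a time during matching, which is where the slowness comes from (simplified sketch; the shard naming is arbitrary):

```java
import java.io.*;
import java.util.HashMap;

public class ShardedDict {
    // Write one shard of the dictionary to disk via plain Java serialization.
    static void writeShard(HashMap<String, String> shard, int i) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(new FileOutputStream("shard-" + i + ".bin")))) {
            out.writeObject(shard);
        }
    }

    // Load one shard back; during matching I iterate over all shards per document.
    @SuppressWarnings("unchecked")
    static HashMap<String, String> readShard(int i) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(new FileInputStream("shard-" + i + ".bin")))) {
            return (HashMap<String, String>) in.readObject();
        }
    }
}
```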
Thanks for any help.
Cheers,
Parsa