HashSet of Strings taking up too much memory, suggestions...?

tags:

java
memory

views:

207

answers:

+8 Q:

HashSet of Strings taking up too much memory, suggestions...?

Hello.

I am currently storing a list of words (around 120,000) in a HashSet, for the purpose of using as a list to check enetered words against to see if they are spelt correctly, and just returning yes or no.

I was wondering if there is a way to do this which takes up less memory. Currently 120,000 words is around 12meg, the actual file the words are read from is around 900kb.

Any suggestions?

Thanks in advance

+9 A:

You could use a prefix tree or trie: http://en.wikipedia.org/wiki/Trie

sandris 2010-07-12 11:35:49

Just looking now for any trie or DAWG code i can use...!

Dori 2010-07-12 11:47:00

I have tried implementing the trie code from http://forums.sun.com/thread.jspa?threadID=5295936 with the extra suggestions at bottom and it comes in 3 times larger at 31meg, whats going on here?!

Dori 2010-07-12 12:35:43

At first sight, the line <code>children = new TrieNode[26];</code> seems unreasonable, it always allocates memory for all potential children.

sandris 2010-07-12 13:44:59

Maybe it's not much memory, though ... Keeping a reference to the parent might be superfluous, too, if you only want to check whether a word is in the dictionary.

sandris 2010-07-12 14:05:23

+3 A:

HashSet is probably not the right structure for this. Use Trie instead.

Aaron Digulla 2010-07-12 11:40:01

The problem is by design: Storing such a huge amount of words in a HashSet for spell-check-reasons isn't a good idea:

You can either use a spell-checker (example: http://softcorporation.com/products/spellcheck/ ), or you can buildup a "auto-wordcompletion" with a prefix tree ( description: http://en.wikipedia.org/wiki/Trie ).

There is no way to reduce memory-usage in this design.

Erik 2010-07-12 11:40:13

What parameters are you passing into the HashSet constructor? Are you playing round with the loadFactor parameter? Can you paste some appropriate code here?

Noel M 2010-07-12 11:43:47

The array of the `HashSet` isn't really important. It's mostly the object overhead of the `Entry`, the `String` and the `char[]`.

Tom Hawtin - tackline 2010-07-12 11:47:58

+1 A:

One way to save memory to save memory is to use a radix tree. This is better than a trie as the prefixes are not stored redundantly.

As your dictionary is fixed another way is to build a perfect hash function for it. Your hash set does not need buckets (and the associated overhead) as there cannot be collisions. Every implementation of a hash table/hash set that uses open addressing can be used for this (like google collection's ImmutableSet).

Thomas Jung 2010-07-12 11:45:44

+2 A:

Check out bloom filters or cuckoo hashing. http://stackoverflow.com/questions/867099/bloom-filter-or-cuckoo-hashing

I am not sure if this is the answer for your question but worth looking into these alternatives. bloom filters are mainly used for spell checker kind of use cases.

Pangea 2010-07-12 12:51:27

Or http://stackoverflow.com/questions/2294915/what-algorithm-is-used-to-give-suggestions-in-a-spell-checker/2294926#2294926

Thomas Jung 2010-07-12 13:54:16

It think bloom hashing sound great as i can cope with the 0.01% false positive rate at around 20b per entry, which equates to 200Kb for my 100k word list, which is a vast improvment. Thanks!(will read cuckoo hashing now)

Dori 2010-07-13 10:41:11

Found an open source Java Bloom Filter http://code.google.com/p/java-bloomfilter/

Dori 2010-07-13 11:33:28

You can also try Radix Tree(Wiki,Implementation) .This some what like trie but more memory efficient.

Emil 2010-08-03 06:33:18

Hello,

This might be a bit late but using Google you can easily find the DAWG investigation and C code that I posted a while ago.

http://www.pathcom.com/~vadco/dawg.html

TWL06 - 178,691 words - fits into 494,676 Bytes

The downside of a compressed-shared-node structure is that it does not work as a hash function for the words in your list. That is to say, it will tell you if a word exists, but it will not return an index to related data for a word that does exist.

If you want the perfect and complete hash functionality, in a processor-cache sized structure, you are going to have to read, understand, and modify a data structure called the ADTDAWG. It will be slightly larger than a traditional DAWG, but it is faster and more useful.

http://www.pathcom.com/~vadco/adtdawg.html

All the very best,

JohnPaul Adamovsky

JohnPaul Adamovsky 2010-09-03 18:40:06

ansaurus

tags:

views:

answers:

HashSet of Strings taking up too much memory, suggestions...?

related questions