I have a set of 5 million strings, currently stored in a single-column MySQL table. My application has to perform lookups to check whether a given string is in the set. This can of course be done with a HashSet (in Java), but instead of building a custom solution, I was wondering whether there are any existing, widely used, proven solutions for this? It seems like a common scenario. The solution should be scalable (the set might grow beyond 5 million), have failover (so probably distributed), and perform well under a large number of requests. Any suggestions?

Update: My app may also need to check whether a given set of strings is present in the global set (the 5-million-string one).
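For reference, the baseline I'd be comparing against is just a HashSet loaded once at startup; a minimal sketch, assuming the whole set fits in one JVM heap (class and method names are mine):

    import java.util.HashSet;
    import java.util.Set;

    // In-memory membership check; load the 5 million strings once (e.g. via JDBC),
    // then answer lookups from the heap.
    public class StringSetLookup {
        private final Set<String> strings = new HashSet<>();

        public void add(String s) {
            strings.add(s);
        }

        public boolean contains(String s) {
            return strings.contains(s);
        }

        // Batch variant for the update above: true only if every string is present.
        public boolean containsAll(Iterable<String> batch) {
            for (String s : batch) {
                if (!strings.contains(s)) {
                    return false;
                }
            }
            return true;
        }
    }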

+1  A: 

You can try a Trie or a Patricia trie; the second is more memory-efficient. Also, here you can find a comparison of the two data structures (Trie, TreeSet), an in-memory database, and their performance.
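A rough sketch of what a trie-based membership check can look like in Java (my own illustration, not code from the linked project):

    import java.util.HashMap;
    import java.util.Map;

    // Minimal trie for exact-match lookups; one node per character.
    class Trie {
        private static final class Node {
            final Map<Character, Node> children = new HashMap<>();
            boolean terminal; // true if a stored string ends at this node
        }

        private final Node root = new Node();

        void add(String s) {
            Node node = root;
            for (int i = 0; i < s.length(); i++) {
                node = node.children.computeIfAbsent(s.charAt(i), c -> new Node());
            }
            node.terminal = true;
        }

        boolean contains(String s) {
            Node node = root;
            for (int i = 0; i < s.length(); i++) {
                node = node.children.get(s.charAt(i));
                if (node == null) {
                    return false;
                }
            }
            return node.terminal;
        }
    }

A Patricia trie collapses chains of single-child nodes into one edge, which is where its memory savings over a plain trie come from.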

Emil
@Emil - the message in front of the Trie project is not very encouraging: "Note to any visitors, this is fine SAMPLE code, but not production code. It was written by an inexperienced programmer (me at the time) in one evening."
talonx
A: 

While a trie might be the best solution, binary search on a sorted array of strings should also perform well in terms of runtime.
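For example, using the JDK's built-in binary search, assuming the strings are loaded and sorted once up front (a sketch, not a drop-in solution):

    import java.util.Arrays;

    // Sorted-array membership check: O(n log n) to sort once, O(log n) per lookup.
    public class SortedLookup {
        private final String[] sorted;

        public SortedLookup(String[] strings) {
            this.sorted = strings.clone();
            Arrays.sort(this.sorted);
        }

        public boolean contains(String s) {
            return Arrays.binarySearch(sorted, s) >= 0;
        }
    }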

michid
+1  A: 

Try memcached, a high-performance, distributed memory object caching system. You look up values by hashed keys. Facebook uses memcached, as do many other highly scalable sites. Need to store more strings? Just add more memcached instances to the cluster. Plus, you can use it in a two-tier caching setup where you query memcached first and, on a cache miss, fall back to the full database.
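A sketch of that two-tier lookup with the spymemcached client (the client library, key scheme, and expiry are my assumptions; it also assumes the strings are valid memcached keys, i.e. no whitespace and under 250 bytes):

    import java.util.concurrent.TimeUnit;

    import net.spy.memcached.AddrUtil;
    import net.spy.memcached.MemcachedClient;

    // Two-tier membership check: ask memcached first, fall back to MySQL on a miss.
    public class CachedLookup {
        private final MemcachedClient cache;

        public CachedLookup(String servers) throws java.io.IOException {
            // e.g. "cache1:11211 cache2:11211"
            cache = new MemcachedClient(AddrUtil.getAddresses(servers));
        }

        public boolean contains(String s) {
            if (cache.get(s) != null) {          // tier 1: memcached
                return true;
            }
            boolean inDb = lookupInMySql(s);     // tier 2: the authoritative table
            if (inDb) {
                cache.set(s, (int) TimeUnit.HOURS.toSeconds(1), Boolean.TRUE);
            }
            return inDb;
        }

        private boolean lookupInMySql(String s) {
            // Indexed point lookup against the MySQL table goes here (see below).
            return false;
        }
    }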

Have you considered adding an index on the string column in your MySQL database? Hash, B-tree, and R-tree indexes are supported.
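The indexed point lookup through JDBC could then look roughly like this (table and column names are placeholders):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class MySqlLookup {
        // Assumes an index such as: CREATE UNIQUE INDEX idx_value ON strings (value)
        private static final String QUERY =
                "SELECT 1 FROM strings WHERE value = ? LIMIT 1";

        private final Connection connection;

        public MySqlLookup(Connection connection) {
            this.connection = connection;
        }

        public boolean contains(String s) throws SQLException {
            try (PreparedStatement ps = connection.prepareStatement(QUERY)) {
                ps.setString(1, s);
                try (ResultSet rs = ps.executeQuery()) {
                    return rs.next(); // any row means the string exists
                }
            }
        }
    }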

MySQL can also be replicated and clustered for high scalability.

burkestar
How does it solve the problem?
reinierpost
It's a distributed hashing system for efficient key/value lookups.
burkestar