I am building an anagram generator as a coding exercise; it uses a word list that's about 633,000 lines long (one word per line). I originally wrote the program in plain Ruby, and I would like to modify it so I can deploy it online.

My hosting service supports Ruby on Rails as about the only Ruby-based solution. I thought of hosting on my own machine, and using a smaller framework, but I don't want to deal with the security issues at this moment.

I have only used RoR for database-driven (CRUD) apps, and I have never populated an SQLite database from a plain word list like this, so this is a two-part question:

1) Should I import this into a database? If so, what's the best method of doing so? I would like to stick with SQLite to keep things simple, if that's the case.

2) Is a 'flat file' better? I won't be doing any creating or updating, just checking against the list of words.

Thank you.

A: 

Assuming that what you're doing is looking up whether a word exists in your list, I would say that SQLite with an indexed column will likely be faster than scanning through the word list linearly. Now, if your current approach is fast enough for your purposes, then I see no reason to bother porting it over to a database; it's just an added headache for no gain as far as you're concerned. If you're seeing the search times become a burden, then dumping it into an indexed database would be a good idea.

You can create the table with the following schema:

CREATE TABLE words (
    word TEXT PRIMARY KEY
);

-- Optional: a TEXT primary key already gets an implicit index in SQLite,
-- so this separate index is redundant but harmless.
CREATE INDEX word_idx ON words(word);

And import your data with:

sqlite3 words.db < schema.sql
# Note: this runs one sqlite3 process (and one transaction) per word,
# so it will be slow on a 633,000-line list.
while read -r word
do
    sqlite3 words.db "INSERT INTO words VALUES('$word');"
done < words.txt
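
A faster alternative, sketched here with the sqlite3 Ruby gem, is to do all the inserts inside a single transaction (the file names words.txt and words.db just follow the example above; adjust as needed):

require "sqlite3"

db = SQLite3::Database.new("words.db")
db.execute("CREATE TABLE IF NOT EXISTS words (word TEXT PRIMARY KEY)")

# One transaction for the whole list, instead of one sqlite3 process per word.
db.transaction do
  stmt = db.prepare("INSERT OR IGNORE INTO words VALUES (?)")
  File.foreach("words.txt") { |line| stmt.execute(line.chomp) }
  stmt.close
end
db.close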
Brian Campbell
+2  A: 

How about keeping it in memory? Storing that many words would take only a few megabytes of RAM, and otherwise you'd be accessing the file frequently, so it would probably end up in the OS cache anyway. The advantage of keeping the word list in memory is that you can organize it in whatever data structure suits your needs best (I'm thinking of a trie). If you can't spare that much memory, it might be to your advantage to use a database so you can efficiently load only the parts of the word list you need for any given query; in that case, you'd want at least one indexed column so you can take advantage of SQL's indexing capabilities.
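
A minimal sketch of the in-memory idea, using a nested-Hash trie (the file name words.txt and the :end marker are just illustrative choices):

# Build a trie where each node is a Hash and :end marks a complete word.
def build_trie(path)
  trie = {}
  File.foreach(path) do |line|
    node = trie
    line.chomp.downcase.each_char { |c| node = (node[c] ||= {}) }
    node[:end] = true
  end
  trie
end

# Walk the trie one character at a time; a missing branch means no such word.
def word?(trie, word)
  node = trie
  word.downcase.each_char do |c|
    node = node[c] or return false
  end
  node.key?(:end)
end

trie = build_trie("words.txt")
p word?(trie, "listen")   # => true if "listen" is in the list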

David Zaslavsky
A: 

I would skip the database for the reasons listed above. A simple hash in memory will perform lookups about as fast as the database.

Even if the database were a bit faster for the lookup itself, you're still spending time having the DB parse the query, build a plan for the lookup, assemble the results, and send them back to your program. Plus, you save yourself a dependency.

If you plan on moving other parts of your program to a persistent store, then go for it. But a hashmap should be sufficient for your use.
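
A sketch of the in-memory hash approach, assuming the list is in words.txt. For an anagram generator specifically, keying the hash on each word's sorted letters makes looking up all anagrams of a string a single hash access:

# Map sorted letters => every word in the list made of those letters.
anagrams = Hash.new { |h, k| h[k] = [] }
File.foreach("words.txt") do |line|
  word = line.chomp.downcase
  next if word.empty?
  anagrams[word.chars.sort.join] << word
end

p anagrams["silent".chars.sort.join]   # e.g. ["enlist", "listen", "silent"]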

Trey Stout
The database is going to be at least a hundred times slower than a hash in memory. But there is no need to hash: just sort the list, use the first letter as a 26-way radix, and then do a binary search.
Stephan Eggermont
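
A sketch of the sorted-list idea from the comment above, again assuming words.txt; Ruby's built-in Array#bsearch handles the binary search, so the 26-way first-letter radix is left out here:

# Load, normalise, and sort once; then each lookup is a binary search.
words = File.readlines("words.txt", chomp: true).map(&:downcase).sort

def word?(words, target)
  !words.bsearch { |w| target <=> w }.nil?
end

p word?(words, "listen")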