views:

4082

answers:

11

edit: many thanks for all the answers. Here are the results after applying the optimisations so far:

  • Switching to sorting the characters and run length encoding - new DB size 42M
  • Dropping the indexes on the booleans - new DB size 33M

The really nice part is this hasn't required any changes in the iphone code

I have an iphone application with a large dictionary held in sqlite format (read only). I'm looking for ideas to reduce the size of the DB file, which is currently very large.

Here is the number of entries and resulting size of the sqlite DB:

franks-macbook:DictionaryMaker frank$ ls -lh dictionary.db
-rw-r--r--  1 frank  staff    59M  8 Oct 23:08 dictionary.db
franks-macbook:DictionaryMaker frank$ wc -l dictionary.txt
  453154 dictionary.txt

...an average of about 135 bytes per entry.

Here is my DB schema:

create table words (word text primary key, sowpods boolean, twl boolean, signature text)
create index sowpods_idx on words(sowpods)
create index twl_idx on words(twl)
create index signature_idx on words(signature)

Here is some sample data:

photoengrave|1|1|10002011000001210101010000
photoengraved|1|1|10012011000001210101010000
photoengraver|1|1|10002011000001210201010000
photoengravers|1|1|10002011000001210211010000
photoengraves|1|1|10002011000001210111010000
photoengraving|1|1|10001021100002210101010000

The last field represents the letter frequencies for anagram retrieval (each position is in the range 0..9). The two booleans represent sub dictionaries.

I need to do queries such as:

select signature from words where word = 'foo'
select word from words where signature = '10001021100002210101010000' order by word asc
select word from words where word like 'foo' order by word asc
select word from words where word = 'foo' and (sowpods='1' or twl='1')

One idea I have is to encode the letter frequencies more efficiently, e.g. binary encode them as a blob (perhaps with RLE as there are many zeros?). Any ideas for how best to achieve this, or other ideas to reduce the size? I am building the DB in ruby, and reading it on the phone in objective C.

Also is there any way to get stats on the DB so I can see what is using the most space?

+1  A: 

Your best bet is to use compression, which unfortunately SQLite does not support natively at this point. Luckily, someone took the time to develop a compression extension for it which could be what you need.

Otherwise I'd recommend storing your data mostly in compressed format and uncompressing on the fly.

Eran Galperin
the difficulty with an extension would be getting it loaded on the iphone side
frankodwyer
+1  A: 

The creator of SQLite sells a version of SQLite that includes database compression (and encryption). This would be perfect.

dicroce
A: 

Do I reckon correctly that you have about 450K words like that in your database ?

I've got no clue about iPhone, neither serious about sqlitem but... as long as sqlite does not allow for a way to save the file as gz right away (it maybe already does internally? no, does not look like that when you say it's about 135 b per entry. not even with both indexes), I would move away from the table approach, save it "manually" in a dictionary approach compression and build the rest on the fly and in memory. That should perform VERY well on your type of data.

Wait... Are you using that signature to allow for fulltextsearching or mistyping recogition ? Would full text search on sqlite not obsolete that field ?

lImbus
Yes, about 450K words. An in memory approach probably won't work as the raw file data is 18M and memory on the phone is pretty limited. Nor will fulltext search work as I need to run the queries shown - i.e. find all words matching a pattern, or having the same signature (for anagrams).
frankodwyer
+2  A: 

I'm not clear on all the use cases for the signature field but it seems like storing an alphabetized version of the word instead would be beneficial.

Robert Simmons
True, that might be smaller and also works for anagram searches (which is the only use case). I will try that.
frankodwyer
Good idea! The only enhancement I can think of for this method would be to replace sequences of 3 or more letters in a row with a number and the letter. For example if your signature for "mississippi" is "iiiimppssss" you could shorten it to "4impp4s"
Marc Novakowski
also a nice idea...plus it will probably work without changing my current iphone code at all (still would just look for equality). I will try that too. extra points for ruby code that computes that signature given a word :-)
frankodwyer
Please excuse the crudity of this solution as I don't really know Ruby: "mississippi".split( // ).sort.join.gsub(/(.)\1{2,}/) { |s| s.length.to_s + s[0,1] }
Robert Simmons
I've set up another question to explore this option a bit more. http://stackoverflow.com/questions/401834/how-to-elegantly-compute-the-anagram-signature-of-a-word-in-ruby
frankodwyer
I am highly interested to see the new file size after this optimization.
lImbus
I've updated the question to show the results so far.
frankodwyer
+2  A: 

Have you tried typing the "vacuum" command to make sure you don't have extra space in the db you forgot to reclame?

Jared
I haven't tried it, however as I never delete anything from the database (it is created in a single import from a text file), I wouldn't expect it to save any space. I will try it though.
frankodwyer
+1  A: 

As a text field, signature is currently using at least 26 * 8 bytes per entry (208 bytes) but if you were to pack the data into a bitfield, you could probably get away with only 3 bits per letter (reducing your maximum frequency per letter to 7). That would mean you could pack the entire signature in 26 * 3 bits = 78 bits = 10 bytes. Even if you used 4 bits per letter (for a maximum frequency of 15 per letter) you would only use 104 bits (13 bytes).

EDIT: After a bit more thought, I think 4 bits per letter (instead of 3) would be a better idea because it would make the binary math easier.

EDIT2: Reading through the docs on SQLite data types, it seems that you might be able to just make the "signature" field span 26 columns of type INTEGER and SQLite will do the right thing and only use as many bits as required to store the value.

Marc Novakowski
I think this approach would save the OP a lot of space. It will require the use of the BLOB data type in SQLite.
codelogic
Also, rounding up the 13 bytes to 16, this can even be split into two 8 byte integers, which SQLite supports, allowing the use of indexes on the tables. This approach would require the signature to be split into two though and the query would need to check equality on both.
codelogic
Yes I was thinking of something along these lines...each position in the signature field is 0..9 so could be packed into 4 bits. Coding the signature as a large binary integer I think would be smaller again.
frankodwyer
@codelogic, do you mean I can't index blobs?
frankodwyer
frankodwyer, I'm not sure about indexing blobs.
codelogic
A: 

As noted storing "Signature" more efficiently seems like a good idea.

However, it also seems like you could gain a ton of space savings by using some kind of lookup table for words - since you seem to be taking a root word and then appending "er", "ed", "es", etc why not have a column with a numeric ID that references a root word from a separate lookup table, and then a separate column with a numeric ID that references a table of common word suffixes that would be appended to the base word.

If there were any tricks around storing shorthand versions of signatures for multiple entries with a single root word, you could also employ those to reduce the size of stored signatures (not sure what algorithm is producing those values)

This also seems to make a lot of sense to me as you have the "word" column as a primary key, but do not even index it - just create a separate numeric column that is the primary ID for the table.

Kendall Helmstetter Gelner
Sounds reasonable however quite complex to implement.
frankodwyer
re the primary key, I understand it is automatically indexed in sqlite
frankodwyer
+1  A: 

mhmm... an iPhone... doesn't it have a permanent data connection ? I think this is where a webapplication/webservice can jump in snugly. Move most of your business logic to the webserver (he's gonna have real SQL with FTS and looooots of memory) and fetch that info online to the client on the device.

lImbus
That's true, but I want this to be able to work offline. Also it's just a free app and I don't want to maintain a server for it.
frankodwyer
+2  A: 

As a totally different approach, you could try using a bloom filter instead of a comprehensive database. Basically, a bloom filter consists of a bunch of hash functions, each of which is associated with a bitfield. For each legal word, each hash function is evaluated, and the corresponding bit in the corresponding bit field is set. Drawback is it's theoretically possible to get false positives, but those can be minimized/practically eliminated with enough hashes. Plus side is a huge space savings.

recursive
thanks - that is a really interesting idea and new to me. I am going to pursue the idea of sorting the letters and run length encoding first, but will also look at this approach. A post-processing step could be used to remove false positives (easily identified in this case).
frankodwyer
+2  A: 

Remove the indexes on sowpods and twl -- they are probably not helping your query times and are definitely taking lots of space.

You can get stats on the database using sqlite3_analyzer from the SQLite downloads page.

Doug Currie
thanks - dropping those indexes really helped.
frankodwyer
A: 

As mentioned elsewhere, lose the indexes on the boolean columns, they will almost certainly be slower (if used at all) than a table scan and are going to use space needlessly.

I'd consider applying a simple compression to the words, Huffman coding is pretty good for this sort of thing. Also, I'd look at the signatures: sort the columns in letter frequency order and don't bother storing trailing zeroes, which can be implied. I guess you could Huffman-encode those, too.

Always assuming your encoded strings don't upset SQLite, of course.

Mike Woodhouse