I recently created a script that parses several web proxy logs into a tidy sqlite3 db file, and it is working great for me, with one snag: the file size. I have been pressed to use this format (a sqlite3 db), and Python handles it natively like a champ, so my question is this: what is the best form of string compression that I can use for db entries when file size is the sole concern? zlib? base-n? Klingon?

Any advice would help me loads. Again, I only need string compression for characters that are valid in URLs.

A: 

Here is a page with an SQLite extension to provide compression.

This extension provides a function that can be called on individual fields.

Here is some of the example text from the page

create a test table

sqlite> create table test(name varchar(20),surname varchar(20));

insert some compressed text into the test table; you can also compress binary content and insert it into a blob field

sqlite> insert into test values(mycompress('This is a sample text'),mycompress('This is a sample text'));

this shows nothing because our data is in binary format and compressed

sqlite> select * from test;

the following works, because it uncompresses the data

sqlite> select myuncompress(name),myuncompress(surname) from test;
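Since the question mentions Python, the same idea can be sketched without a loadable extension by registering Python functions on the connection. This is only a rough equivalent under some assumptions: the names mycompress and myuncompress are borrowed from the example above, zlib is my choice of codec (the extension may use something else), and functions registered this way exist only on the connection that registered them, so the sqlite3 command-line shell will not see them.

```python
import sqlite3
import zlib


def mycompress(text):
    # Encode the string and compress it; the returned bytes are stored as a BLOB.
    return zlib.compress(text.encode("utf-8"), 9)


def myuncompress(blob):
    # Reverse of mycompress: decompress the BLOB and decode it back to a string.
    return zlib.decompress(blob).decode("utf-8")


conn = sqlite3.connect(":memory:")
# Register both helpers as one-argument SQL functions on this connection only.
conn.create_function("mycompress", 1, mycompress)
conn.create_function("myuncompress", 1, myuncompress)

conn.execute("create table test(name varchar(20), surname varchar(20))")
conn.execute(
    "insert into test values (mycompress(?), mycompress(?))",
    ("This is a sample text", "This is a sample text"),
)

row = conn.execute(
    "select myuncompress(name), myuncompress(surname) from test"
).fetchone()
print(row)
```

Note that zlib adds a few bytes of header and checksum to every value, so very short strings can come out larger than they went in.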

Philip T.
A: 

What sort of parsing do you do before you put it in the database? I get the impression that it is fairly simple, with a single table holding each entry. If not, then my apologies.

Compression is all about removing duplication, and in a log file most of the duplication is between entries rather than within each entry, so compressing each entry individually is not going to be a huge win.
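A quick way to check this claim is to compress the same entries one at a time and then all together. The URLs below are made up for illustration; real proxy logs will give different numbers, so treat this as a sketch of the measurement rather than a benchmark.

```python
import zlib

# Hypothetical log entries: similar URLs that differ only in a small part.
entries = [
    f"http://example.com/articles/{i}?ref=home".encode("ascii")
    for i in range(1000)
]

# Total size of the uncompressed entries.
raw_size = sum(len(e) for e in entries)

# Each entry compressed on its own, as a per-field compression function would do.
per_entry_size = sum(len(zlib.compress(e, 9)) for e in entries)

# All entries compressed as one stream, so repetition between entries is exploited.
batch_size = len(zlib.compress(b"\n".join(entries), 9))

print("raw:", raw_size)
print("per entry:", per_entry_size)
print("one batch:", batch_size)
```

The per-entry total carries the fixed zlib overhead once per row and cannot reuse anything seen in earlier rows, which is why the single-stream figure comes out much smaller for data like this.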

This is off the top of my head, so feel free to shoot it down in flames, but I would consider breaking the table into a set of smaller tables holding the individual parts of the entry. A log entry would then mostly consist of a timestamp (as DATE type rather than a string) plus a set of indexes into other tables (e.g. requesting IP, request type, requested URL, browser type, etc.).

This would have a trade-off of course, since it would make the database a lot more complex to maintain, but on the other hand it would enable meaningful queries such as "show me all the unique IPs that requested page X in the last week".
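A minimal sketch of that layout using Python's sqlite3 module follows. The table names, column names, and the lookup_id and add_entry helpers are all hypothetical, and only two lookup tables are shown. One assumption worth flagging: SQLite has no dedicated date storage class, so the timestamp here is stored as an integer Unix time, which is also more compact than a date string.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ip  (id INTEGER PRIMARY KEY, addr TEXT UNIQUE);
    CREATE TABLE url (id INTEGER PRIMARY KEY, url  TEXT UNIQUE);
    CREATE TABLE entry (
        ts     INTEGER,                    -- Unix time in seconds
        ip_id  INTEGER REFERENCES ip(id),
        url_id INTEGER REFERENCES url(id)
    );
""")


def lookup_id(table, column, value):
    # Store each distinct value once and return its row id.
    # table and column are fixed names from this script, never user input.
    conn.execute(f"INSERT OR IGNORE INTO {table} ({column}) VALUES (?)", (value,))
    return conn.execute(
        f"SELECT id FROM {table} WHERE {column} = ?", (value,)
    ).fetchone()[0]


def add_entry(ts, addr, url):
    # A log entry is just a timestamp plus small integer references.
    conn.execute(
        "INSERT INTO entry VALUES (?, ?, ?)",
        (ts, lookup_id("ip", "addr", addr), lookup_id("url", "url", url)),
    )


now = int(time.time())
add_entry(now - 3600, "10.0.0.1", "http://example.com/x")
add_entry(now - 7200, "10.0.0.1", "http://example.com/x")
add_entry(now - 60, "10.0.0.2", "http://example.com/x")
add_entry(now - 30 * 86400, "10.0.0.3", "http://example.com/x")  # older than a week
add_entry(now - 60, "10.0.0.4", "http://example.com/y")          # different page

# "Show me all the unique IPs that requested page X in the last week."
week_ago = now - 7 * 86400
rows = conn.execute("""
    SELECT DISTINCT ip.addr
    FROM entry
    JOIN ip  ON ip.id  = entry.ip_id
    JOIN url ON url.id = entry.url_id
    WHERE url.url = ? AND entry.ts >= ?
    ORDER BY ip.addr
""", ("http://example.com/x", week_ago)).fetchall()
unique_ips = [r[0] for r in rows]
print(unique_ips)
```

Each repeated IP or URL is stored as text only once, and every further occurrence costs a small integer, which is where the space saving would come from.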

Dave Kirby
A: 

Instead of inserting compression/decompression code into your program, you could store the table itself on a compressed drive.

Mark Byers