views:

2047

answers:

7

I have an SQLite database full of a huge number of URLs. It takes a huge amount of disk space, and accessing it causes many disk seeks and is slow. The average URL path length is 97 bytes (host names repeat a lot, so I moved them to a foreign-keyed table). Is there any good way of compressing the paths? Most compression algorithms work well with big documents, not "documents" of less than 100 bytes on average, but even a 20% reduction would be very useful. Are there any compression algorithms that would work? It doesn't have to be anything standard.

+8  A: 

Use the compress algorithm, but with a shared dictionary.

I've done something like this before where I used the LZC/LZW algorithm, as used by the Unix compress command.

The trick to get good compression with short strings is to use a dictionary made up of a standard sample of the URLs you are compressing.

You should easily get 20%.

Edit: LZC is a variant of LZW. Since you only need a static dictionary, plain LZW is all you require; LZC adds support for resetting the dictionary/table when it gets full.
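
For a rough sketch of the shared-dictionary principle (not LZW/LZC, but the same idea), Python's zlib exposes preset-dictionary support directly through the zdict argument since Python 3.3. The sample paths and helper names below are made up for illustration:

import zlib

# Build the dictionary from a representative sample of your URL paths;
# zlib favours matches near the end of the dictionary, so put the most
# common substrings last.
sample_paths = [
    b"/index.html", b"/article/2009/01/", b"/images/thumb/",
    b"/users/profile?id=", b"/search?q=", b"/static/css/style.css",
]
shared_dict = b"".join(sample_paths)

def compress_path(path):
    # Negative wbits = raw deflate stream, no header or checksum.
    c = zlib.compressobj(9, zlib.DEFLATED, -zlib.MAX_WBITS, zdict=shared_dict)
    return c.compress(path) + c.flush()

def decompress_path(blob):
    d = zlib.decompressobj(-zlib.MAX_WBITS, zdict=shared_dict)
    return d.decompress(blob) + d.flush()

url = b"/article/2009/01/some-long-slug.html?id=1234"
packed = compress_path(url)
assert decompress_path(packed) == url
print(len(url), "->", len(packed))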

ewalshe
I've combined compression with a hash table to convert a URL to a 32-bit id before, which takes much less space than the strings. You also don't need the URL table - you just store the id rather than a foreign key.
Pete Kirkham
Isn't that LZW?
Kezzer
@Kezzer. You are right. LZC is a variant of LZW.
ewalshe
+4  A: 

Have you considered using static Huffman coding?

You could use your existing body of URLs to compute Huffman codes over all bytes occurring in URLs according to their frequency. You can then store that set of codes once, and encode all URLs using it. I think it should give decent compression.
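
For illustration, here is a minimal sketch of building such a static table from a sample of URLs (decoding and storing the table are left out, and the names are made up). Every byte that can ever appear in a URL must occur in the sample, otherwise encoding will fail:

import heapq
from collections import Counter

def build_huffman_codes(sample_urls):
    # Count byte frequencies over a representative sample of URLs.
    freq = Counter()
    for url in sample_urls:
        freq.update(url)
    # Heap entries are (frequency, tiebreaker, tree); a tree is either a
    # byte value (leaf) or a (left, right) pair (internal node).
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, count, (a, b)))
        count += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"
    walk(heap[0][2], "")
    return codes

def huffman_encode(url, codes):
    bits = "".join(codes[b] for b in url)
    # Pad to whole bytes; a real scheme also stores the bit length so the
    # decoder knows where to stop.
    bits += "0" * (-len(bits) % 8)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

# Compute the table once from existing data and store it alongside the db.
codes = build_huffman_codes([b"/my/path/mypage.html?id=4", b"/images/logo.png"])
print(huffman_encode(b"/my/path/mypage.html?id=4", codes))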

bendin
+2  A: 

What's the format of your URLs?

If your URLs share one or more domains, and roughly 2 billion distinct domain names is enough for you, you can create a pool of domain names. And if you have shared relative paths, you can pool them too.

Split every URL in your database into three parts: the scheme and domain, e.g. http://mydomain.com; the relative path, e.g. /my/path/; and then the rest, e.g. mypage.html?id=4 (if you have query string parameters).

This way you should reduce the overhead of every domain and relative path to just about 8 bytes. That should be better, and fast if you want to look up parts of URLs.
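
As a rough sketch of that split (exactly where the relative path ends and the "rest" begins is a design choice, and the function name is made up):

from urllib.parse import urlsplit
import posixpath

def split_url(url):
    parts = urlsplit(url)
    site = "%s://%s" % (parts.scheme, parts.netloc)  # e.g. http://mydomain.com
    directory = posixpath.dirname(parts.path) + "/"  # e.g. /my/path/
    rest = posixpath.basename(parts.path)            # e.g. mypage.html
    if parts.query:
        rest += "?" + parts.query                    # e.g. mypage.html?id=4
    return site, directory, rest

# Each distinct site and directory gets a row in its own table, so a URL
# row shrinks to two small integer foreign keys plus a short "rest" string.
print(split_url("http://mydomain.com/my/path/mypage.html?id=4"))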

Note: the "http" scheme string alone is 4 bytes; you'll save anything beyond that on every domain entry. If every URL starts with "http://www." you'll save the "://www." part, 7 bytes, each time.

Experiment a bit with how to split and structure URLs; I'm betting this is where you'll find your compression. Now, what could you do with the remaining string that is not a common domain or relative path?

Compressing URLs

General-purpose compression methods like these derive from arithmetic coding; Shannon, the father of information theory, laid the theoretical groundwork in 1948. I've been working with compression for a while, and the one thing I've always found is that general-purpose compression never solves the actual problem.

You're in luck, because URLs have structure, and you should exploit that structure to store your URLs more efficiently.

If you want to apply a compression algorithm (I think the topic should be changed to reflect URL compression, because it's domain-specific) you'll have to examine the entropy of your data, because it tells you something about the storage savings you can expect. URLs are ASCII characters; any character outside the ASCII range 0x20-0x7E won't occur, and throwing away case sensitivity, you're down to a mere 63 distinct states: !"#%&'()*+,-./0123456789:;<=>?@abcdefghijklmnopqrstuvwxyz{|}~ including white-space.

You could create a frequency table of the remaining characters and perform arithmetic encoding. You know that you'll need at most 6 bits per character, which means that for every character in your URL database you're wasting 2 bits right now; if you just shifted things into place and used a lookup table, you'd get your 20% compression. Just like that ;)
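
A minimal sketch of that 6-bit repacking (the alphabet below is an assumption for illustration; it must cover every character that can appear in your lowercased paths):

import string

# Lower-case letters, digits, and common URL punctuation: 6 bits is
# enough to index this whole alphabet.
ALPHABET = string.ascii_lowercase + string.digits + "-._~:/?#[]@!$&'()*+,;=%"
assert len(ALPHABET) <= 64
ENCODE = {c: i for i, c in enumerate(ALPHABET)}

def pack6(path):
    # Fold the path into one big integer, 6 bits per character.
    bits = 0
    for ch in path:
        bits = (bits << 6) | ENCODE[ch]
    nbytes = (6 * len(path) + 7) // 8
    # Keep the character count so trailing padding bits are unambiguous.
    return len(path), bits.to_bytes(nbytes, "big")

def unpack6(count, blob):
    bits = int.from_bytes(blob, "big")
    return "".join(ALPHABET[(bits >> (6 * (count - 1 - i))) & 0x3F]
                   for i in range(count))

n, packed = pack6("/my/path/mypage.html?id=4")
assert unpack6(n, packed) == "/my/path/mypage.html?id=4"
print("%d chars -> %d bytes" % (n, len(packed)))  # 25 chars -> 19 bytes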

Because the data is so specific, it's really not a good idea to compress it with general-purpose methods. It's better to structure the information and split it into pieces of data you can store more efficiently. You know a lot about the domain; use that knowledge to compress your data.

John Leidegren
As I said, I'm already storing the path only (the http://host.com/ part is foreign-keyed), and URLs are very much case-sensitive, so that won't work. Saving 1 bit in every 8 is an obvious optimization; I just wondered if there's anything better than that.
taw
According to some RFC, URLs should not be case-sensitive; if you need case sensitivity, then handle that. A frequency analysis from which you build a static table for arithmetic encoding is a good start. You need to work at the bit level; break up the path if you can. Look for patterns in your db.
John Leidegren
A: 

Is that 97 bytes, or 97 8-bit ASCII characters, or 97 16-bit Unicode characters?

Assuming that all your URLs are legal URLs as per http://www.w3.org/Addressing/URL/url-spec.txt, then you should have only ASCII characters.

If it's 97 16-bit Unicode characters, simply storing the lower byte of each character will automatically give you 50% savings.

If it's 97 8-bit characters, notice that you only need 7 bits each. You can simply pass 7 bits at a time into a bitstream and store that bitstream in your database; use some older 7-bit transmission encoding; or come up with your own ad hoc way of storing every 8th character's bits in the high bits of the previous 7 characters.
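
As a sketch of that last, ad hoc variant (the function names are made up): every ASCII byte has a spare high bit, so each run of 8 characters can be folded into 7 bytes.

def pack_ascii(s):
    data = bytearray(s.encode("ascii"))
    out = bytearray()
    for i in range(0, len(data), 8):
        chunk = data[i:i + 8]
        if len(chunk) == 8:
            extra = chunk.pop()  # the 8th character (7 significant bits)
            for j in range(7):
                # put bit j of the 8th character into the spare high bit
                # of the j-th character of the group
                chunk[j] |= ((extra >> j) & 1) << 7
        out.extend(chunk)
    return len(data), bytes(out)

def unpack_ascii(count, blob):
    data = bytearray(blob)
    out = bytearray()
    i = 0
    while i < len(data):
        chunk = data[i:i + 7]
        if count - len(out) >= 8 and len(chunk) == 7:
            extra = 0
            for j in range(7):
                extra |= ((chunk[j] >> 7) & 1) << j
                chunk[j] &= 0x7F
            out.extend(chunk)
            out.append(extra)
        else:
            out.extend(chunk)  # trailing partial group, stored as-is
        i += 7
    return out[:count].decode("ascii")

n, packed = pack_ascii("/my/path/mypage.html?id=4")
assert unpack_ascii(n, packed) == "/my/path/mypage.html?id=4"
print("%d bytes -> %d bytes" % (n, len(packed)))  # 25 bytes -> 22 bytes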

Ants
A: 

How do you use the URL table?

Do you usually do a "range scan" or unique id lookup only?

If you won't do something like WHERE url LIKE "/xxxx/question/%", you could use a hashed index rather than a b-tree index on the varchar() to reduce the number of disk seeks.
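
SQLite itself only builds b-tree indexes, so a rough way to approximate a hashed index is to store a precomputed integer hash in its own indexed column and verify the full string on lookup (an index on a 4-byte integer is much smaller than one on ~100-byte strings). The schema and names below are made up for illustration:

import sqlite3
import zlib

def url_hash(path):
    # crc32 is fine as a cheap, non-cryptographic 32-bit bucketing hash.
    return zlib.crc32(path.encode("utf-8")) & 0xFFFFFFFF

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE urls (id INTEGER PRIMARY KEY, hash INTEGER, path TEXT)")
conn.execute("CREATE INDEX idx_urls_hash ON urls(hash)")

def insert_url(path):
    conn.execute("INSERT INTO urls (hash, path) VALUES (?, ?)",
                 (url_hash(path), path))

def find_url_id(path):
    # Look up by hash, then check the full path to weed out collisions.
    row = conn.execute("SELECT id FROM urls WHERE hash = ? AND path = ?",
                       (url_hash(path), path)).fetchone()
    return row[0] if row else None

insert_url("/my/path/mypage.html?id=4")
print(find_url_id("/my/path/mypage.html?id=4"))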

Dennis Cheung
+2  A: 

Abstract:

A common problem for large-scale search engines and web spiders is how to handle the huge number of URLs they encounter. Traditional search engines and web spiders use the hard disk to store URLs without any compression, which results in slow performance and greater space requirements. This paper describes a simple URL compression algorithm allowing efficient compression and decompression. The compression algorithm is based on a delta encoding scheme to extract URLs sharing common prefixes and an AVL tree to get efficient search speed. Our results show that a 50% size reduction is achieved.

-- Kasom Koht-arsa Department of Computer Engineering.

Resource
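
The paper's exact scheme (an AVL tree plus delta encoding) isn't reproduced here, but a rough sketch of the underlying idea, storing each sorted URL as the length of the prefix it shares with its predecessor plus the remaining suffix, looks like this:

def front_encode(urls):
    urls = sorted(urls)
    prev = ""
    encoded = []
    for u in urls:
        # length of the prefix shared with the previous (sorted) entry
        n = 0
        while n < min(len(prev), len(u)) and prev[n] == u[n]:
            n += 1
        encoded.append((n, u[n:]))
        prev = u
    return encoded

def front_decode(encoded):
    prev = ""
    out = []
    for n, suffix in encoded:
        prev = prev[:n] + suffix
        out.append(prev)
    return out

urls = ["http://a.com/articles/1", "http://a.com/articles/2", "http://a.com/images/x.png"]
assert front_decode(front_encode(urls)) == sorted(urls)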

Lukas Šalkauskas
The solution presented in the paper doesn't seem to extend to cases where the URLs don't fit in memory, and they won't, even with very high compression ratios. Also, their data (avg 55 bytes per entire URL) seems very different from mine (avg 97 bytes for just the path part).
taw
I mentioned it because it might give you some good ideas.
Lukas Šalkauskas
+1  A: 

I've tried this using the following strategy. It uses a shared dictionary, but works around the fact that Python's zlib doesn't give you access to the dictionary itself.

First, initialize a pre-trained compressor and decompressor by running a bunch of training strings through them. Throw away the output strings.

Then, use copies of the trained compressor to compress every small string, and use copies of the decompressor to decompress them.

Here is the Python code (apologies for the ugly testing):

import zlib
class Trained_short_string_compressor(object):
    def __init__(self,
                 training_set, 
                 bits = -zlib.MAX_WBITS,
                 compression = zlib.Z_DEFAULT_COMPRESSION,
                 scheme = zlib.DEFLATED):
        # Use a negative number of bits, so the checksum is not included.
        compressor = zlib.compressobj(compression,scheme,bits)
        decompressor = zlib.decompressobj(bits)
        junk_offset = 0
        for line in training_set:
            junk_offset += len(line)
            # run the training line through the compressor and decompressor
            junk_offset -= len(decompressor.decompress(compressor.compress(line)))

        # use Z_SYNC_FLUSH. A full flush seems to detrain the compressor, and 
        # not flushing wastes space.
        junk_offset -= len(decompressor.decompress(compressor.flush(zlib.Z_SYNC_FLUSH)))

        self.junk_offset = junk_offset
        self.compressor = compressor
        self.decompressor = decompressor

    def compress(self,s):
        compressor = self.compressor.copy()
        return compressor.compress(s)+compressor.flush()

    def decompress(self,s):
        decompressor = self.decompressor.copy()
        return (decompressor.decompress(s)+decompressor.flush())[self.junk_offset:]

Testing it, I saved over 30% on a bunch of 10,000 shortish (50-300 character) Unicode strings. It also took about 6 seconds to compress and decompress them (compared to about 2 seconds using simple zlib compression/decompression). On the other hand, simple zlib compression saved only about 5%, not 30%.

import gzip

def test_compress_small_strings():
    # fname is assumed to point at a gzipped file of sample lines.
    lines = [l for l in gzip.open(fname)]
    compressor = Trained_short_string_compressor(lines[:500])

    import time
    t = time.time()
    s = 0.0
    sc = 0.0
    for i in range(10000):
        line = lines[1000+i] # use an offset, so you don't cheat and compress the training set
        cl = compressor.compress(line)
        ucl = compressor.decompress(cl)
        s += len(line)
        sc += len(cl)
        assert line == ucl

    print 'compressed',i,'small strings in',time.time()-t,'with a ratio of',sc/s
    print 'now, compare it to a naive compression'
    t = time.time()
    sc = 0.0  # reset, so the naive ratio is not mixed with the trained one
    for i in range(10000):
        line = lines[1000+i]
        cr = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION,zlib.DEFLATED,-zlib.MAX_WBITS)
        cl = cr.compress(line)+cr.flush()
        ucl = zlib.decompress(cl,-zlib.MAX_WBITS)
        sc += len(cl)
        assert line == ucl

    print 'naive zlib compressed',i,'small strings in',time.time()-t,'with a ratio of',sc/s

Note that the compressor isn't persistent: if you delete it, it's gone. If you want persistence, you have to keep the training set so it can be rebuilt.

wisty