Are very large TextProperties a burden? Should they be compressed?

Say I have information stored in 2 attributes of type TextProperty in my datastore entities. The strings are always the same length, 65,000 characters, and contain lots of repeating integers, a sample appearing as follows:

entity.pixel_idx   = 0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,5,5,5,5,5,5,5,5,5,5,5,5....etc.
entity.pixel_color = 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,...etc.

The strings above could also be represented using much less storage by compressing them, say by storing only each integer and the length of its run ('0,8' for '0,0,0,0,0,0,0,0'), but then it takes time and CPU to compress and decompress. Any general ideas? Are there some tricks for testing different approaches to the problem?
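
For illustration, the run-length idea I mean would look something like this (a minimal sketch; the helper names and the ';' run separator are made up):

def encode_rle(s):
  # Collapse a comma-separated string into value,run-length pairs,
  # e.g. '0,0,0,0,0,0,0,0' becomes '0,8'.
  values = s.split(',')
  runs = []
  prev, count = values[0], 1
  for v in values[1:]:
    if v == prev:
      count += 1
    else:
      runs.append('%s,%d' % (prev, count))
      prev, count = v, 1
  runs.append('%s,%d' % (prev, count))
  return ';'.join(runs)

def decode_rle(s):
  # Expand the value,run-length pairs back to the original string.
  parts = []
  for run in s.split(';'):
    value, count = run.split(',')
    parts.extend([value] * int(count))
  return ','.join(parts)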

A: 

I think this should be pretty easy to test. Just create 2 handlers, one that compresses the data and one that doesn't, and record how much CPU each one uses (using the appstats package for whichever language you are developing with). You should also create 2 entity types, one for the compressed data and one for the uncompressed.

Load in a few hundred thousand or a million entities (using the task queue perhaps). Then you can check the disk space usage in the administrator's console and see how much each entity type uses. If the data is compressed internally by App Engine, you shouldn't see much difference in the space used (unless their compression is significantly better than yours). If it is not compressed, there should be a stark difference.
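
A rough sketch of what those two kinds and handlers could look like (Python db API; the class and handler names here are made up for illustration):

import zlib

from google.appengine.ext import db, webapp

class UncompressedPixels(db.Model):
  pixel_idx = db.TextProperty()

class CompressedPixels(db.Model):
  pixel_idx = db.BlobProperty()  # compressed bytes, so Blob rather than Text

class StoreUncompressed(webapp.RequestHandler):
  def post(self):
    UncompressedPixels(pixel_idx=self.request.get('data')).put()

class StoreCompressed(webapp.RequestHandler):
  def post(self):
    raw = self.request.get('data').encode('utf-8')
    CompressedPixels(pixel_idx=db.Blob(zlib.compress(raw))).put()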

Of course, you may want to hold off on this type of testing until you know for sure that these entities will account for a significant portion of your quota usage and/or your page load time.

Alternatively, you could wait for Nick or Alex to pop in and they could probably tell you whether the data is compressed in the datastore or not.

Peter Recore
Thanks, I will start to play with appstats.
indiehacker
+3  A: 

If all of your integers are single-digit numbers (as in your example), then you can cut your storage space in half by simply omitting the commas.
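
For example (plain Python; this works because every value is a single character):

s = '0,0,0,0,1,1,5,5'
packed = s.replace(',', '')   # '00001155' - half the characters
assert ','.join(packed) == s  # the commas are trivially restored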

The Short Answer

If you expect to have a lot of repetition, then compressing your data makes sense: your data is not so small (65K) and is highly repetitive, so it will compress well. This will save you storage space and reduce how long it takes to transfer the data back from the datastore when you query for it.

The Long Answer

I did a little testing starting with the short example string you provided and that same string repeated to 65000 characters (perhaps more repetitive than your actual data). This string compressed from 65K to a few hundred bytes; you may want to do some additional testing based on how well your data actually compresses.
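
Roughly how such a check looks (a sketch; the sample string is taken from the question and repeated out to 65,000 characters):

import zlib

sample = '0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,5,5,5,5,5,5,5,5,5,5,5,5,'
data = (sample * (65000 // len(sample) + 1))[:65000]
compressed = zlib.compress(data, 6)      # level 6, zlib's default
print len(data), '->', len(compressed)   # the highly repetitive input shrinks to a few hundred bytes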

Anyway, the test shows significant savings when using compressed data versus uncompressed data (at least for the above test, where compression works really well!). In particular, for compressed data:

  • API time is about 10x lower for a single entity (41ms versus 387ms on average)
  • Storage used is significantly less (so it doesn't look like GAE is doing any compression on your data).
  • Unexpectedly, CPU time is about 50% less (130ms versus 180ms when fetching 100 entities). I expected CPU time to be a little worse since the compressed data has to be decompressed. There must be other CPU work (like decoding the protocol buffer) that costs even more for the much larger uncompressed data.
  • These differences mean wall clock time is also significantly faster for the compressed version (<100ms versus 426ms when fetching 100 entities).

To make it easier to take advantage of compression, I wrote a custom CompressedDataProperty which handles all of the compressing/decompressing business so you don't have to worry about it (I used it in the above tests too). You can get the source from the above link, but I've also included it here since I wrote it for this answer:

from google.appengine.ext import db
import zlib
class CompressedDataProperty(db.Property):
  """A property for storing compressed data or text.

  Example usage:

  >>> class CompressedDataModel(db.Model):
  ...   ct = CompressedDataProperty()

  You create a compressed data property, simply specifying the data or text:

  >>> model = CompressedDataModel(ct='example uses text too short to compress well')
  >>> model.ct
  'example uses text too short to compress well'
  >>> model.ct = 'green'
  >>> model.ct
  'green'
  >>> model.put() # doctest: +ELLIPSIS
  datastore_types.Key.from_path(u'CompressedDataModel', ...)

  >>> model2 = CompressedDataModel.all().get()
  >>> model2.ct
  'green'

  Compressed data is not indexed and therefore cannot be filtered on:

  >>> CompressedDataModel.gql("WHERE ct = :1", 'green').count()
  0
  """
  data_type = db.Blob

  def __init__(self, level=6, *args, **kwargs):
    """Constructor.

    Args:
      level: Controls the level of zlib's compression (between 1 and 9).
    """
    super(CompressedDataProperty, self).__init__(*args, **kwargs)
    self.level = level

  def get_value_for_datastore(self, model_instance):
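    # Compress the in-memory value just before it is written to the datastore.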
    value = self.__get__(model_instance, model_instance.__class__)
    if value is not None:
      return db.Blob(zlib.compress(value, self.level))

  def make_value_from_datastore(self, value):
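    # Decompress the stored bytes as they are read back from the datastore.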
    if value is not None:
      return zlib.decompress(value)
David Underhill
Nice job. Of course, your results naturally spark a new question in my mind - at what size does compressing stop being worthwhile? It looks like it might still be worth compressing things even as "small" as 10k.
Peter Recore
Thanks :). Unfortunately, I think the turning point where compression becomes useful depends heavily on your data. After all, you could have even a megabyte of data and get very little from compressing it - e.g., trying to compress `os.urandom(2**20)` will probably result in something bigger than the input. Thankfully, I think long passages composed by humans tend to compress better (e.g., my answer compresses to about 40% of its original size).
David Underhill
Thanks for the informative answer, the code, and the nice surprise regarding lower CPU time.
indiehacker