Are very large TextProperties a burden? Should they be compressed?

Say I have information stored in 2 attributes of type TextProperty in my datastore entities. The strings are always the same length, 65,000 characters, and contain lots of repeating integers, a sample appearing as follows:

entity.pixel_idx   = 0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,5,5,5,5,5,5,5,5,5,5,5,5....etc.
entity.pixel_color = 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,...etc.

The strings above could also be represented using much less storage by compressing them, say by storing only each integer and the length of its run ('0,8' for '0,0,0,0,0,0,0,0'), but then it takes time and CPU to compress and decompress. Any general ideas? Are there some tricks for testing different approaches to the problem?
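
For illustration, the run-length idea I mean would look something like this (a minimal sketch; the helper names and the ';' run separator are made up):

def encode_rle(s):
  # Collapse a comma-separated string into value,run-length pairs,
  # e.g. '0,0,0,0,0,0,0,0' becomes '0,8'.
  values = s.split(',')
  runs = []
  prev, count = values[0], 1
  for v in values[1:]:
    if v == prev:
      count += 1
    else:
      runs.append('%s,%d' % (prev, count))
      prev, count = v, 1
  runs.append('%s,%d' % (prev, count))
  return ';'.join(runs)

def decode_rle(s):
  # Expand the value,run-length pairs back to the original string.
  parts = []
  for run in s.split(';'):
    value, count = run.split(',')
    parts.extend([value] * int(count))
  return ','.join(parts)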

A: 

I think this should be pretty easy to test. Just create 2 handlers, one that compresses the data and one that doesn't, and record how much CPU each one uses (using the appstats package for whichever language you are developing with). You should also create 2 entity types, one for the compressed data and one for the uncompressed.

Load in a few hundred thousand or a million entities (using the task queue perhaps). Then you can check the disk space usage in the administrator's console and see how much each entity type uses. If the data is compressed internally by App Engine, you shouldn't see much difference in the space used (unless their compression is significantly better than yours). If it is not compressed, there should be a stark difference.
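
A rough sketch of what those two kinds and handlers could look like (Python db API; the class and handler names here are made up for illustration):

import zlib

from google.appengine.ext import db, webapp

class UncompressedPixels(db.Model):
  pixel_idx = db.TextProperty()

class CompressedPixels(db.Model):
  pixel_idx = db.BlobProperty()  # compressed bytes, so Blob rather than Text

class StoreUncompressed(webapp.RequestHandler):
  def post(self):
    UncompressedPixels(pixel_idx=self.request.get('data')).put()

class StoreCompressed(webapp.RequestHandler):
  def post(self):
    raw = self.request.get('data').encode('utf-8')
    CompressedPixels(pixel_idx=db.Blob(zlib.compress(raw))).put()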

Of course, you may want to hold off on this type of testing until you know for sure that these entities will account for a significant portion of your quota usage and/or your page load time.

Alternatively, you could wait for Nick or Alex to pop in and they could probably tell you whether the data is compressed in the datastore or not.

Peter Recore
Thanks, I will start to play with appstats.
indiehacker
+3  A: 

If all of your integers are single-digit numbers (as in your example), then you can cut your storage space in half by simply omitting the commas.
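
For example (plain Python; this works because every value is a single character):

s = '0,0,0,0,1,1,5,5'
packed = s.replace(',', '')   # '00001155' - half the characters
assert ','.join(packed) == s  # the commas are trivially restored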

The Short Answer

If you expect to have a lot of repetition, then compressing your data makes sense: your data is not so small (65K) and is highly repetitive, so it will compress well. This will save you storage space and reduce how long it takes to transfer the data back from the datastore when you query for it.

The Long Answer

I did a little testing starting with the short example string you provided and that same string repeated to 65000 characters (perhaps more repetitive than your actual data). This string compressed from 65K to a few hundred bytes; you may want to do some additional testing based on how well your data actually compresses.
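
Roughly how such a check looks (a sketch; the sample string is taken from the question and repeated out to 65,000 characters):

import zlib

sample = '0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,5,5,5,5,5,5,5,5,5,5,5,5,'
data = (sample * (65000 // len(sample) + 1))[:65000]
compressed = zlib.compress(data, 6)      # level 6, zlib's default
print len(data), '->', len(compressed)   # the highly repetitive input shrinks to a few hundred bytes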

Anyway, the test shows significant savings when using compressed data versus uncompressed data (at least for the above test, where compression works really well!). In particular, for compressed data:

  • API time is about 10x lower for a single entity (41ms versus 387ms on average)
  • Storage used is significantly less (so it doesn't look like GAE is doing any compression on your data).
  • Unexpectedly, CPU time is about 50% less (130ms versus 180ms when fetching 100 entities). I expected CPU time to be a little worse since the compressed data has to be decompressed. There must be other CPU work (like decoding the protocol buffer) that costs even more for the much larger uncompressed data.
  • These differences mean wall clock time is also significantly faster for the compressed version (<100ms versus 426ms when fetching 100 entities).

To make it easier to take advantage of compression, I wrote a custom CompressedDataProperty which handles all of the compressing/decompressing business so you don't have to worry about it (I used it in the above tests too). You can get the source from the above link, but I've also included it here since I wrote it for this answer:

from google.appengine.ext import db
import zlib
class CompressedDataProperty(db.Property):
  """A property for storing compressed data or text.

  Example usage:

  >>> class CompressedDataModel(db.Model):
  ...   ct = CompressedDataProperty()

  You create a compressed data property, simply specifying the data or text:

  >>> model = CompressedDataModel(ct='example uses text too short to compress well')
  >>> model.ct
  'example uses text too short to compress well'
  >>> model.ct = 'green'
  >>> model.ct
  'green'
  >>> model.put() # doctest: +ELLIPSIS
  datastore_types.Key.from_path(u'CompressedDataModel', ...)

  >>> model2 = CompressedDataModel.all().get()
  >>> model2.ct
  'green'

  Compressed data is not indexed and therefore cannot be filtered on:

  >>> CompressedDataModel.gql("WHERE ct = :1", 'green').count()
  0
  """
  data_type = db.Blob

  def __init__(self, level=6, *args, **kwargs):
    """Constructor.

    Args:
      level: Controls the level of zlib's compression (between 1 and 9).
    """
    super(CompressedDataProperty, self).__init__(*args, **kwargs)
    self.level = level

  def get_value_for_datastore(self, model_instance):
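    # Compress the in-memory value just before it is written to the datastore.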
    value = self.__get__(model_instance, model_instance.__class__)
    if value is not None:
      return db.Blob(zlib.compress(value, self.level))

  def make_value_from_datastore(self, value):
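    # Decompress the stored bytes as they are read back from the datastore.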
    if value is not None:
      return zlib.decompress(value)
David Underhill
Nice job. Of course, your results naturally spark a new question in my mind - at what size does compressing stop being worthwhile? It looks like it might still be worth compressing things even as "small" as 10k.
Peter Recore
Thanks :). Unfortunately, I think the turning point where compression becomes useful depends heavily on your data. After all, you could have even a megabyte of data and get very little from compressing it - e.g., trying to compress `os.urandom(2**20)` will probably result in something bigger than the input. Thankfully, I think long passages composed by humans tend to compress better (e.g., my answer compresses to about 40% of its original size).
David Underhill
Thanks for the informative answer, the code, and the nice surprise regarding lower CPU time.
indiehacker