views:

198

answers:

1

2 possible ways of persisting large strings in the Google Datastore are Text and Blob properties.

From a storage consumption perspective which of the 2 is recommended? Same question from a protobuf serialization and deserialization perspective.

Thanks, Keyur

+2  A: 

There is no significant performance difference between the two - just use whichever one best fits your data. BlobProperty should be used to store binary data (e.g., str objects) while TextProperty should be used to store any textual data (e.g., unicode or str objects). Note that if you store a str in a TextProperty, it must only contain ASCII bytes (less than hex 80 or decimal 128) (unlike BlobProperty).

Both of these properties are derived from UnindexedProperty as you can see in the source.

Here is a sample app which demonstrates that there is no difference in storage overhead for these ASCII or UTF-8 strings:

import struct

from google.appengine.ext import db, webapp
from google.appengine.ext.webapp.util import run_wsgi_app

class TestB(db.Model):
    v = db.BlobProperty(required=False)

class TestT(db.Model):
    v = db.TextProperty(required=False)

class MainPage(webapp.RequestHandler):
    def get(self):
        self.response.headers['Content-Type'] = 'text/plain'

        # try simple ASCII data and a bytestring with non-ASCII bytes
        ascii_str = ''.join([struct.pack('>B', i) for i in xrange(128)])
        arbitrary_str = ''.join([struct.pack('>2B', 0xC2, 0x80+i) for i in xrange(64)])
        u = unicode(arbitrary_str, 'utf-8')

        t = [TestT(v=ascii_str), TestT(v=ascii_str*1000), TestT(v=u*1000)]
        b = [TestB(v=ascii_str), TestB(v=ascii_str*1000), TestB(v=arbitrary_str*1000)]

        # demonstrate error cases
        try:
            err = TestT(v=arbitrary_str)
            assert False, "should have caused an error: can't store non-ascii bytes in a Text"
        except UnicodeDecodeError:
            pass
        try:
            err = TestB(v=u)
            assert False, "should have caused an error: can't store unicode in a Blob"
        except db.BadValueError:
            pass

        # determine the serialized size of each model (note: no keys assigned)
        fEncodedSz = lambda o : len(db.model_to_protobuf(o).Encode())
        sz_t = tuple([fEncodedSz(x) for x in t])
        sz_b = tuple([fEncodedSz(x) for x in b])

        # output the results
        self.response.out.write("text:   1=>%dB  2=>%dB  3=>%dB\n" % sz_t)
        self.response.out.write("blob:   1=>%dB  2=>%dB  3=>%dB\n" % sz_b)

application = webapp.WSGIApplication([('/', MainPage)])
def main(): run_wsgi_app(application)
if __name__ == '__main__': main()

And here is the output:

text:   1=>172B  2=>128047B  3=>128047B
blob:   1=>172B  2=>128047B  3=>128047B
David Underhill
I wasn't aware that Text properties can only contain ASCII bytes. That realization answers my question. Thanks.
Keyur
That's not true - text properties store unicode. But if you assign a byte ('raw') string (type 'str') to a text property, it will attempt to cast to unicode, which uses the system default encoding, which is ASCII. You need to decode strings explicitly if you want to do otherwise.
Nick Johnson
Thanks Nick. I was trying to say that `TextProperty` cannot store `str` objects which contain non-ASCII bytes but (as you pointed out) my comment didn't make that clear so I've deleted it.
David Underhill