I'm aware that serializing is used to convert data types into a storable format, for purposes such as caching.

What I'm more specifically asking is, what are the circumstances in which you should actually decide to store data ( using serialize() in PHP, pickle module in Python, et cetera )?

Let's say we had a high traffic website, and in our /blog page we are using static content xml files, a gettext mo file, and dynamically generated content from a database.

Example #1:

The file we rely on for static content is en/blog.xml:

'<content><![CDATA[
<h1>Welcome to my blog!</h1>
<p>Lorem ipsum dolor sit amet..</p>

]]></content>'

Would we want to serialize this xml file itself and store it in cache?

Example #2:

We also have a dynamically generated form. Normally I would assume nothing needs to be serialized, because the form is generated server-side and is dynamic. But our form field labels are internationalized, and the user requested this page in Spanish, so we use a translation class that grabs the form field labels stored in mo/csv/xml format.

Contents of contact-us.php:

<label for="first_name"><?php echo $L->_("First Name");?></label>
<input id="first_name" name="first_name" type="text">

The "First Name" message ID translation is pulled from the application-level translation file, which we parse and store in an array that resides in our translation class. So would it be ideal for our code not to parse the mo file on every page request, and instead to serialize the whole array after parsing the mo file and then rely on that serialized dump?

Example #3:

Let's say on our blog page we're pulling in the 5 most recent blog posts.

$posts = BlogClass::sql('SELECT blog_message, blog_author FROM blog_posts ORDER BY blog_date DESC LIMIT 5');

Would we want to rely on something like memcache and just set a key to the result of the SQL statement? Would it serialize the results of the query, or do something else?

Bonus:

If anyone could provide specific examples of efficient or practical uses (or misuses) of serialization, that'd be great: something like a huge multi-page form that pulls in database information and stores data in sessions, or any example where you had to rely on serialize.

+4  A: 

Example 1

Profile.

  • Is it prohibitively costly to generate your content pages?
  • Is it significantly less costly to deserialize your generated content?

If both answers are yes, consider it.
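As a rough illustration of what "profile" could mean here, this is a minimal Python sketch that times generating a fragment against deserializing a stored copy. The XML string is a stand-in for en/blog.xml from the question, and the function names are made up for the example; real numbers depend entirely on your own content and storage.

```python
import pickle
import timeit
import xml.etree.ElementTree as ET

# Stand-in for en/blog.xml from the question (assumption: small static fragment).
XML_SOURCE = (
    "<content><![CDATA[<h1>Welcome to my blog!</h1>"
    "<p>Lorem ipsum dolor sit amet..</p>]]></content>"
)

def generate():
    """Build the page fragment from scratch by parsing the XML."""
    return ET.fromstring(XML_SOURCE).text

# Serialize the generated result once, as a cache would.
CACHED = pickle.dumps(generate())

def load_cached():
    """Rebuild the same fragment from the serialized copy."""
    return pickle.loads(CACHED)

# Time both paths over many runs; compare the two figures before deciding.
generate_seconds = timeit.timeit(generate, number=10_000)
cached_seconds = timeit.timeit(load_cached, number=10_000)
print(f"generate: {generate_seconds:.4f}s  deserialize: {cached_seconds:.4f}s")
```

If the two figures are close, the extra moving part is probably not worth it.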

Example 2

Profile.

  • Is it prohibitively costly to generate your content pages?
  • Is it significantly less costly to deserialize your generated content?

If both answers are yes, consider it.

Example 3

Profile.

  • Is that query prohibitively expensive?
  • Is it significantly faster to grab the data from memcached?

If both answers are yes, consider it.
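If profiling does say yes, the usual shape is "look in the cache, fall back to the query on a miss". Here is a minimal Python sketch of that idea. The dict is only a stand-in for memcached, and `fetch_recent_posts` and `cached` are hypothetical names, not part of any real client library.

```python
import pickle
import time

_cache = {}  # stand-in for memcached: key -> (expires_at, serialized bytes)

def fetch_recent_posts():
    """Hypothetical placeholder for the real database query."""
    return [{"blog_author": "meder", "blog_message": "Hello"}]

def cached(key, producer, ttl_seconds=60.0):
    """Return the cached value for key; call producer() only on a miss or after expiry."""
    entry = _cache.get(key)
    now = time.monotonic()
    if entry is not None and entry[0] > now:
        return pickle.loads(entry[1])  # hit: deserialize the stored copy
    value = producer()                 # miss: run the expensive work once
    _cache[key] = (now + ttl_seconds, pickle.dumps(value))
    return value

posts = cached("recent_posts", fetch_recent_posts)
```

Note that anything read from the cache can be up to `ttl_seconds` old, which is part of the cost you are weighing.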

Bonus

I never serialize my data just because I can. I need to have a reason to do so, otherwise it's just premature optimization. There are several factors that come into deciding whether this should be done.

Performing sorting or other operations on a serialized set of data

This will almost always be a bad idea. For example, if you serialize a result set from a database and then need to reorder it by some field, you're shooting yourself in the foot: you have to deserialize the whole set before you can sort it.
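A small Python sketch of why this hurts, using made-up sample rows: the serialized form is an opaque blob, so any reordering means a full round trip.

```python
import pickle

# Made-up sample rows standing in for a database result set.
rows = [
    {"blog_author": "meder", "blog_date": "2009-08-03"},
    {"blog_author": "hobodave", "blog_date": "2009-08-01"},
]
blob = pickle.dumps(rows)

# The blob is opaque bytes: to reorder it, the whole set has to be rebuilt first.
restored = pickle.loads(blob)
restored.sort(key=lambda row: row["blog_date"])
blob = pickle.dumps(restored)  # and serialized again to store the new order
```

The database could have done the same thing with an ORDER BY and an index.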

Messaging

If you need to communicate serialized data to other services or languages, then the choice of serialization format is critical. I avoid serializing with a language-specific method if I know or suspect that other things may need to read the data. JSON is often an ideal format for cross-language serialization.
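To make the difference concrete, here is a short Python sketch with a made-up record serialized both ways. The JSON form is plain text that any language can parse; the pickle form is Python-specific bytes.

```python
import json
import pickle

# Made-up record for illustration.
post = {"blog_author": "meder", "blog_message": "Hello", "tags": ["php", "python"]}

as_json = json.dumps(post, sort_keys=True)  # text, readable by PHP, JavaScript, etc.
as_pickle = pickle.dumps(post)              # bytes, only meaningful to Python

print(as_json)
print(as_pickle)

# Both round-trip within Python; only the JSON form travels well outside it.
assert json.loads(as_json) == post
assert pickle.loads(as_pickle) == post
```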

Updating serialized data

You have to be willing to regenerate the serialized data whenever its source is updated. It will be prohibitively expensive to do any kind of complex update to the serialized data itself.
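One way to picture "regenerate, don't patch" is to store the dump next to a fingerprint of the source it was built from and rebuild wholesale when the fingerprint changes. The Python sketch below assumes a toy `msgid=translation` text format; `parse` and `load` are hypothetical helpers, not an existing API.

```python
import hashlib
import pickle

def parse(source_text):
    """Hypothetical parser: turns 'msgid=translation' lines into a dict."""
    return dict(line.split("=", 1) for line in source_text.splitlines() if "=" in line)

def load(source_text, stored):
    """Reuse the stored dump only if it was built from exactly this source."""
    digest = hashlib.sha256(source_text.encode("utf-8")).hexdigest()
    if stored is not None and stored[0] == digest:
        return pickle.loads(stored[1]), stored
    parsed = parse(source_text)  # source changed: regenerate wholesale
    return parsed, (digest, pickle.dumps(parsed))

labels, stored = load("First Name=Nombre", None)
```

In practice a file modification time is a cheaper fingerprint than a hash of the contents; the hash just keeps the sketch self-contained.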

Human readability

If you need to read it easily, I suggest avoiding language-specific formats. I suggest JSON for this.

Edit:

I just looked again at the query in Example 3. That is an extremely simple query: you're only selecting 2 fields and ordering by a date field. With a properly indexed table this query should be trivial, and I would not suggest caching something like this in memcached.

hobodave
Generally it's less expensive to deserialize though, right? Or is that a completely silly question because it always depends on the circumstances? PS - excellent answer.
meder
@meder: It's not silly. I am a firm believer in avoiding premature optimization. You have to measure the differences and see if it's worth it. You are introducing complexity into your application, and thus an increased potential for chaos (bugs).
hobodave
+2  A: 

What are the circumstances in which you should actually decide to store data ( using serialize() in PHP, pickle module in Python, et cetera )?

That question is easy to answer. The various scenarios don't actually have much relevance.

Here's the answer: you serialize when you have to. No sooner.

Many APIs will not accept Python objects. When the API cannot accept a Python object, you can often provide a string. That's when you serialize.

Example. You want to save a Python object on persistent storage. Sadly a file object can't write a Python object. So you serialize.

Example. You want to send a Python object to another process. You're using a socket, named pipe or whatever. These are all file objects, and file objects can't write a Python object. So you serialize.

That's when you serialize.
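A minimal Python sketch of the two examples above, using `io.BytesIO` as a stand-in for an open file, pipe, or socket file object (the `labels` dict is made-up sample data):

```python
import io
import pickle

# Made-up sample object to store or send.
labels = {"First Name": "Nombre", "Last Name": "Apellido"}

stream = io.BytesIO()        # stand-in for an open file, pipe, or socket file
pickle.dump(labels, stream)  # file objects take bytes, so the object is serialized

stream.seek(0)               # pretend to be the reader on the other end
restored = pickle.load(stream)
```

Only unpickle data you trust: `pickle.load` can execute arbitrary code from a malicious payload.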

  1. XML files are serialized DOM trees. The Python object is a DOM tree. The XML file is one way to serialize the DOM tree. I don't understand this example.

  2. Form label strings are strings. They don't need to be serialized. I18N is handled separately from your application. http://docs.python.org/library/i18n.html I don't understand this example.

  3. This is a query. You don't serialize anything. You just do the query. The results are (in principle) always changing, so any serialization is the previous result, not the current result, so you just don't.

Bonus. Multi-page, huge form? You don't serialize anything. You just update the session in your web framework. The web framework's session manager might serialize the Python object, but that's why you used a framework -- so you wouldn't have to care.

Serialization is used to write a Python object to a file. This -- in web applications -- is rare. Mostly, you write to databases using SQL.

S.Lott
For #2 I actually meant serializing the result of parsing an application-level binary `mo` file ( in the form of an array ), not a literal single string. Thanks for taking the time to answer, this cleared a bit up as well.
meder
@meder: Serialize an optimized binary `.mo` file? That's crazy. Why would you undo the optimization?
S.Lott