views:

684

answers:

6

I have a PHP script that builds a binary search tree over a rather large CSV file (5MB+). This is nice and all, but it takes about 3 seconds to read/parse/index the file.

Now I thought I could use serialize() and unserialize() to quicken the process. When the CSV file has not changed in the meantime, there is no point in parsing it again.

To my horror I find that calling serialize() on my index object takes 5 seconds and produces a huge (19MB) text file, whereas unserialize() takes an unbearable 27 seconds to read it back. Improvements look a bit different. ;-)

So - is there a faster mechanism to store/restore large object graphs to/from disk in PHP?

(To clarify: I'm looking for something that takes significantly less than the aforementioned 3 seconds to do the de-serialization job.)
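For reference, the caching scheme the question describes could be sketched like this (a minimal, self-contained sketch; `build_index_from_csv` is a hypothetical stand-in for the real parse/index step, reduced here to a flat key/value array):

```php
<?php
// Rebuild the index only when the CSV is newer than the cached copy.
function build_index_from_csv(string $csvFile): array {
    $index = [];
    foreach (file($csvFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
        [$key, $value] = explode(',', $line, 2);
        $index[$key] = $value;
    }
    return $index;
}

function load_index(string $csvFile, string $cacheFile): array {
    if (is_file($cacheFile) && filemtime($cacheFile) >= filemtime($csvFile)) {
        return unserialize(file_get_contents($cacheFile)); // the 27 s step
    }
    $index = build_index_from_csv($csvFile);               // the 3 s step
    file_put_contents($cacheFile, serialize($index));
    return $index;
}
```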

+1  A: 

If you want speed, writing to or reading from the file system is less than optimal.

In most cases, a database server will be able to store and retrieve data much more efficiently than a PHP script that is reading/writing files.

Another possibility would be something like Memcached.

Object serialization is not known for its performance but for its ease of use, and it's definitely not suited to handling large amounts of data.

Techpriester
Is there no binary serialization format for PHP that writes memory bytes to the disk and simply reads them back again? If the CSV is all strings and the index object actually contains less info than the text file, why must its serialized form be so bloated?
Tomalak
@Tomalak: check out pack/unpack
Robert
@Robert: Looks like pack works for individual values only, not for complex objects.
Tomalak
@tomalak: serialize is slower because it does a lot of things that you don't always see when it comes to objects and classes. It also relies heavily on recursion to build a string representation of nested data structures, which may also be slow. I think when you already have table-oriented data (CSV), a relational database is the best option.
Techpriester
A: 

I see two options here:

string serialization, in the simplest form something like

  write => implode("\x01", (array) $node);
  read  => $a = explode("\x01", $line);
           $node->payload = $a[0]; $node->value = $a[1]; // etc.

binary serialization with pack()

  write => pack("fnna*", $node->value, $node->le, $node->ri, $node->payload);
  read  => $node = (object) unpack("fvalue/nle/nri/a*payload", $data);

It would be interesting to benchmark both options and compare the results.
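A micro-benchmark of the two encodings for a single node might look like this (field names follow the answer's example; absolute timings will vary by machine):

```php
<?php
// Compare string vs. binary encoding of one tree node, 100k iterations each.
$node = (object) ['value' => 3.5, 'le' => 1, 'ri' => 2, 'payload' => 'row-data'];

$t0 = microtime(true);
for ($i = 0; $i < 100000; $i++) {
    $str = implode("\x01", (array) $node);          // string serialization
}
$implodeTime = microtime(true) - $t0;

$t0 = microtime(true);
for ($i = 0; $i < 100000; $i++) {
    $bin = pack('fnna*', $node->value, $node->le, $node->ri, $node->payload); // binary
}
$packTime = microtime(true) - $t0;

printf("implode: %.3fs  pack: %.3fs  (%d vs %d bytes per node)\n",
       $implodeTime, $packTime, strlen($str), strlen($bin));
```

The pack() variant also yields a noticeably smaller record, since the two child offsets shrink to 16-bit integers.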

stereofrog
The tree has a root node. Would it be enough to `pack()` that root node, I mean would it pack the entire graph?
Tomalak
no, pack only processes individual variables
stereofrog
Then it is not an option, I'm afraid. :-\
Tomalak
+1  A: 

It seems that the answer to your question is no.

Even if you discover a "binary serialization format" option, most likely even that would be too slow for what you envisage.

So, what you may have to look into using (as others have mentioned) is a database, memcached, or an online web service.

I'd like to add the following ideas as well:

  • caching of requests/responses
  • your PHP script does not shut down but becomes a network server to answer queries
  • or, dare I say it, change the data structure and method of query you are currently using
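The first idea in the list above could be sketched as a simple disk cache keyed by the query, so repeated queries never touch the index at all (`cached_lookup` is a hypothetical helper, not an existing API):

```php
<?php
// Cache individual query results on disk; only a cache miss runs the
// expensive lookup against the full index.
function cached_lookup(string $query, callable $lookup, string $cacheDir) {
    $file = $cacheDir . '/q_' . md5($query) . '.cache';
    if (is_file($file)) {
        return unserialize(file_get_contents($file));   // cache hit
    }
    $result = $lookup($query);                          // cache miss
    file_put_contents($file, serialize($result));
    return $result;
}
```

This only pays off if the same queries recur, of course; a cold cache still costs the full 3-second parse.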
zaf
I think you are right here. Too bad, but that's how it is.
Tomalak
You have a rich data source which offers many creative ideas, I'm sure you'll come up with something very smooth.
zaf
A: 

var_export should be much faster, as PHP won't have to parse a serialized string at all:

// export the processed CSV to export.php
$php_array = read_parse_and_index_csv($csv); // takes 3 seconds
$export = var_export($php_array, true);
file_put_contents('export.php', '<?php $php_array = ' . $export . '; ?>');

Then include export.php when you need it:

include 'export.php';

Depending on your web server setup, you may have to adjust permissions so export.php is readable by the web server user first.
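A quick way to sanity-check the round trip (self-contained; the small array stands in for the parsed CSV index):

```php
<?php
// Verify that var_export + include reproduces the array exactly.
$php_array = ['alpha' => 1, 'beta' => [2, 3]];   // stand-in for the CSV index
$file = sys_get_temp_dir() . '/export_demo.php';
file_put_contents($file, '<?php $php_array = ' . var_export($php_array, true) . ';');

$original = $php_array;
unset($php_array);
include $file;                       // restores $php_array
var_dump($php_array === $original);  // bool(true)
```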

dave1010
A: 

Igbinary, learn it Mother-What-The-Fuckers.

var_export can only be fast if you have bytecode caching (e.g. APC) and you read the file multiple times, or even compile it to bytecode on output. When read from unusual storage systems, it wouldn't be interpreted anywhere near as fast as igbinary.
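A drop-in way to try igbinary without committing to it is a pair of wrappers that fall back to the built-in serializer when the extension isn't loaded (the `fast_*` names are illustrative, not an existing API):

```php
<?php
// igbinary-backed serialization with a graceful fallback.
function fast_serialize($data): string {
    return function_exists('igbinary_serialize')
        ? igbinary_serialize($data)   // compact binary format
        : serialize($data);           // built-in text format
}

function fast_unserialize(string $blob) {
    return function_exists('igbinary_unserialize')
        ? igbinary_unserialize($blob)
        : unserialize($blob);
}
```

Note that the two formats are not interchangeable on disk: a cache file written with one must be read back with the same one.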

Second:

Use GeoIP: they provide a CSV file and, surprise surprise, a binary file with a fast PHP wrapper library for reading it.

Dump the data into a persistent storage engine, e.g. MySQL, or use the library, or something...

I can't believe what shitty advice stack-overflow commenters have given on this one.

smeghead
-1 for being foulmouthed and rude.
Ryan Gooler
A: 

SQLite comes with PHP, so you could use that as your database. Otherwise you could try using sessions; then you don't have to serialize anything yourself, you're just saving the raw PHP object.
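The SQLite route could look like this (a minimal sketch using PDO with an in-memory database for brevity; pointing the DSN at a file path would persist the table across requests, and the sample rows stand in for the real CSV data):

```php
<?php
// Load CSV rows into an indexed SQLite table once, then look rows up
// with SQL instead of walking an in-memory tree.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE csv (k TEXT PRIMARY KEY, payload TEXT)');

$ins = $db->prepare('INSERT INTO csv (k, payload) VALUES (?, ?)');
foreach ([['a', 'first row'], ['b', 'second row']] as $row) {
    $ins->execute($row);
}

$sel = $db->prepare('SELECT payload FROM csv WHERE k = ?');
$sel->execute(['b']);
echo $sel->fetchColumn(), "\n"; // second row
```

The PRIMARY KEY gives you the same O(log n) lookup the hand-built tree does, and the indexing cost is paid only once at load time.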

Brent Baisley
@Brent: Can I share the object between sessions in PHP?
Tomalak
You couldn't share it between different sessions. Although you could probably get everyone using the same session by setting a custom session ID. Otherwise you would have to look into using shared memory. http://php.net/manual/en/book.shmop.php
Brent Baisley