I have a large flat file that I need to process in PHP. I convert the flat file into a normalized database in MySQL. There are several million lines in the flat file.
I originally tried to use an ORM system while importing the flat file. There was a massive PHP memory leak problem with that design, even with careful freeing of objects. And even when I ensured there was enough memory, the script would have taken about 25 days to run on my desktop.
I stripped out the overhead and rewrote the script to build MySQL commands directly. I removed AUTO_INCREMENT from my design, since it required me to ask MySQL for the last inserted id in order to build relations between data points. Instead I keep a global counter in PHP for the database ids, so I never do any lookups, just inserts.
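Roughly what the core of the rewritten script looks like (the table names, columns, and file name below are just placeholders, and I'm showing PDO here even though any MySQL client would do):

```php
<?php
// Sketch of the counter-based import; table/column/file names are placeholders.
$pdo = new PDO('mysql:host=localhost;dbname=import', 'user', 'pass');

$personId  = 0;  // counters kept in PHP replace AUTO_INCREMENT
$addressId = 0;

$fh = fopen('flatfile.txt', 'r');
while (($fields = fgetcsv($fh)) !== false) {
    $personId++;
    $addressId++;

    // The child row reuses $personId directly, so there is no
    // SELECT or lastInsertId() round-trip between the two inserts.
    $pdo->exec(sprintf(
        'INSERT INTO person (id, name) VALUES (%d, %s)',
        $personId, $pdo->quote($fields[0])
    ));
    $pdo->exec(sprintf(
        'INSERT INTO address (id, person_id, city) VALUES (%d, %d, %s)',
        $addressId, $personId, $pdo->quote($fields[1])
    ));
}
fclose($fh);
```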
I use the unix split command to break the one big file into lots of small files, because there is memory overhead associated with reusing the same file pointer over and over.
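For context, the splitting and the per-chunk loop look something like this (the chunk size and file name prefix are just examples):

```php
<?php
// The big file is split on the shell beforehand, e.g.:
//   split -l 100000 flatfile.txt chunk_
// (the line count and prefix are examples, not my exact values)

foreach (glob('chunk_*') as $chunkFile) {
    $fh = fopen($chunkFile, 'r');
    while (($line = fgets($fh)) !== false) {
        // ... build and run the INSERT statements for this line ...
    }
    fclose($fh);  // each chunk gets its own short-lived file pointer
}
```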
Using these optimizations (I hope they help someone else), I got the import script to run in about 6 hours.
I then rented a virtual instance with 5 times more RAM and about 5 times more processor power than my desktop and noticed that the import ran at exactly the same speed. The server runs the process but has CPU cycles and RAM to spare, so perhaps the limiting factor is disk speed. But I have lots of RAM; should I try loading the files into memory somehow? Any suggestions for further optimization of PHP command line scripts processing large files are welcome!