I have a large flat file that I need to process in PHP. I convert the flat file into a normalized database in MySQL. There are several million lines in the flat file.
I originally tried to use an ORM system while importing the flat file. There was a massive PHP memory leak problem with that design, even with careful freeing of objects. And even when I ensured there was enough memory, the script would have taken about 25 days to run on my desktop.
I stripped out the overhead and rewrote the script to build MySQL commands directly. I removed AUTO_INCREMENT from my design, since it required me to ask MySQL for the last inserted id in order to build relations between data points. Instead I keep a global counter in PHP for the database ids, so I never do any lookups, just inserts.
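Roughly what the core of the rewritten script looks like (the table names, columns, and file name below are just placeholders, and I'm showing PDO here even though any MySQL client would do):

```php
<?php
// Sketch of the counter-based import; table/column/file names are placeholders.
$pdo = new PDO('mysql:host=localhost;dbname=import', 'user', 'pass');

$personId  = 0;  // counters kept in PHP replace AUTO_INCREMENT
$addressId = 0;

$fh = fopen('flatfile.txt', 'r');
while (($fields = fgetcsv($fh)) !== false) {
    $personId++;
    $addressId++;

    // The child row reuses $personId directly, so there is no
    // SELECT or lastInsertId() round-trip between the two inserts.
    $pdo->exec(sprintf(
        'INSERT INTO person (id, name) VALUES (%d, %s)',
        $personId, $pdo->quote($fields[0])
    ));
    $pdo->exec(sprintf(
        'INSERT INTO address (id, person_id, city) VALUES (%d, %d, %s)',
        $addressId, $personId, $pdo->quote($fields[1])
    ));
}
fclose($fh);
```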
I use the unix split command to break the one big file into lots of small files, because there is memory overhead associated with reusing the same file pointer over and over.
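For context, the splitting and the per-chunk loop look something like this (the chunk size and file name prefix are just examples):

```php
<?php
// The big file is split on the shell beforehand, e.g.:
//   split -l 100000 flatfile.txt chunk_
// (the line count and prefix are examples, not my exact values)

foreach (glob('chunk_*') as $chunkFile) {
    $fh = fopen($chunkFile, 'r');
    while (($line = fgets($fh)) !== false) {
        // ... build and run the INSERT statements for this line ...
    }
    fclose($fh);  // each chunk gets its own short-lived file pointer
}
```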
Using these optimizations (I hope they help someone else), I got the import script to run in about 6 hours.
I then rented a virtual instance with 5 times more RAM and about 5 times more processor power than my desktop and noticed that the import ran at exactly the same speed. The server runs the process but has CPU cycles and RAM to spare, so perhaps the limiting factor is disk speed. But I have lots of RAM; should I try loading the files into memory somehow? Any suggestions for further optimization of PHP command line scripts processing large files are welcome!