tags:

views:

225

answers:

4

Hello to all.

Using PHP (1900-second time limit and more than 1 GB memory limit) and MySQL (via PEAR::MDB2) on this one...

I am trying to create a search engine that loads data from site feeds into a MySQL database. Some sites have rather big feeds with lots of data in them (for example, more than 80,000 records in just one file). Each record goes through some data checking before it is inserted into the database (checking that might itself insert into or update a MySQL table).

My problem is, as many of you might already have guessed... time! For each record in the feed there are more than 20 checks, and for a feed with, e.g., 10,000 records there might be more than 50,000 inserts into the database.

I have tried to do this in two ways:

  1. Read the feed, store the data in an array, then loop through the array doing the data checking and inserts. (This proved to be the faster of the two; a rough sketch follows the list.)
  2. Read the feed line by line, doing the data checking and the insert for each record as it is read.
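
For reference, approach 1 looks roughly like this (a simplified sketch; the feed parsing and the passesAllChecks()/insertRecord() helpers are placeholders for my real MDB2 code and the ~20 per-record checks):

    <?php
    // Approach 1: buffer the whole feed first...
    $records = array();
    foreach (file('feed.txt') as $line) {
        $records[] = explode("\t", rtrim($line, "\r\n"));  // placeholder feed parsing
    }

    // ...then loop through the array, check and insert.
    foreach ($records as $rec) {
        // each of the ~20 checks may itself SELECT, INSERT or UPDATE other tables
        if (!passesAllChecks($rec)) {   // placeholder for the per-record checks
            continue;
        }
        insertRecord($rec);             // placeholder for the actual MDB2 INSERT
    }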

The database has indexes on every field that is constantly queried. The PHP code is tweaked so that there are no extra variables, and the SQL queries are simple SELECT, UPDATE and INSERT statements.

Raising the time limit and the memory limit is not a problem. The problem is that I want this operation to be faster.

So my question is: how can I make the process of importing the feed's data faster? Are there any other tips that I might not be aware of?

A: 

1) You might take a look at PHP's PDO extension and its support for prepared statements. You could also consider using stored procedures in MySQL. 2) You could take a look at other database systems, such as CouchDB, and sacrifice consistency for performance.
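
For example, with PDO you prepare the INSERT once and just re-execute it for every record, so the statement is only parsed a single time (a minimal sketch; the DSN, table and column names are placeholders, and $records stands for your already-parsed feed data):

    <?php
    // Placeholder DSN and credentials.
    $pdo = new PDO('mysql:host=localhost;dbname=feeds', 'user', 'pass');
    $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

    // Prepare the INSERT a single time...
    $stmt = $pdo->prepare(
        'INSERT INTO items (title, url, price) VALUES (:title, :url, :price)'
    );

    // ...and re-execute it for every record in the feed.
    foreach ($records as $rec) {
        $stmt->execute(array(
            ':title' => $rec['title'],
            ':url'   => $rec['url'],
            ':price' => $rec['price'],
        ));
    }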

erenon
Doesn't MDB2 use PDO?
jeezTech
@jeezTech: Even if it does, that's one more extra layer.
prodigitalson
@erenon: You have a good point here.
jeezTech
sorry! the above comment goes to @prodigitalson :)
jeezTech
Why do I always have to repeat "comment after downvote"?
erenon
+1  A: 

If your import is a one-time thing and you use a FULLTEXT index, a simple tweak to speed the import up is to remove the index, import all your data, and add the FULLTEXT index back once the import is done. This is much faster, according to the docs:

For large data sets, it is much faster to load your data into a table that has no FULLTEXT index and then create the index after that, than to load data into a table that has an existing FULLTEXT index.
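
In terms of actual statements, that comes down to something like this (a sketch; the table, index and column names are placeholders):

    <?php
    $pdo = new PDO('mysql:host=localhost;dbname=feeds', 'user', 'pass');

    // Drop the FULLTEXT index before the bulk import.
    $pdo->exec('ALTER TABLE items DROP INDEX ft_title_body');

    // ...run the whole import here...

    // Re-create the FULLTEXT index once, after all the rows are in.
    $pdo->exec('ALTER TABLE items ADD FULLTEXT ft_title_body (title, body)');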

Wookai
+2  A: 

Using LOAD DATA INFILE is often many times faster than using INSERT to do a bulk load.

Even if you have to do your checks in PHP code, dumping the validated data to a CSV file and then using LOAD DATA INFILE can be a big win.
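
A rough sketch of that flow (the file path, table and column names are placeholders, $records is your parsed feed, and passesChecks() stands in for your validation code):

    <?php
    $pdo = new PDO('mysql:host=localhost;dbname=feeds', 'user', 'pass', array(
        PDO::MYSQL_ATTR_LOCAL_INFILE => true,  // allow LOAD DATA LOCAL INFILE
    ));

    // 1) Do the validation in PHP and write the rows that pass to a CSV file.
    $csv = fopen('/tmp/validated_feed.csv', 'w');
    foreach ($records as $rec) {
        if (passesChecks($rec)) {              // placeholder for your checks
            fputcsv($csv, array($rec['title'], $rec['url'], $rec['price']));
        }
    }
    fclose($csv);

    // 2) Bulk-load all the validated rows with a single statement.
    $pdo->exec("
        LOAD DATA LOCAL INFILE '/tmp/validated_feed.csv'
        INTO TABLE items
        FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
        LINES TERMINATED BY '\\n'
        (title, url, price)
    ");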

Bill Karwin
This, in combination with dropping indexes before the LOAD DATA query, and rebuilding them after, is pretty much the ultimate in data import performance.
Frank Farmer
A: 

I managed to double the amount of data inserted in 1800 seconds by using the INSERT DELAYED command. The LOAD DATA INFILE suggestion did not fit my case, since the data has to be strongly validated and it would mess up my code. Thanks for all your answers and suggestions :)
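
For reference, the change was essentially just adding DELAYED to the existing INSERT statements, so the server queues the rows and the script does not wait for each write to finish (shown here as a PDO sketch with placeholder names; note that DELAYED only works with certain storage engines such as MyISAM):

    <?php
    $pdo = new PDO('mysql:host=localhost;dbname=feeds', 'user', 'pass');

    // The row is queued server-side and the call returns immediately.
    $stmt = $pdo->prepare(
        'INSERT DELAYED INTO items (title, url, price) VALUES (?, ?, ?)'
    );

    foreach ($records as $rec) {
        // ...the ~20 validation checks stay in PHP, exactly as before...
        $stmt->execute(array($rec['title'], $rec['url'], $rec['price']));
    }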

jeezTech
LOAD DATA INFILE is still an option: you build a new "INFILE" from the data you read in, after validating it. I don't think anyone was suggesting LOAD DATA INFILE on the original, unvalidated file itself. I don't have solid numbers for MySQL, but I can tell you that for Postgres there is *nothing* faster than the equivalent of LOAD DATA INFILE ('COPY' in Postgres: http://enfranchisedmind.com/blog/2006/11/04/postgres-for-the-win/)
Frank Farmer