This problem is pretty hard to describe, which makes it difficult to search for an answer. I hope some experts can share their opinions on it.

I have a table with around 1 million records. The table structure is similar to this:

create table items (
  uid bigint(15) not null primary key,
  updated int(11),
  enabled tinyint(1),
  key (updated),
  key (enabled)
);

The scenario is like this: I have to select all of the records every day and do some processing. It takes around 3 seconds to handle each item.

I have written a PHP script that fetches 200 items each time using the following query:

select * from items where updated < unix_timestamp(now()) - 86400 and enabled = 1 limit 200;

I will then update the "updated" field of the selected items to make sure that they won't be selected again within one day. The update query is something like this:

update items set updated = unix_timestamp(now()) where uid in (1,2,3,4,...);

Then the PHP script continues to run and processes the data, which doesn't require any MySQL connection anymore.


Since I have a million records and each record takes 3 seconds to process, it's definitely impossible to do it sequentially. Therefore, I launch the PHP script every 10 seconds.

However, as time goes by and the table grows, the select gets much slower. Sometimes it takes more than 100 seconds to run!


Do you have any suggestions on how I might solve this problem?

+2  A: 

I don't think the index on enabled is doing you any good; the cardinality is too low. Remove it and your UPDATEs should go faster.
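
A minimal sketch, assuming the index is named after the column (MySQL's default when no explicit name was given); check SHOW INDEX FROM items for the actual name:

SHOW INDEX FROM items;                 -- confirm the real index name first
ALTER TABLE items DROP INDEX enabled;  -- index name assumed to match the column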

I am not sure what you mean when you say each record takes 3 seconds, since you are handling them in batches of 200. How are you determining this, and what other processing is involved?

RedFilter
I fetch 200 records each time and call a REST API. The round-trip time for the REST API is around 3 seconds. Sometimes the API returns an error, and I have to set enabled to 0 so that item will not get fetched anymore.
@terence410: It sounds like the problem has nothing to do with SQL or PHP - it is the REST API that is slow. If that is your code, then you are in luck: you can try to improve its performance. If not, there is not much you can do.
RedFilter
+1  A: 

You could try running this before the update:

ALTER TABLE items DISABLE KEYS;

and then when you're done updating,

ALTER TABLE items ENABLE KEYS;

That should recreate the index much faster than updating it one record at a time.

cHao
ENABLE KEYS on a big table can take a long time, perhaps not a great idea for updating a small number of rows.
David M
Maybe. But then, you're updating the index (perhaps in a big way) each and every time you update a row. Better to do it all at once, methinks, than to say "move these 500000 index entries" each and every time you do an update.
cHao
+3  A: 

There are two points that I can think of that should help:

a. unix_timestamp(now()) - 86400

... this will evaluate now() for every single row; make it a constant by setting a variable to that value before each run (see the sketch below).

b. Indexes help reads but can slow down writes

Consider dropping indexes before updating (DISABLE KEYS) - and then re-add them before reading (ENABLE KEYS).
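
For point a, a minimal sketch using a MySQL user variable (the @cutoff name is purely illustrative):

SET @cutoff := UNIX_TIMESTAMP(NOW()) - 86400;
SELECT * FROM items WHERE updated < @cutoff AND enabled = 1 LIMIT 200;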

amelvin
+1  A: 

You could do this:

  1. dispatcher.php: Manages the whole process.
    • fetches items in convenient packages from the database
    • calls worker.php on the same server with an HTTP post containing all UIDs fetched (I understand that worker.php would not need more than the UID to do its job)
    • maintains a counter of how many worker.php scripts are running. When one is started, the counter is incremented up to a certain limit; when a worker returns, the counter is decremented. See "Asynchronous PHP calls?".
    • repeats until all records have been fetched once. Maintain a MySQL LIMIT counter instead of working with updated.
  2. worker.php: does the actual work
    • does its thing with each item posted.
    • writes to a helper table the ID of each item it has processed (no index on that table)
  3. dispatcher.php: housekeeping.
    • once all workers have returned, updates the main table with the helper table in a single statement (see the sketch below)
  4. error recovery
    • since worker.php would update the helper table after each item is done, you can use the state of the helper table to recover from a crash. Saving the "work package" of each individual worker before it starts running would help to recover worker states as well.

You would have a multi-threaded processing chain this way and could even distribute the whole thing across multiple machines.
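
A minimal sketch of the housekeeping step, assuming the helper table is just a bare list of processed UIDs (the processed_items name is illustrative):

CREATE TABLE processed_items (uid BIGINT NOT NULL);  -- no index, as suggested

UPDATE items
JOIN processed_items USING (uid)
SET items.updated = UNIX_TIMESTAMP(NOW());           -- one statement for the whole batch

TRUNCATE TABLE processed_items;                      -- ready for the next run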

Tomalak
P.S.: If doing asynchronous work turns out to be too difficult in PHP, you could implement the dispatcher in a more convenient language. The above is just an idea; I don't know if it is feasible. Maybe a few modifications are necessary to accommodate PHP's lack of native threads.
Tomalak
A: 

For a table with fewer than a couple of billion records, the primary key should be an unsigned int rather than a bigint.
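
For example, assuming the uid values fit in 32 bits, the change might look like this:

-- INT UNSIGNED holds values up to 4,294,967,295
ALTER TABLE items MODIFY uid INT UNSIGNED NOT NULL;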

A: 

One idea:

Use a HANDLER, which will improve your performance considerably:

http://dev.mysql.com/doc/refman/5.1/en/handler.html
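
A minimal sketch of walking the table in batches through the HANDLER interface (the batch size of 200 mirrors the question; whether it beats the plain SELECT here is untested):

HANDLER items OPEN;
HANDLER items READ `PRIMARY` FIRST LIMIT 200;  -- first batch
HANDLER items READ `PRIMARY` NEXT LIMIT 200;   -- subsequent batches
HANDLER items CLOSE;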

David M
Thanks, let me try that.