I have a table, watchlist, that as of today contains almost 3 million records.

mysql>  select count(*) from watchlist;
+----------+
| count(*) |
+----------+
|  2957994 |
+----------+

It is used as a log to record product-page views on a large e-commerce site (50,000+ products). It records the productID of the viewed product, the IP address and USER_AGENT of the viewer, and a timestamp of when the view happened:

mysql> show columns from watchlist;
+-----------+--------------+------+-----+-------------------+-------+
| Field     | Type         | Null | Key | Default           | Extra |
+-----------+--------------+------+-----+-------------------+-------+
| productID | int(11)      | NO   | MUL | 0                 |       |
| ip        | varchar(16)  | YES  |     | NULL              |       |
| added_on  | timestamp    | NO   | MUL | CURRENT_TIMESTAMP |       |
| agent     | varchar(220) | YES  | MUL | NULL              |       |
+-----------+--------------+------+-----+-------------------+-------+

The data is then reported on several pages throughout the site, on both the back-end (e.g. checking what GoogleBot is indexing) and the front-end (e.g. a side-bar box for "Recently Viewed Products" and a page showing users what "People from your region also liked", etc.).

So that these "report" pages and side-bars load quickly, I put indexes on the relevant fields:

mysql> show indexes from watchlist;
+-----------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table     | Non_unique | Key_name  | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+-----------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| watchlist |          1 | added_on  |            1 | added_on    | A         |        NULL |     NULL | NULL   |      | BTREE      |         |
| watchlist |          1 | productID |            1 | productID   | A         |        NULL |     NULL | NULL   |      | BTREE      |         |
| watchlist |          1 | agent     |            1 | agent       | A         |        NULL |     NULL | NULL   | YES  | BTREE      |         |
+-----------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+

Without the indexes, pages with the side-bar, for example, would spend about 30-45 seconds executing the query that fetches the 7 most recently viewed productIDs. With the indexes it takes less than 0.2 seconds.
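
For illustration, the side-bar query is roughly of this shape (a simplified sketch; the real query may join to the product table and de-duplicate productIDs):

-- Sketch of the "Recently Viewed Products" query; only the watchlist
-- table itself is taken from the question, everything else is assumed.
-- The added_on index is what lets this finish in well under a second.
SELECT productID
FROM watchlist
ORDER BY added_on DESC
LIMIT 7;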

The problem is that with the indexes, the product pages themselves are taking longer and longer to load, because as the table grows the write operations take upwards of 5 seconds. In addition, the mysqld process spikes to 10-15% of available CPU each time a product page is viewed (roughly once every 2 seconds). We already had to upgrade the server hardware, because on the previous server mysqld was reaching 100% CPU and crashing.

My plan is to attempt a two-table solution: one table for INSERT operations and another for SELECT operations. I plan to purge the INSERT table whenever it reaches 1000 records, using a TRIGGER to copy the oldest 900 records into the SELECT table. The report pages are a mixture of real-time (recently viewed) and analytics (which region), but the real-time pages tend to need only a handful of fresh records, while the analytical pages don't need to know about the most recent trend (i.e. the last 1000 views). So I can use the small table for the former reports and the large table for the latter.
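
For illustration, the move step itself could look something like the sketch below, whether it is driven by a trigger or by a scheduled job. The table names watchlist_insert and watchlist_select are hypothetical stand-ins for the two tables described above, and note that MySQL does not let a trigger modify the table the trigger is defined on, which complicates the purge-via-TRIGGER idea:

-- Hypothetical two-table layout: watchlist_insert receives the writes,
-- watchlist_select serves the report pages.
-- The transaction only protects readers if both tables are InnoDB, and
-- ties on added_on at the boundary can make the two LIMIT sets differ.
START TRANSACTION;

-- Copy the oldest 900 rows into the reporting table...
INSERT INTO watchlist_select (productID, ip, added_on, agent)
SELECT productID, ip, added_on, agent
FROM watchlist_insert
ORDER BY added_on ASC
LIMIT 900;

-- ...then remove those rows from the write table.
DELETE FROM watchlist_insert
ORDER BY added_on ASC
LIMIT 900;

COMMIT;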


My question: Is this an ideal solution to this problem?

Also: with TRIGGERS in MySQL, is it possible to nice the trigger_statement so that it takes longer but doesn't consume much CPU? Would running a niced cron job every 30 minutes, which performs the purging if required, be a better solution?

A: 

Something that often helps when doing bulk loads is to drop any indexes, do the bulk load, then recreate the indexes. This is generally much faster than having the database update every index for each and every row inserted.
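
As a sketch, using the index names from the SHOW INDEXES output above:

-- Drop the secondary indexes, do the bulk load, then rebuild them.
ALTER TABLE watchlist
    DROP INDEX added_on,
    DROP INDEX productID,
    DROP INDEX agent;

-- ... perform the bulk insert here ...

ALTER TABLE watchlist
    ADD INDEX added_on (added_on),
    ADD INDEX productID (productID),
    ADD INDEX agent (agent);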

Adam Batkin
Won't work since reads and writes are interleaved. Rebuilding the indexes would lock the entire table for several minutes, which would render the site unusable.
Emil H
In general it is still a valid optimization, and it might still be faster than doing a thousand concurrent inserts.
Adam Batkin
A: 

A quick fix might be to use the INSERT DELAYED syntax, which allows MySQL to queue the inserts and execute them when it has time. That's probably not a very scalable solution, though.
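
For example (note that INSERT DELAYED is only supported for MyISAM, MEMORY and ARCHIVE tables; the literal values below are placeholders):

-- The client gets control back immediately; mysqld performs the insert
-- when the table is not in use. added_on defaults to CURRENT_TIMESTAMP.
INSERT DELAYED INTO watchlist (productID, ip, agent)
VALUES (12345, '192.0.2.10', 'Mozilla/5.0 (placeholder)');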

I actually think that the principle of what you are attempting is sound, although I wouldn't use a trigger. My suggested solution would be to let the data accumulate for a day and then move it to the secondary log table with a batch script that runs at night. This is mainly because such frequent transfers of a thousand rows would still put a rather heavy load on the server, and because I don't really trust the MySQL trigger implementation (although that isn't based on any real substance).

Emil H
A: 

Write operations for a single row into a data table should not take 5 seconds, regardless of how big the table gets.

Is your clustered index based on the timestamp field? If not, it should be, so you're not writing into the middle of your table somewhere. Also, make sure you are using InnoDB tables - MyISAM is not optimized for writes.
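As a sketch: since added_on is not unique, an AUTO_INCREMENT primary key gives the same append-only insert pattern as clustering on the timestamp would (the id column is an assumption of mine; the table as shown has no primary key):

-- In InnoDB the primary key is the clustered index, so a monotonically
-- increasing key means new rows are appended at the end of the table.
-- Note: this ALTER rebuilds the whole table and locks it while running.
ALTER TABLE watchlist
    ENGINE = InnoDB,
    ADD COLUMN id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST;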

I would propose writing into two tables: one long-term table, one short-term reporting table with little or no indexing, which is then dumped as needed.

Another solution would be to use memcached or an in-memory database for the live reporting data, so there's no hit on the production database.

One more thought: exactly how "live" must either of these reports be? Perhaps retrieving a new list on a timed basis, rather than once for every page view, would be sufficient.

richardtallent
You are right; 5 seconds was an exaggeration.
rwired
A: 

Instead of optimizing the indexes, you could offload the database writes entirely: delegate the writing to a background process via an asynchronous queue (ActiveMQ, for example). Inserting a message into an ActiveMQ queue is very fast. We use ActiveMQ and see about 10-20K insert operations on our test platform (and that is a single-threaded test application, so you could get more).

dotsid