views:

618

answers:

5

I've recently implemented memcache on my site, which has been under heavy MySQL load (MySQL was as optimized as I could make it). It solved all my load issues, and the site is running beautifully.

The problem I'm facing now is stale cached values. I have 1-hour auto-expiration times on most pages, and I'm also deleting the key when the value in the DB changes, but I'm having difficulty keeping track of all the keys and efficiently clearing them out.

On some pages, it's trivial. I can make the key the item ID (item_4653, for example), and when the data for it is updated, or the item is deleted, the key is cleaned out.

But on most pages, I take the script filename plus the query string, md5 it, and use that as the key in memcache. This is especially useful for complex URLs (which are very common).
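A minimal sketch of that key scheme (the function name is illustrative, not from my actual code): script filename plus query string, hashed with md5 so even very long URLs fit within memcache's key length limit.

```php
<?php
// Build a memcache key from the script name and query string.
// md5 keeps long, messy URLs within memcache's 250-byte key limit.
function page_cache_key($script, $query_string)
{
    return 'page_' . md5($script . '?' . $query_string);
}

// e.g. page_cache_key('index.php', 'search_keywords=good&page=2')
```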

For example, say I have the following page loaded:

index.php?search_keywords=good&search_section=1&sort=release&page=2

It will contain a list of items, which will be fetched from memcache. Another user then submits an item which has "good" in its title, and it happens to fall in the range of values where it would appear on page 2. Except it will not appear there until the cache is refreshed. What makes this even more complicated is that the newly added item will also appear on index.php?sort=newest, as well as index.php?category=some_category&page=1, and so on. Each one of those will have a unique key (md5 of the script name + query string).

So the newly added item might appear on dozens of pages if they were fetched from a live DB, but it won't be visible on any of them until the stale cache is updated. The only option is to wait for the cached entry to expire automatically.

This problem becomes even more pronounced on my forum (custom coded), where the values HAVE to be updated on demand for all possible cached page combinations. Let's say I have a 4-page thread, and I notice 3 spam posts on page 2. After deleting them, page 2 is rebuilt, but then pages 3 and 4 also have to be rebuilt, otherwise there will be duplicate posts on the newly rebuilt page 2 and the old page 3. That's just one example, too; there are dozens of these scenarios.

Any ideas?

+1  A: 

You may benefit from a simpler naming scheme for your memcached keys - so they are easier to delete. Seems like with the MD5 solution, you might be creating too many keys for things which generally show the same data.

You might also consider a shorter cache time, like 20 minutes?

Also - how many items per page are you retrieving for each of these search result pages? If you have a paginated search - getting 50 items from the server shouldn't be too intensive.

You may have tuned the mysql server, but have you tuned the queries (improving them by examining the EXPLAIN output), or table structures (by adding useful indexes)?

I'm also wondering how intense the queries on those pages are. Do you join several tables? You may benefit from doing a simpler query - or a few queries (outlined below).

Alternatively - For each row in the result, do you run another query - or several? You may benefit from a slightly more complex search query that avoids you having to do the nested queries. Or, are you being bitten by an ORM library which does the same thing, runs a search, then queries for sub items on each iteration?

The 'a few simpler queries' solution: say, for example, you've got an item and want to know its category in the result set...

Instead of this:

SELECT i.id, i.name, c.category
FROM items AS i
INNER JOIN categories AS c
  ON i.category_id = c.id;

This is a simple example - but say there were categories, and several other JOINs involved.

You might go this route:

// run this query and put the result into a keyed array
SELECT id, category FROM categories;

// then in PHP create an array keyed by the id
$categories = array();

while ( false !== ( $row = mysql_fetch_assoc ( $result ) ) ) 
{
  $categories[ $row['id'] ] = $row['category'];
}

// and so on
$types = array(); // ...
// etc.

Then do your search but without all of the JOINS, just from the items table with your where clauses, and in the output say...

<?php foreach($items as $item): ?>
  <h4><?php echo $item['name']; ?></h4>
  <p>Category:  <?php echo $categories[ $item['category_id'] ]; ?></p>
  <p>Type:  <?php echo $types[ $item['type_id'] ]; ?></p>
  <!-- and so on -->
<?php endforeach; ?>

It's a little ghetto, but maybe this - and the other suggestions - will help.

Phillip Harrington
If the URL query string is different, the data it shows is also different. There is no duplication going on. I'm not running any JOIN queries (except for searches).
Yegor
I realize the data will be different for a different query string. What I meant by a simpler cache key system was something you can anticipate, and easily delete. The rest of the advice was to the effect of: how can you speed up the queries so you don't have to memcache everything? If there aren't any joins going on, then something else is slowing down your queries. Memcached is a good band-aid for that, but... Good luck!
Phillip Harrington
+2  A: 

Memcached::set has an expire parameter. Perhaps you can let this default to an hour, but for the pages that return search results - or in your forum, you can set this to a shorter period of time.
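To illustrate, a sketch of per-page-type TTLs (the page types and the specific numbers here are my assumptions, not from the original post). With the pecl/memcache extension, `Memcache::set` takes the expiry in seconds as its fourth argument.

```php
<?php
// Pick a TTL based on how volatile the page type is.
function cache_ttl($page_type)
{
    switch ($page_type) {
        case 'search': return 300;  // search results: 5 minutes
        case 'forum':  return 60;   // forum pages: 1 minute
        default:       return 3600; // everything else keeps the 1-hour default
    }
}

// Usage, assuming $memcache is a connected Memcache instance:
// $memcache->set($key, $html, 0, cache_ttl('search'));
```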

Phillip Harrington
I already talked about this in my original post. On a forum, even 1 minute is too long.
Yegor
Agreed, but for the search, it might help.
Phillip Harrington
I mean really, for the search (and as you said), you have no way to anticipate what the query string will be, or which page a given result might land on. I think the md5 of the query string is a good key, but the expiration time might help you there, at least give you some relief. 5-minute updates aren't completely unreasonable there. In general though, especially with the forums, that's why I started by talking about exploring SQL optimization again. I mean, how much data are we talking here? What are the EXPLAIN statements telling you, and so on.
Phillip Harrington
Then it defeats the purpose of memcache, and it will run a query every time, since most searches aren't going to be searched 10 times every second, or even every minute. Just the top few will actually utilize the cached value; all other rare searches will always have an expired cache, and searching is the most costly operation in my case.
Yegor
A: 

What you could do to make sure that your cache is always up to date, without making lots of changes to your code, is work with a "version cache". This does increase the number of memcache requests you will make, but it might be a solution for you.

Another good thing about this solution is that you can set the entries to never expire.

The idea is basically to have a version number stored in memcache for, in your case, each keyword (per keyword, not per combination). How to use this?

When someone submits a new item:

  • for every word in the title, do: `if (!Memcache::increment("version_" . $keyword)) { Memcache::set("version_" . $keyword, 1); }`

When someone executes a query:

  • the md5 thing that you are doing is already OK. Additionally, you need to add the version of every keyword in your search string to the memcache key.

This ensures that as soon as a keyword has new results (or fewer, after deleting), the version will be bumped, and with it all related memcache keys.

The cache is always up to date, and entries can potentially stay in the cache longer than 1 hour.
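A self-contained sketch of this version-cache idea, using a plain array in place of memcache so it runs standalone; with real memcache you would use `Memcache::increment` / `Memcache::set` instead, and the helper names are illustrative.

```php
<?php
// On item submit: bump the version of every keyword in the title.
function bump_keyword_versions(array &$store, array $keywords)
{
    foreach ($keywords as $kw) {
        $vkey = 'version_' . $kw;
        $store[$vkey] = isset($store[$vkey]) ? $store[$vkey] + 1 : 1;
    }
}

// On search: fold each keyword's current version into the page key, so
// bumping any keyword's version retires the old cache entry automatically.
function versioned_search_key(array $store, $query_string, array $keywords)
{
    $parts = array($query_string);
    foreach ($keywords as $kw) {
        $v = isset($store['version_' . $kw]) ? $store['version_' . $kw] : 0;
        $parts[] = $kw . ':' . $v;
    }
    return md5(implode('|', $parts));
}
```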

Folke
+1  A: 

A couple of simple things you can do:

First, if you really want to use the query string as a cache key, make it more deterministic and predictable. I'd do this by sorting the query string, e.g.: ?zed=7&alpha=1 is transformed to ?alpha=1&zed=7. Also strip out variables that aren't relevant to the cache key.
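A sketch of that normalization step (the stripped parameter names are just examples of irrelevant variables): sort the parameters and drop the ones that don't affect the page, so equivalent URLs share a single cache key.

```php
<?php
// Normalize the query parameters before hashing them into a cache key.
function normalized_cache_key($script, array $params)
{
    unset($params['sid'], $params['ref']); // hypothetical irrelevant params
    ksort($params);                        // ?zed=7&alpha=1 == ?alpha=1&zed=7
    return md5($script . '?' . http_build_query($params));
}
```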

To handle the problem of the ?page parameter, and items not showing up because the cache hasn't refreshed, I've got a couple ideas:

Folke's idea of adding a 'version' to the cache key would work well; the same trick is used to make links appear unvisited, since a changed URL is treated as new.

Another approach would be to store the number of pages in the cache value, and then, when the database is updated, iterate through the cache keys.

cache.put("keyword,page=3", array(num_pages=7, value=...))

...later...
update_entry()
num_pages, value = cache.get("keyword,page=3")
for i in 1..num_pages:
  cache.flush("keyword,page=" + i)

Whether this is a good idea or not depends on how many pages there are, and the chance of updates coming in while the loop is running.

A third idea is to cache the entire result set instead of just that page of results. This may or may not be an option depending up on the size of the result set. When that result set is updated, you just flush the cache for that keyword.

cache.put("keyword", array(0="bla", 1="foo", ...))
...later...
cache.get("keyword")[page_num]

A fourth idea is to change your caching backend and use something built to handle this situation. I dunno what other cache servers are out there, so you'll have to look around.

Finally, to supplement all this, you can try and be smarter about the expire time on cache entries. e.g., use the mean time between updates, or the number of queries per second for the keyword, etc.

Richard Levasseur
Paginated updating is a pretty good idea, but applying it to other situations might not be practical. What about having some sort of "key index" which would keep track of keys that are grouped together, and flush those groups out when needed?
Yegor
Do you mean the "key index" maps keywords to a list of page numbers? That would optimize the loop and remove the guessing required. The increment() functions would be useful for that. Or do you mean it maps a keyword to a list of other keys that have to be flushed? That'd work just as well (same thing, really).
Richard Levasseur
+7  A: 

Since you are caching entire pages in memcached, your pages can't share cached data from the database with each other. Say I have page1.php and page2.php, with page1 and page2 as keys in memcached. Both pages display items. I add a new item. Now I have to expire page1 and page2.

Instead, I could have an items key in memcached that page1.php and page2.php both use to display items. When I add a new item, I expire the items key (or better, update its value), and both page1.php and page2.php are up-to-date.
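A sketch of that shared-key idea, with a tiny in-memory stand-in for memcache so it runs standalone (FakeCache just mirrors the get/set/delete shape of the Memcache class; with real memcache you'd pass a connected instance instead).

```php
<?php
// In-memory stand-in mirroring Memcache's get/set/delete shape.
class FakeCache
{
    private $data = array();
    public function get($key) { return isset($this->data[$key]) ? $this->data[$key] : false; }
    public function set($key, $value) { $this->data[$key] = $value; }
    public function delete($key) { unset($this->data[$key]); }
}

// page1.php and page2.php both call this; only a cache miss hits the DB.
function get_items($cache, $fetch_from_db)
{
    $items = $cache->get('items');
    if ($items === false) {
        $items = $fetch_from_db();
        $cache->set('items', $items);
    }
    return $items;
}
// When an item is added, delete (or better, re-set) the single 'items' key
// and every page that uses it is fresh again.
```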

If you still want to cache the entire page, you could add information to your keys that will change when data being cached changes (this wouldn't make sense if the data changes too often). For instance:

"page1:[timestamp of newest item]"

This way you can look up the timestamp of the newest item (an inexpensive query) and build your cache key with it. Once a newer item is added, the cache key changes, automatically expiring the old entry. This method means you still have to hit the database every time to see what the newest item's timestamp is.
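A sketch of the timestamp-in-the-key trick (the function name and the MAX query are illustrative): one cheap query fetches the newest item's timestamp, and as soon as a newer item lands the key changes, so stale entries simply stop being read and age out on their own.

```php
<?php
// Fold the newest item's timestamp into the cache key so the key
// rolls over automatically whenever a newer item is added.
function timestamped_page_key($page, $newest_item_ts)
{
    return $page . ':' . $newest_item_ts;
}

// e.g. $ts  = ...; // result of SELECT MAX(created) FROM items
//      $key = timestamped_page_key('page1', $ts);
```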

Ben Marini
I think this is the best way to approach caching: cache the data itself and not the generated pages. You can maintain caches of individual items (such as forum posts or products) and caches of lists. If a particular list needs to update at a certain interval, have a thread on the server refresh the cache and have all clients just read from the cache. This way your DB load is constant. You can also update the relevant caches whenever the data is changed in the DB; if you write the new value to the cache at that time, it saves a read later. Caching whole pages sounds like a world of pain.
Mr. Shiny and New