I'm pretty new to caching strategies and implementations. I'm working on a project that will be database-intensive but will also have information updated and changed very regularly.

I've found enough info to know generally how to develop the caching function, but what I'm unsure about is the general strategy.

If I cache all query results and group them into logical sets that I can clear on sensible triggers, I'll probably end up with tens of thousands (at least) of tiny files in my cache. Would it make more sense to cache only large query results?

I know that this is a somewhat hardware-specific question, but generally speaking, at what volume of files does caching become somewhat pointless? Meaning, if you're loading up the file system with all of these tiny files, does access to them eventually become slow enough that you might as well not have cached the information to start with?

Thanks all, I'm interested in any opinions you have to offer

EDIT: Based on the responses saying this is absolutely application-specific, let me pose the question this way, which should be universal:

Assuming that I have an application that depends on one table with 1,000,000 items in it...

Would it be quicker to do a query to retrieve one of those items directly from the database, or to retrieve one of those items from my cache directory with 1,000,000 files, each containing the details of one of those items?

EDIT: Apparently 100,000 wasn't enough to get a valid answer, let's make it 1,000,000. Anyone want to go for 1,000,000,000? Because I can do it...

+2  A: 

The general rule is: don't cache until it's necessary, and cache only the things that actually need to be cached.

zerkms
Is that a cache-y way of saying, get it to work first, and then worry about how quick it is?
Brian Hooper
Ok. So I'll make it an option and only turn it on when things start to slow down. :) But is it possible to cache so many things that it either doesn't improve performance or actually degrades it?
Chris
**NEVER** cache because your application is slow. That is a recipe for disaster. Cache because you have too much load (meaning too many concurrent requests). But if your application is slow without a cache, adding one is only putting a band-aid on a gunshot wound... If it's slow, fix the slowness...
ircmaxell
@Brian Hooper: exactly. Each optimization (like caching or changing an algorithm) is very context-dependent, so a good solution at the start can give weird results later. So why waste the time?
zerkms
@ircmaxell: say this to Facebook, who **caches** the data in memcached instead of improving MySQL performance, huh.
zerkms
@Chris: for what reason? Cache **only** things that need to be cached, and they can be detected **only** in a production environment and **only** under a real application workload. So write the application and profile it.
zerkms
@zerkms they cache the data in memcache because they have millions of page views per second. Not because each query is slow. Even uncached, I'd put money that a single page view would be rendered in less than 0.1 seconds. So they cache not because it's slow, but because they have concurrency issues... Otherwise, what happens if you get a cache flush event (restart, server crash, memcached server goes down, etc)? Does their site come to a crashing stop? No...
ircmaxell
"Not because each query is slow". @ircmaxell: i did not say that he need to cache query if it is without indexes and slow as result, but I said cache if it is slow in the meaning "it is slow and it cannot be improved in other way". and my facebook sample proves that: they cache because without cache data extraction would be slow **because of concurrency**. would not it?
zerkms
There's a huge difference between slow and expensive. An expensive query may be able to complete very fast (in a few milliseconds), but be expensive enough that more than 50 or 100 of them at the same time will bring a server to its knees. They cache to reduce the expense, not the slowness. I'm really sticking on this because there are a lot of people who will read this and say `My app is slow, let me just add caching to "fix it"`. And that's flat out wrong. You add caching to reduce server load, not speed anything up...
ircmaxell
@ircmaxell - What are you smoking?
Coronatus
@ircmaxell: he did not ask about **speed**, and I **did not** answer about speed **improvements**. Also, I did give him a hint about profiling. Well, is my answer still terribly wrong?
zerkms
@ircmaxell - I understand your point and it's well taken. Let's just say, though, that my question wasn't about motive. I'm going to have a very query-driven site with a lot of information, and this is more a question of general file system performance vs. database performance.
Chris
@Chris: can you answer me this, then: **why do you want to cache**? Do you have any reason to cache?
zerkms
@ircmaxell - One additional point: I'm pretty sure that for anyone with the slightest bit of foresight, reduced server load equals increased speed, since the benefit applies across your whole user base. So at what point does an overloaded cache increase server load beyond the usefulness of caching in the first place?
Chris
@zerkms - does it really matter? My code runs fast when I'm the only one using it, but before I roll this out to anyone else, I want to make sure that I'm using whatever hosting I'm paying for in the most efficient way possible.
Chris
@Chris: no one can know in advance what the bottleneck will be in the production environment; this has been proven by many professionals, many times. So wait until production and profile. Don't waste time on work that may never give any benefit.
zerkms
@Chris: Well, that's the point: decreased server load does not equal increased speed. It'll reduce the slowdown due to the load, but it won't "speed it up". From my experience, if your cache is overloaded to the point that it defeats the purpose of caching, you really need to re-engineer your caching system... A cache should never be slower than what you're caching. And typically there will be other bottlenecks before your cache...
ircmaxell
@ircmaxell - But at this point we're talking semantics. If it's reducing the slowdown, and you ask any user what's happening, they'll say it's faster. My script is optimized - i.e. I can't get any more speed from changing my script - and at this point I don't need more speed. And when my server starts getting hammered, it's going to slow down. And my goal is to prepare for that slowdown and maximize whatever hardware my app is sitting on.
Chris
@ircmaxell: you're mixing up "speed" and "performance". Let me demonstrate with a simple example: memcached on a dedicated server can be slower than reading a local file at 1 query per second (your idea that a "cache should never be slower than what you're caching" fails here), BUT **it will be faster** at 100 queries per second. So performance/speed should be compared in the context of the real environment.
zerkms
"And my goal is to prepare for that slowdown and maximize whatever hardware my app is sitting on." @Chris: noone can predict what will be a bottleneck. so it is pointless to cache everything.
zerkms
@zerkms: actually, on any reasonably intelligent modern operating system, if you're accessing a file 100 times per second (reads only), it should still be faster than a remote cache, since the OS keeps the file in memory (so it doesn't even need to hit the disk after the first read). That's why it's important to give a server enough RAM so it's happy...
ircmaxell
omg, that was an example. MySQL on a trivial query will be faster than memcached, and that doesn't mean, as you know, that we should remove memcached.
zerkms
+9  A: 

Use MySQL's built in query cache instead of trying to maintain it yourself. It will automatically clear cached queries to tables when they are written to. Plus, it works in memory so it should be very efficient...
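For instance, you can confirm the query cache is actually switched on before relying on it. A minimal sketch in Python, assuming mysql-connector-python and placeholder credentials (note the query cache only exists in MySQL versions before 8.0, where it was removed):

```python
# Check that MySQL's query cache is enabled (sketch; credentials are placeholders).
import mysql.connector

conn = mysql.connector.connect(
    host="localhost", user="app", password="secret", database="mydb"
)
cur = conn.cursor()

# Both query_cache_type and query_cache_size must be non-zero for caching to happen.
cur.execute("SHOW VARIABLES LIKE 'query_cache%'")
for name, value in cur.fetchall():
    print(name, "=", value)

cur.close()
conn.close()
```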

Also, don't just cache queries. Try to cache entire segments of the application at different stages in the rendering cycle. So you can let MySQL cache the queries, then you cache each individual view (rendered), each individual block, and each page. Then, you can choose whether or not to pull from cache based upon the request.

For example, a non-logged-in user may get the full page directly from cache. But a logged-in user may not be able to (due to username, etc). So for him, you may be able to render 1/2 your views on the page from cache (since they don't depend on the user object). You still get the benefit of caching, but it'll be tiered based upon need.

If you're really expecting a lot of traffic, it's definitely worth looking into Memcached. Let MySQL store your queries for you, and then store all user-land cache items in memcache...
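As a rough sketch of that tiering, assuming Python with pymemcache; `cached_block`, `render_page`, and the key scheme are invented placeholders, not this answer's actual code:

```python
# Tiered fragment caching: shared blocks come from memcached, user-specific
# blocks are re-rendered for logged-in users (illustrative sketch).
from pymemcache.client.base import Client

mc = Client(("localhost", 11211))

def cached_block(name, render, ttl=300):
    """Return a rendered block, pulling from memcached when possible."""
    key = "block:" + name
    html = mc.get(key)
    if html is None:
        html = render().encode("utf-8")
        mc.set(key, html, expire=ttl)
    return html.decode("utf-8")

def render_page(user, blocks):
    # blocks: list of (name, render_fn, user_specific) tuples.
    out = []
    for name, render, user_specific in blocks:
        if user is not None and user_specific:
            out.append(render())                    # personalised: never cached
        else:
            out.append(cached_block(name, render))  # shared: cache-friendly
    return "".join(out)
```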

Edit: To answer your edit:

Filesystems can become slow if a single directory grows big. As long as you're "namespacing" by directory (so each directory only has a small portion of cache files), you should be fine from that standpoint. As for the exact threshold, it really will depend on your hardware and filesystem more than anything else. I know EXT3 gets quite slow if there are a load of files in a single directory (I have directories with literally hundreds of thousands of files, and it can take up to half a second to simply stat() one of the files, let alone do any kind of directory listing)...
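A minimal sketch of that directory namespacing in Python; the cache root and the two-level fan-out are illustrative assumptions, not tuned values:

```python
# Shard cache files across subdirectories by hash prefix so no single
# directory ever holds more than a small fraction of the entries.
import hashlib
import os

CACHE_ROOT = "/var/cache/myapp"  # placeholder location

def cache_path(key):
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    # Two levels of 256-way fan-out: 1,000,000 files => ~15 per directory.
    return os.path.join(CACHE_ROOT, digest[:2], digest[2:4], digest)

def write_cache(key, data):
    path = cache_path(key)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(data)

def read_cache(key):
    try:
        with open(cache_path(key), "rb") as f:
            return f.read()
    except FileNotFoundError:
        return None
```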

But realize that if you add another server, you're going to either have duplication of cache (which is not a good thing), or are going to have to rewrite your entire cache layer. Is there a reason not to go with Memcached right from the start?

Edit 2: To answer your latest edit:

It's still too tough to call. I have an application that has a database with around 1.5 billion rows (growing at around 500k per day). We don't use any caching on it at all because we don't have concurrency issues. And even if we did, we'd be better off throwing more MySQL servers at it rather than adding caching since any form of cache would have such a low hit rate that it wouldn't be worth the development time to add it.

And that's the reason I am so adamant about not caching for speed. There will always be an object that is not in cache, so if you hit a page with one of those objects, it still needs to be fast. As a rule of thumb, I try to cache anything that will be accessed again in the next few minutes (I keep a time-to-live of about 5 minutes in production on other applications anyway). So if items aren't getting more than a few hits in that time span, or the hit rate is very low (less than 90%), I don't bother caching that item...
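As an illustration of that rule of thumb, here is a tiny in-process cache sketch (Python, all names invented) with a 5-minute TTL that tracks its own hit rate so low-value items can be spotted:

```python
import time

class TTLCache:
    """Toy cache: entries expire after `ttl` seconds; hit rate is tracked."""

    def __init__(self, ttl=300):  # ~5 minutes, per the rule of thumb above
        self.ttl = ttl
        self.store = {}           # key -> (expires_at, value)
        self.hits = 0
        self.misses = 0

    def get(self, key):
        entry = self.store.get(key)
        if entry is not None and entry[0] > time.time():
            self.hits += 1
            return entry[1]
        self.misses += 1
        return None

    def set(self, key, value):
        self.store[key] = (time.time() + self.ttl, value)

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0  # aim for >= 0.90
```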

ircmaxell
Thank you. That was exactly what I was curious about.
Chris
A: 

This is both hardware- and application-dependent. You need to run benchmarks to determine the threshold at which the OS's file-lookup overhead exceeds the data storage/retrieval time (both at the MySQL level and at the cached-file-access level). And you also need to weigh that against the acceptable (very subjective) response-time threshold for your audience.
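One hypothetical way to run such a benchmark in Python, matching the question's scenario; mysql-connector-python is assumed, and the table, column, and cache-file layout are placeholders:

```python
# Time N random primary-key lookups against MySQL vs. N reads of the
# corresponding one-item-per-file cache, on the real hardware.
import random
import time
import mysql.connector

def bench_db(conn, ids):
    cur = conn.cursor()
    start = time.perf_counter()
    for i in ids:
        cur.execute("SELECT details FROM items WHERE id = %s", (i,))
        cur.fetchone()
    return time.perf_counter() - start

def bench_files(ids):
    start = time.perf_counter()
    for i in ids:
        with open("/var/cache/myapp/items/%d" % i, "rb") as f:
            f.read()
    return time.perf_counter() - start

ids = random.sample(range(1, 1_000_001), 1000)
conn = mysql.connector.connect(
    host="localhost", user="app", password="secret", database="mydb"
)
print("db:    %.3f s" % bench_db(conn, ids))
print("files: %.3f s" % bench_files(ids))
```

Run it more than once: the first pass warms the OS page cache, so only repeated runs reflect steady-state behavior.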

stillstanding