Hi!

Here's the deal. We would have taken the complete static HTML road to solve our performance issues, but since the site will be partially dynamic, this won't work out for us. What we have thought of instead is using memcache + eAccelerator to speed up PHP and take care of caching for the most used data.

Here are the two approaches we have thought of so far:

  • Using memcache on *all* major queries and leaving it alone to do what it does best.

  • Using memcache for the most commonly retrieved data, and combining it with a standard hard-drive-stored cache for further usage.

The major advantage of only using memcache is of course the performance, but as the number of users increases, the memory usage gets heavy. Combining the two sounds like a more natural approach to us, despite the theoretical compromise in performance. Memcached appears to have some replication features available as well, which may come in handy when it's time to increase the nodes.

What approach should we use? Is it stupid to compromise and combine the two methods? Or should we focus solely on memcache and simply upgrade the memory as the load increases with the number of users?

Thanks a lot!

+2  A: 

I would suggest that you first use memcache for all major queries. Then test to find the queries that are used least, or the data that rarely changes, and provide a separate cache for those.

If you can isolate common data from rarely used data, then you can focus on improving performance on the more commonly used data.
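
In PHP this first step is usually the cache-aside pattern: check memcache, fall back to the database on a miss, then populate the cache. A minimal sketch; the PDO connection, the key scheme and the 300-second TTL are illustrative assumptions, not from the answer:

    <?php
    $mc = new Memcached();
    $mc->addServer('127.0.0.1', 11211);

    function cached_query(Memcached $mc, PDO $pdo, $sql, array $params, $ttl = 300)
    {
        // Key the cache on the query text plus its parameters.
        $key = 'q:' . md5($sql . serialize($params));

        $rows = $mc->get($key);
        if ($rows !== false) {
            return $rows;                    // cache hit
        }

        $stmt = $pdo->prepare($sql);         // cache miss: ask the database
        $stmt->execute($params);
        $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

        $mc->set($key, $rows, $ttl);         // populate for the next request
        return $rows;
    }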

AKRamkumar
@AKRamkumar Thanks for your help! That's another interesting angle for this issue.
Industrial
+2  A: 

Compromising and combining the two methods is a very clever approach, I think.

The most obvious cache-management rule is the latency vs. size trade-off, which is also used in CPU caches. In a multi-level cache, each successive level should be larger to compensate for its higher latency: you get higher latency but a higher cache hit ratio. So I wouldn't recommend placing a disk-based cache in front of memcache; conversely, it should be placed behind memcache. The only exception is if your cache directory is mounted in memory (tmpfs). In that case a file-based cache could compensate for high load on memcache, and could even have latency benefits (because of data locality).
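
To make the layering concrete, here is a rough sketch of such a read path: memcache first, a file-based cache (ideally on tmpfs, as noted above) behind it, and the database last. The cache directory, TTLs and promotion policy are assumptions for illustration:

    <?php
    function two_level_get(Memcached $mc, $key, callable $loadFromDb)
    {
        $value = $mc->get($key);                   // level 1: memcache
        if ($value !== false) {
            return $value;
        }

        $file = '/var/cache/app/' . md5($key);     // level 2: disk or tmpfs
        if (is_file($file) && filemtime($file) > time() - 3600) {
            $value = unserialize(file_get_contents($file));
            $mc->set($key, $value, 300);           // promote back into memcache
            return $value;
        }

        $value = $loadFromDb($key);                // last resort: the database
        file_put_contents($file, serialize($value), LOCK_EX);
        $mc->set($key, $value, 300);
        return $value;
    }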

These two storages (file-based, memcache) are not the only ones convenient for caching. You could also use almost any KV database, as they are very good at concurrency control.

Cache invalidation is a separate question that deserves your attention. There are several tricks you can use to handle cache updates on misses more gracefully. The first is dog-pile effect prevention: if several concurrent threads get a cache miss simultaneously, all of them go to the backend (database). The application should allow only one of them to proceed and make the rest wait on the cache. The second is background cache updating: it's better to update the cache not in the web-request thread but in the background, where you can control the concurrency level and update timeouts more gracefully.
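
A common way to implement the dog-pile prevention is to use Memcached::add() as a short-lived lock, since add() is atomic and succeeds for exactly one concurrent client. A sketch; the lock key naming, the 10-second lock TTL and the polling loop are assumptions:

    <?php
    function get_with_lock(Memcached $mc, $key, callable $rebuild, $ttl = 300)
    {
        $value = $mc->get($key);
        if ($value !== false) {
            return $value;
        }

        if ($mc->add("lock:$key", 1, 10)) {    // atomic: only one winner
            $value = $rebuild();               // we rebuild from the backend
            $mc->set($key, $value, $ttl);
            $mc->delete("lock:$key");
            return $value;
        }

        for ($i = 0; $i < 20; $i++) {          // losers wait on the cache
            usleep(100000);                    // 100 ms
            $value = $mc->get($key);
            if ($value !== false) {
                return $value;
            }
        }
        return $rebuild();                     // give up waiting, fall through
    }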

Actually, there is one neat method that gives you tag-based cache tracking (memcached-tag, for example). It's very simple under the hood. With every cache entry you save a vector of versions of the tags it belongs to (for example: {directory#5: 1, user#8: 2}). When you read a cache line you also read the current version of each of its tags from memcached (this can be done efficiently with a multiget). If at least one current tag version is greater than the version saved in the cache line, the entry is invalid. And when you change an object (for example a directory), the appropriate tag version should be incremented. It's a very simple and powerful method, but it has its own disadvantages: in this scheme you can't perform explicit cache eviction, so memcached may easily drop live entries while keeping stale ones.
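
memcached-tag patches the server itself, but the version-vector scheme described above can also be emulated client-side on stock memcached. A sketch; the "tag:" key layout and the entry format are assumptions:

    <?php
    function tag_version(Memcached $mc, $tag)
    {
        $v = $mc->get("tag:$tag");
        if ($v === false) {
            // Initialise with the clock so a re-created (evicted) tag is
            // always newer than any version saved in an old entry.
            $v = time();
            $mc->add("tag:$tag", $v);
        }
        return (int)$v;
    }

    function save_tagged(Memcached $mc, $key, $value, array $tags, $ttl = 300)
    {
        $versions = array();
        foreach ($tags as $tag) {
            $versions[$tag] = tag_version($mc, $tag);
        }
        $mc->set($key, array('v' => $versions, 'data' => $value), $ttl);
    }

    function load_tagged(Memcached $mc, $key)
    {
        $entry = $mc->get($key);
        if ($entry === false) {
            return false;
        }
        // One multiget fetches the current version of every tag at once.
        $tagKeys = array();
        foreach (array_keys($entry['v']) as $tag) {
            $tagKeys[] = "tag:$tag";
        }
        $current = $mc->getMulti($tagKeys);
        foreach ($entry['v'] as $tag => $saved) {
            $now = isset($current["tag:$tag"]) ? $current["tag:$tag"] : PHP_INT_MAX;
            if ($now > $saved) {
                return false;                  // a tag moved on: entry is stale
            }
        }
        return $entry['data'];
    }

    // Invalidating everything tagged directory#5 is then a single call:
    // $mc->increment('tag:directory#5');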

And of course you should remember: "There are only two hard things in Computer Science: cache invalidation and naming things" - Phil Karlton.

dotsid
Hi Dotsid, really interesting thoughts you've got, much appreciated! Are you saying that it should be layered so that requested data goes through the first cache layer, memcache, and if the data in memcache is invalid, the next cache layer is hard-drive based, and if that is no longer valid either, a connection is opened to the database to get the data the user requested?
Industrial
Yeah. I've also added some thoughts on cache invalidation to the answer.
dotsid
Hi again Dotsid! A question: what method do you suggest for keeping track of the keys in the application? I mean, there's no apparent way to "tag" a key in memcache with its origin? It would be super-sweet to be able to do this and invalidate all cached data related to one or more "categories", "parents" or whatever they can be sorted into depending on the app...
Industrial
Here's an old post regarding that by me: http://stackoverflow.com/questions/2510759/organizing-memcache-keys
Didn't get any replies though :(
Industrial
Added some thoughts on tag based cache invalidation.
dotsid
Hi again Dotsid. I am trying to check out the memcached-tag that you linked to, but couldn't install it on my Windows-based WAMP test environment. I guess it's Linux-only, so I'll have to get a Linux environment running before testing it out...
Industrial
Another question: isn't it bad performance-wise to do all the cache updating in the background from e.g. a cron job? I mean, all potential data will end up being cached instead of just the most used?
Industrial
Well, it may be a good choice, but only if you know all the cache lines (and therefore all the queries in the system). This may be true if you have low query variety (e.g. a cache for userA, userB and so on). If not (e.g. a cache for bulletins in directory A with price < $1000, a cache for bulletins in directory A with price < $250 and in mint condition, etc.), it's not a good idea, because you simply can't predict which cache lines are frequently used and which are not. So even if you have a good caching layer, you should think about providing low-latency data storage.
dotsid
Hi. Sadly enough, I don't know from the beginning which queries will be popular. I really need to work out something with memcached-tag to organize the stored data...
Industrial
+1  A: 

You can delegate the combination of disk/memory cache to the OS (if your OS is smart enough). On Solaris, you can actually even add an SSD layer in the middle; this technology is called L2ARC.

I'd recommend you to read this for a start: http://blogs.sun.com/brendan/entry/test.

mindas
Hi! As it seems right now, we will be using CentOS. I will check out Solaris, but that would be a completely new thing to learn, and I'm not sure we can afford to start all over again, learning a new OS from the ground up... Thanks a lot for your help though. Do you know of other OSes that support this feature?
Industrial
Well, it is your choice... but it might be cheaper/quicker to just use Solaris and get the caching for free. And you'd get ZFS, which is probably the best filesystem available today. Unfortunately, I am not aware of anything similar for Linux.
mindas
+1  A: 

Memcached is quite a scalable system. For instance, you can replicate the cache to decrease access time for certain key buckets, or implement the Ketama algorithm, which lets you add and remove Memcached instances from the pool without remapping all the keys. This way, you can easily add new machines dedicated to Memcached when you happen to have extra memory; and since instances can run with different sizes, you can scale one up just by adding more RAM to an old machine. Generally, this approach is more economical and, to some extent, not inferior to the first one, especially for multiget() requests.

Regarding a performance drop as the data grows: the runtime of the algorithms used in Memcached does not vary with the size of the data, so the access time depends only on the number of simultaneous requests. Finally, if you want to tune your memory/performance priorities, you can set the expiry time and available-memory configuration values, which will restrict RAM usage or increase cache hits.
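
For what it's worth, consistent (Ketama-style) hashing is built into the PHP Memcached client, so adding or removing a node remaps only a small share of the keys. A configuration sketch; the host names are placeholders:

    <?php
    $mc = new Memcached();
    $mc->setOption(Memcached::OPT_DISTRIBUTION, Memcached::DISTRIBUTION_CONSISTENT);
    $mc->setOption(Memcached::OPT_LIBKETAMA_COMPATIBLE, true);

    $mc->addServers(array(
        array('cache1.example.local', 11211),
        array('cache2.example.local', 11211),
        array('cache3.example.local', 11211),  // instances may differ in size
    ));

    // One round-trip for many keys, fanned out across the pool.
    $values = $mc->getMulti(array('user:8', 'directory:5', 'settings'));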

At the same time, when you use a hard disk, the file system can become a bottleneck of your application. Besides general I/O latency, such things as fragmentation and huge directories can noticeably affect your overall request speed. Also, beware that default Linux hard-disk settings are tuned more for compatibility than for speed, so it is advisable to configure them properly before use (for instance, with the hdparm utility).

Thus, before adding one more integration point, I think you should tune the existing system. Usually a properly designed database, well-configured PHP, Memcached, and sensible handling of static data are enough even for a high-load web site.

Vitalii Fedorenko
Hi Vitalii. Thanks a lot for your help and your thoughts regarding this question!
Industrial
+1  A: 

Memcached is something that you use when you're sure you need to. You don't worry about it being heavy on memory, because when you evaluate it, you include the cost of the dedicated boxes that you're going to deploy it on.

In most cases putting memcached on a shared machine is a waste of time, as its memory would be better used caching whatever else it does instead.

The benefit of memcached is that you can use it as a shared cache between many machines, which increases the hit rate. Moreover, you can have the cache size and performance higher than a single box can give, as you can (and normally would) deploy several boxes (per geographical location).

Also, the way memcached is normally used depends on a low-latency link to your app servers, so you wouldn't normally use the same memcached cluster across different geographical locations within your infrastructure (each DC would have its own cluster).

The process is:

  1. Identify performance problems
  2. Decide how much performance improvement is enough
  3. Reproduce problems in your test lab, on production-grade hardware with necessary driver machines - this is nontrivial and you may need a lot of dedicated (even specialised) hardware to drive your app hard enough.
  4. Test a proposed solution
  5. If it works, release it to production; if not, try more options and start again.

You should not

  • Cache "everything"
  • Do things without measuring their actual impact.

As your performance test environment will never be perfect, you should have sufficient instrumentation / monitoring that you can measure performance and profile your app IN PRODUCTION.

This also means that every single thing that you cache should have a cache hit/miss counter on it. You can use this to determine when the cache is being wasted. If a cache has a low hit rate (< 90%, say), then it is probably not worthwhile.
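
A minimal way to get such counters is to keep them in memcached itself. A sketch; the stats key names are assumptions, and the ~90% threshold echoes the rule of thumb above:

    <?php
    function counted_get(Memcached $mc, $cacheName, $key)
    {
        $value = $mc->get($key);
        $stat  = $value !== false ? "stats:$cacheName:hit"
                                  : "stats:$cacheName:miss";
        if ($mc->increment($stat) === false) {
            $mc->add($stat, 1);                // first use: create the counter
        }
        return $value;
    }

    function hit_rate(Memcached $mc, $cacheName)
    {
        $hits   = (int)$mc->get("stats:$cacheName:hit");
        $misses = (int)$mc->get("stats:$cacheName:miss");
        $total  = $hits + $misses;
        return $total ? $hits / $total : 0.0;  // below ~0.9: maybe drop the cache
    }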

It may also be worth having the individual caches switchable in production.

Remember: OPTIMISATIONS INTRODUCE FUNCTIONAL BUGS. Do as few optimisations as possible, and be sure that they are necessary AND effective.

MarkR
Hi. We will use VPSes to put up specific boxes for the memcache part. However, do you think it would be wrong to utilize a disc-based cache for the "unpopular" data, or should we leave it all up to memcache?
Industrial
I think that you should use dedicated real tin. If you have performance problems with VMs, the obvious move is to use real tin. Do not waste your dev effort and introduce bugs by adding pointless caching. Caching data on disc is not normally useful, because if it's already on a disc somewhere else, it will be no more efficient unless the other disc is massively over-contended. A disc I/O operation takes how long it takes, regardless of whether it reads the data from a cache or from its original location.
MarkR