views: 143

answers: 5

Hi folks, I have an interesting database problem. I have a DB that is 150GB in size. My memory buffer is 8GB.

Most of my data is rarely retrieved, or is retrieved mainly by backend processes. I would very much prefer to keep it around because some features require it.

Some of it (namely some tables, and some identifiable parts of certain tables) is used very often in a user-facing manner.

How can I make sure that the latter is always kept in memory? (There is more than enough space for it.)

More info: We are on Ruby on Rails. The database is MySQL, and our tables are stored using InnoDB. We are sharding the data across 2 partitions. Because we are sharding it, we store most of our data as JSON blobs, indexing only the primary keys.

Update 2: The tricky thing is that the data is actually used by both backend processes and user-facing features, but it is accessed far less often by the latter.

Update 3: Some people are commenting that 8GB is a toy these days. I agree, but just increasing the DB server's memory is pure LAZINESS if there is a smarter, more efficient solution.

+3  A: 

This is why we have Data Warehouses. Separate the two things into either (a) separate databases or (b) separate schemas within one database.

  1. Data that is current, for immediate access, being updated.

  2. Data that is historical fact, for analysis, not being updated.

150GB is not very big, and a single database can handle your little bit of live data and your big bit of history.

Use a "periodic" ETL process to get things out of active database, denormalize into a star schema and load into the historical data warehouse.

S.Lott
Can you add more detail about what a "separate schema" within one database means? I am not too familiar with this - thanks!
ming yeow
@S. Lott: It's not clear (yet) from the question whether the data used by backend processes is for reporting/data mining or perhaps just non-realtime processing. Agree 100% with using a DW rather than process the transactional DB if they are processing historical data.
Eric J.
thanks! updated the question again as per your comment
ming yeow
I agree, trying to have a single source for analysis needs and low-latency web needs often results in sub-optimal performance for both.
rfusca
Thanks for the great tip! Data warehousing is a little overkill for me at the moment. I think what I need can be achieved by giving MySQL some hints on what to cache, and then breaking up some tables and telling MySQL not to cache those.
ming yeow
@ming yeow: "Data Warehousing" can be overkill if you do it wrong. Splitting your data into 'active' and 'history' is simple, quick, efficient, and still a kind of data warehousing.
S.Lott
@ming: You can improve things with hardly any effort by replicating your transactional data to a slave DB and running the background queries off the slave (assuming they read and don't alter data... the same assumption you make with a data warehouse).
Eric J.
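As a rough sketch of the replication route, run on the slave (MySQL 5.x syntax; host, credentials, and binlog coordinates are placeholders, not values from the question):

    -- Attach the slave to the master and start replicating.
    CHANGE MASTER TO
        MASTER_HOST='db-master.example.com',
        MASTER_USER='repl',
        MASTER_PASSWORD='secret',
        MASTER_LOG_FILE='mysql-bin.000001',
        MASTER_LOG_POS=4;
    START SLAVE;
    -- Backend/reporting jobs then connect to this slave; user-facing queries stay on the master.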
A: 

So, what is the problem?

First, 150GB is not very large today. It was 10 years ago.

Second, any non-total-crap database system will utilize your memory as a cache. If the cache is big enough (compared to the amount of data that is in use), it will be efficient. If not, the only thing you CAN do is get more memory (because, sorry, 8GB of memory is VERY low for a modern server - it was low 2 years ago).

You should not have to do anything for the memory to be used efficiently. At least not on a commercial-level database - maybe MySQL sucks, but I would not assume so.

TomTom
Well, the problem is that lots of disparate bits of data get accessed very often by BACKEND PROCESSES, swapping out the data that I need to keep in memory because it is USER FACING.
ming yeow
Backend processes are usually a lot more patient than users. They will wait while their data is retrieved from disk.
Eric J.
And on top: either you only touch part of the data or you don't. If the backend processes start table scans, then basically you need more memory. Period. Nothing else helps. More memory (what about a REAL server - like 32GB?) and a FAST disk subsystem on top. 8GB is toy category for a database these days.
TomTom
"8GB is toy these days" >> I agree, but just increasing the DB server's memory is pure LAZINESS if there is a smarter, more efficient solution. You do not go "increase the DB" for everything you do.
ming yeow
The innodb buffer cache is a smart, efficient, well-tuned solution written by smart, efficient developers. There's nothing lazy about it. If you want an absolute guarantee that certain tables will always be in the buffer cache, just read them periodically.
Seun Osewa
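A minimal sketch of that "read them periodically" idea, e.g. run from cron; the table and index names are hypothetical:

    -- Touch the hot table on a schedule so its pages stay in the InnoDB buffer pool.
    SELECT COUNT(*) FROM user_profiles;                              -- scans the clustered index into memory
    SELECT COUNT(email) FROM user_profiles FORCE INDEX (idx_email);  -- optionally warm a secondary index too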
A: 

With MySQL, proper use of the Query Cache will keep frequently queried data in memory. You can provide a hint to MySQL not to cache certain queries (e.g. from the backend processes) with the SQL_NO_CACHE keyword.
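A small sketch of the hint in use; the table names and the DEMAND cache setting are illustrative, not taken from the question:

    -- Backend query: skip the query cache so it does not evict user-facing entries.
    SELECT SQL_NO_CACHE user_id, payload
    FROM activity_log
    WHERE created_at < '2010-01-01';

    -- With query_cache_type = 2 (DEMAND), only queries that explicitly ask are cached:
    SELECT SQL_CACHE id, profile_json
    FROM user_profiles
    WHERE id = 42;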

If the backend processes are accessing historical data, or accessing data for reporting purposes, certainly follow S. Lott's suggestion to create a separate data warehouse and query that instead. If a data warehouse is too much to accomplish in the short term, you can replicate your transactional database to a different server and perform queries there (a Data Warehouse gives you MUCH more flexibility and capability, so go down that path if possible).

UPDATE:

UPDATE 2:

I confirmed with MySQL support that there is no mechanism to selectively cache certain tables etc. in the InnoDB buffer pool.

Eric J.
That "SQL_NO_CACHE" looks extremely useful. Could it be used for 1) avoid caching a table?2) avoid caching a SELECT query based on the conditions? Let me know if it should be a seperate question. BTW - you have been amazingly helpful - THANKS!
ming yeow
@Ming: You can use it to explicitly avoid caching a query or explicitly request caching depending on the setting of query_cache_type. I'm not aware of a way to keep certain tables in the innodb_buffer_pool, however, which is essentially your question 1. Added a few links for more details on your question 2.
Eric J.
The query cache does nothing to keep the underlying data in memory; if a novel query touches the same data, the cache does not help. In many cases it is more efficient to dedicate the memory to the InnoDB buffer pool instead of the query cache.
MarkR
@MarkR: The query cache keeps data from repeated queries in memory. That's the point of the cache. The issue with the innodb buffer pool is that background processes with low priority are kicking things out of the innodb buffer pool. Ideally he would want to find a way to mark which tables or portions of tables are eligible for the innodb buffer pool, but I don't know a way to achieve that. The next best thing that I'm aware of is tuning the query cache so that at least frequent queries from the online systems are in cache. With RAM much smaller than data, innodb buffer misses are likely.
Eric J.
The buffer cache is what you want, not the query cache.
Seun Osewa
@Seun: Exactly which "buffer cache" do you mean? That term is used loosely to refer to a number of different caches both within MySQL and at the OS level.
Eric J.
http://dev.mysql.com/doc/refman/5.1/en/innodb-buffer-pool.html To make it scan-resistant, you can go here: http://dev.mysql.com/doc/innodb-plugin/1.1/en/innodb-performance-midpoint_insertion.html
Seun Osewa
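For reference, a minimal sketch of the midpoint-insertion tuning from that second link (requires the InnoDB plugin / MySQL 5.5+; the values are illustrative only):

    SET GLOBAL innodb_old_blocks_pct  = 20;    -- shrink the "old" sublist so one-off scans claim less of the pool
    SET GLOBAL innodb_old_blocks_time = 1000;  -- a page must stay 1000 ms in the old sublist before it can be promoted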
@Seun: Good link regarding scan resistance. It's not clear from the post whether the back-end processes are updating data vs. performing selects on it.
Eric J.
+1  A: 

If the number of columns used in the customer-facing tables is small, you can create indexes containing all the columns used in the queries. This doesn't mean that all the data stays in memory, but it can make the queries much faster. It's trading space for response time.
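As an illustration, a hypothetical covering index for a user-facing lookup (table and column names are made up):

    -- If user-facing queries only read (user_id, name, avatar_url), this index can
    -- answer them from the index alone ("Using index" in EXPLAIN), without touching the full rows.
    CREATE INDEX idx_users_covering ON users (user_id, name, avatar_url);

    EXPLAIN SELECT name, avatar_url FROM users WHERE user_id = 42;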

+1  A: 

This calls for memcached! I'd recommend using cache-money, a great ActiveRecord write-through caching library. The ngmoco branch has support for enabling caching per-model, so you could cache only those things you know you want to keep in memory.

You could also do the caching by hand using $cache.set/get/expire calls in controller actions or model hooks.

jamiew
Thanks! That is a great tip. Do you know if cache-money works with sharded databases? We are not using ActiveRecord per se.
ming yeow
Yep, I'm happily using cache-money on top of the db_charmer library, which handles my sharding/replication setup [1]. YMMV depending on how you are hooking into AR to do the sharding -- happy to provide pointers to the appropriate cache-money internals if you run into issues.
[1] http://github.com/kovyrin/db-charmer
jamiew