We would like to implement an application architecture in which we have large numbers of databases on the disk, one database file for each customer.

When a user request comes in the database is opened (if it isn't already).

After a certain period of time with no activity, the database is closed automatically by the server, thereby freeing up database server resources.

With this architecture we should be able to have large numbers of databases on the disk, but only a subset of them loaded into the database server at any time.
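To illustrate the behaviour we're after, here is a minimal sketch in Python, using the built-in sqlite3 module purely as a stand-in for a file-per-customer store (the class name and the 60-second timeout are our own illustrative choices, not any existing product's API):

```python
# Sketch of the desired behaviour: open a customer's database file on
# demand, track last access, and close databases that have been idle.
import sqlite3
import time

class IdleClosingPool:
    def __init__(self, idle_timeout=60.0):
        self.idle_timeout = idle_timeout
        self._open = {}  # path -> (connection, last access time)

    def get(self, path):
        """Open the customer's database on demand, or reuse an open handle."""
        conn, _ = self._open.get(path, (None, None))
        if conn is None:
            conn = sqlite3.connect(path)
        self._open[path] = (conn, time.monotonic())
        return conn

    def reap_idle(self):
        """Close databases that have seen no activity for idle_timeout."""
        now = time.monotonic()
        for path, (conn, last) in list(self._open.items()):
            if now - last > self.idle_timeout:
                conn.close()
                del self._open[path]
```

A background thread would call reap_idle() periodically; we want the database server itself to do this bookkeeping rather than the application.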

The problem is that few databases seem to support the concept of automatically closing a database. It appears that Microsoft SQL Server allows this, but we're using all open source technologies, so SQL Server is not an option.

We'll consider any free or open source database technology at all but I can't see any that support auto close functionality.

Anyone know any different?

UPDATE: looking for a solution based on Linux not Windows.

Thanks

A: 

Here is an idea, assuming you are using Windows:

  1. Your database will be running as a service, and each client has his own unique service name.
  2. You write a batch file, that will start/stop that service.
  3. The batch file will be called from your server whenever you want.
Omar Al Kababji
I'm wanting individual databases to be closed, not the entire database server.
Simon Wentley
Yes, I mean each database will have its own service, like Oracle does.
Omar Al Kababji
A: 

MySQL with a cron job.

In addition, MySQL has a very small footprint (compared to SQL Server)... one example is that it does not hog memory (and yes, I know one can cap SQL Server's memory usage).

MySQL also has connection pooling, which is very efficient and useful.

Kris Krause
What would the cron job do? Surely it is the database server that knows whether a database has been inactive for a certain period of time. I can't see how an external process would know that.
Simon Wentley
If you use MySQL and want InnoDB queries to be fast, then you have to allocate lots of memory to the InnoDB buffer pool. So this may not be a solution.
intgr
A: 

Have you ascertained that this is really an issue? I only mention that because the cost of an open database is likely quite small: "opening" most likely consists of syncing up any outstanding transactions waiting for the database and doing a basic consistency check (notably loading a couple of pages of data stored on disk).

Once that is done, without activity, there isn't really a lot of data to maintain on the server.

If you think about it, the most basic functionality of a DB system is to manage the caching of database pages in memory. When a request is made for a piece of data, the system locates the actual page it's on and checks RAM to see if it's loaded. If not, it loads it from disk.

If you also notice, the vast bulk of DB "meta" data is stored -- in the database. That means that when the system wants to know anything, it effectively uses itself to locate the information, specifically the data page caching subsystem.

Like any other cache, as data expires and is no longer needed, it gets flushed back out to disk and refetched when necessary.

So, this implies that once a database has been "opened", any information really necessary to maintain its state will likely be maintained via the data cache subsystem, and, for unused databases, released back to disk to make room for current traffic.

This is why I'm curious whether you've tested your candidate DBs to see if you run into issues with this, or whether the database even has the concept of "opening a database".

When, as a client, we discuss this, the focus tends to be on connections to the database server. But once all of those are closed, I do not think that the system is going to retain any significant amount of in memory data about a particular database that is inactive.

After all, all (ALL) of the data in a database is stored "the same", a table is a table is a table, an index is an index is an index, particularly on central server where all of the data pages are managed as a single big "soup" of data.

The only issue you might run into is if your database happens to create a file specifically for each database, and that file remains open. Eventually you may run out of file descriptors.

But most of the modern systems don't do that, they store everything in a big blob of files irrespective of what database or schema they're in (barring specific table space allocations that you make or the server allows, of course).

So, essentially, I don't think this is a problem, as I don't think modern databases really make the kind of distinctions you're talking about internally. Multiple databases or schemas are a logical artifact within the system, not a technical implementation: all of the data pages end up in the same cache and use the same resources regardless of what schema, database, table, or index they come from.

I would do some testing on your database of choice to see if this is a problem. For example, you could create 1M databases, lower the database's memory allocation as much as possible, and then just start cycling through them, opening however many at a time you feel is appropriate (10, 100, 1000, whatever) and see if you have any problems.
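That cycling test could be sketched like this: Python with sqlite3 standing in for your candidate database and deliberately tiny counts (swap in your real client library and real numbers before drawing conclusions):

```python
# Sketch of the stress test described above: create many single-file
# databases, then cycle through them keeping only a bounded window open.
import os
import sqlite3
import tempfile

def cycle_databases(n_databases=100, window=10):
    tmpdir = tempfile.mkdtemp()
    paths = [os.path.join(tmpdir, f"customer_{i}.db") for i in range(n_databases)]
    open_conns = []
    opened = 0
    for path in paths:
        conn = sqlite3.connect(path)  # creates the file on first open
        conn.execute("CREATE TABLE IF NOT EXISTS t (x INTEGER)")
        open_conns.append(conn)
        opened += 1
        if len(open_conns) > window:      # keep only `window` databases open
            open_conns.pop(0).close()
    for conn in open_conns:
        conn.close()
    return opened
```

While this runs, watch memory and file-descriptor usage of the server process to see whether "open" databases actually cost anything.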

Finally, I don't "know" any of this for any specific database, it's just gut instinct on how historically databases are implemented.

Will Hartung
DB2 imposes a maximum number of databases per instance: "Maximum number of databases per instance concurrently in use: 256" (http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/index.jsp?topic=/com.ibm.db2.luw.sql.ref.doc/doc/r0001029.html). As does SQL Server: "Databases per instance of SQL Server: 32,767" (http://msdn.microsoft.com/en-us/library/ms143432.aspx). So I assume that other databases also impose a maximum number of databases per instance.
Simon Wentley
Coming back to the Gmail analogy I made in a separate comment in this thread: when you load Gmail it makes you wait for a while until it's ready to go. I assume that behind the scenes it is fetching your mail database from some slow archival storage and putting it on a fast server, where it stays until you haven't done anything for a certain period of time, at which point it goes back into slow archival storage. It's an appealing architecture where there are lots of clients infrequently accessing relatively small stores of data. Very efficient server utilisation, I would think.
Simon Wentley
A: 

I understand that you could have enough customers for a process to get short on file handles. How about a pool of DB connections?

When a user request arrives, see if that user's DB is open. If so, use the connection and reset its last-access timestamp.

If that user's DB is not open, open the connection, set the last access time, and use the connection (if there's no available connection, throw an error). Also, fork a process/thread/lightweight process/whatever you call it in your environment that checks:

If the pool has an adequate number of unused connections, the thread is done

If not, scan for the oldest-accessed 5%-25%, or those not used in the last minute/hour/day (whatever is appropriate for your user request pattern), close them, and move them to the unused pool

Make sure you keep enough available connections in the unused pool to handle incoming requests.
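A minimal sketch of that eviction step (the names, the 25% fraction, and the thresholds are assumptions; real code would open and close actual connections where the comments indicate):

```python
# Sketch of the eviction policy described above: when too many databases
# are open, close the least recently accessed fraction of them.
import time

class HandlePool:
    def __init__(self, max_open=100, evict_fraction=0.25):
        self.max_open = max_open
        self.evict_fraction = evict_fraction
        self.last_access = {}  # database name -> last access time

    def touch(self, name):
        """Record a request against `name`, opening it if necessary."""
        # (real code would open the database connection here)
        self.last_access[name] = time.monotonic()

    def evict_oldest(self):
        """Close the least recently accessed fraction of open databases."""
        if len(self.last_access) <= self.max_open:
            return []
        n_evict = max(1, int(len(self.last_access) * self.evict_fraction))
        oldest = sorted(self.last_access, key=self.last_access.get)[:n_evict]
        for name in oldest:
            # (real code would close the connection here)
            del self.last_access[name]
        return oldest
```

The background thread would simply call evict_oldest() on a timer.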

mpez0
I'm trying to come up with an architecture in which occasionally used databases are unloaded from the database server when not in use. Keeping all databases in the database server's memory consumes a lot of resources unnecessarily.
Simon Wentley
The architecture you describe is efficient for connection pooling at the client end, but I'm trying to come up with an efficient way of allocating server resources for accessing a large number of infrequently accessed databases.
Simon Wentley
A: 

I assume that by "closing databases" you mean that they would free their cache memory? Because there really is no benefit to "closing" actual files on disk, their resource usage is negligible.

Some database engines use the operating system's disk cache. MySQL's MyISAM storage engine is an example, but doesn't offer integrity guarantees, so that excludes many of its uses. MySQL's other engines like InnoDB do not offer this, however.

PostgreSQL natively uses the operating system's cache as a second-level cache; while the first-level cache (shared_buffers) still consumes memory all the time, it's common to set it to only 10-25% of your memory even on performance-critical servers. The rest is free for OS-level caching, and will be allocated to the database when needed, and is available to other applications when they need it.
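For illustration, a postgresql.conf for an 8 GB Linux host following that sizing might look like this (the exact values are assumptions to tune per workload):

```ini
# Illustrative sizing for an 8 GB host; tune for your workload.
shared_buffers = 2GB          # first-level cache, ~25% of RAM
effective_cache_size = 6GB    # planner hint: RAM left over for the OS page cache
```

With this layout, an inactive database's pages simply age out of both caches on their own; nothing has to be "closed".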

intgr