Why can't DBMSes rely on the OS buffer pool?

Stonebraker's paper (Operating System Support for Database Management) explains that, "the overhead to fetch a block from the buffer pool manager usually includes that of a system call and a core-to-core move." Set aside the buffer-replacement strategy and similar issues. The only point I'm asking about is the quoted one.

My understanding is that when a DBMS wants to read a block x, it issues an ordinary read call. That should be no different from any other application requesting a read.

I'm not looking for generic answers (I have them, and I've read the papers). I'm looking for a detailed answer to the problem described. See http://stackoverflow.com/questions/3171616/does-a-file-read-from-a-java-application-invoke-a-system-call

The operating system disk i/o must be generalised to work for a variety of situations. The DBMS can sometimes gain significant performance using less general code that is optimised to its own needs.

The DBMS does its own caching, so doesn't want to work through the O/S caching. It "owns" the patch of disk, so it doesn't need to worry about sharing with other processes.

Update: the link to the paper helps.

Firstly, the paper is almost thirty years old and is referring to long-obsolete hardware. Notwithstanding that, it makes quite interesting reading.

Secondly, understand that disk I/O is a layered process. It was in 1981 and is even more so now. At the lowest level, a device driver issues physical read/write instructions to the hardware. Above that may be the O/S kernel code, then the O/S user-space code, then the application. Between a C program's fread() and the disk heads moving, there are at least three or four levels, and there might be considerably more. A DBMS seeking to improve performance might bypass some of these layers and talk directly to the kernel, or even lower.
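To make the top two layers concrete, here is a minimal Python sketch (not from the paper). It assumes CPython, where `io.BufferedReader` plays roughly the role of C's stdio buffering and each raw `readinto()` on a regular file maps to one `read()` system call. The class and helper names are made up for illustration.

```python
import io
import os
import tempfile


class CountingFileIO(io.FileIO):
    """Raw, unbuffered file that counts readinto() calls.

    Assumption: each raw readinto() corresponds to one read() system call,
    so the counter approximates the number of kernel crossings.
    """

    def __init__(self, path):
        self.raw_reads = 0
        super().__init__(path, "r")

    def readinto(self, buffer):
        self.raw_reads += 1
        return super().readinto(buffer)


def raw_reads_unbuffered(path):
    """Read the file one byte at a time straight from the raw layer."""
    with CountingFileIO(path) as raw:
        one_byte = bytearray(1)
        while raw.readinto(one_byte):
            pass
        return raw.raw_reads


def raw_reads_buffered(path, buffer_size=1024):
    """Read one byte at a time through a user-space buffer, as fread() would."""
    raw = CountingFileIO(path)
    with io.BufferedReader(raw, buffer_size=buffer_size) as reader:
        while reader.read(1):
            pass
    return raw.raw_reads


def make_temp_file(size):
    """Hypothetical helper: create a throwaway file of `size` bytes."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(b"x" * size)
    return path
```

On the same file, the unbuffered loop crosses into the kernel once per byte, while the buffered loop does so roughly once per buffer-full. The layers below that (kernel cache, driver, hardware) are not visible from this sketch.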

I recall some years ago installing Oracle on a Sun box. It had an option to dedicate a disk as a "raw" partition, where Oracle would format the disk in its own manner and then talk straight to the device driver. The O/S had no access to the disk at all.

Michael J 2010-07-03 15:37:59

I couldn't vote down the answer. I understand the general answer, but I want a detailed answer to the issue described above. Why would doing its own caching (the buffer manager) be better in terms of what the quoted statement describes? (I understand access-pattern issues, etc.)

simpatico 2010-07-03 15:42:46

The question has had five edits and changed substantially since I answered it. My answer relates to the original version of the question.

Michael J 2010-07-03 22:08:59

I'm looking for more details. To my understanding, the DBMS must issue a read just like any other application (I'm considering the raw partition option). To read, it must have an address, and that must be a virtual memory address.

simpatico 2010-07-05 15:44:31

Why must disk I/O be linked to the VM manager? The DBMS can (if set up that way) give commands directly to the physical disk controller. I'm not sure whether any modern DBMSs actually do that, but it is certainly possible.

Michael J 2010-07-06 03:43:58

It's mainly a performance issue. A dbms has highly specific and unusual I/O demands.

The OS may have any number of processes doing I/O and filling its buffers with the assorted cached data that this produces.

And of course there is the issue of size and of what gets cached (a dbms may be able to cache more effectively for its own needs than the more generic device buffer cache can).

Then there is the issue that a generic "block" may amount to a considerably larger I/O burden than a dbms would ideally like to bear (this depends on partitioning and the like). The dbms's own cache can be tuned to the layout of the data on disk and so minimise I/O.

A further thing is the issue of indexes and similar means to speed up queries, which of course works rather better if the cache actually knows what these mean in the first place.

2010-07-03 15:41:01

not to the point. I asked about a specific aspect.

simpatico 2010-07-05 15:41:25

The real issue is that the file buffer cache is not in the filesystem used by the DBMS; it's in the kernel and shared by all of the filesystems resident in the system. Any memory read out of the kernel must be copied into user space: this is the core-to-core move you read about.

Beyond this, some other reasons you can't rely on the system buffer pool:

DBMSes often have a very good idea of their upcoming access patterns but cannot communicate those patterns to the kernel. This can lead to lower performance.
The buffer cache is traditionally stored in a fixed-size kernel memory range, so it cannot grow or shrink. That also means the cache is much smaller than main memory, so a DBMS relying on the buffer cache would be unable to take full advantage of system resources.
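On the first point, POSIX systems today do expose a coarse hint, `posix_fadvise`, although it only covers broad patterns such as "sequential" or "random" and the kernel is free to ignore it. A minimal sketch in Python, assuming a Unix-like platform (the function name `read_sequentially_with_hint` is made up for illustration):

```python
import os
import tempfile


def read_sequentially_with_hint(path, block_size=8192):
    """Scan a file front to back after hinting that access will be sequential.

    Returns the number of bytes read.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        # The hint is advisory and not available on every platform,
        # so its absence or failure is simply ignored here.
        if hasattr(os, "posix_fadvise"):
            try:
                # offset=0, length=0 means "the whole file".
                os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
            except OSError:
                pass
        total = 0
        while True:
            block = os.read(fd, block_size)
            if not block:
                break
            total += len(block)
        return total
    finally:
        os.close(fd)
```

This is far less expressive than what a DBMS knows about its own query plan, which is the gap described above.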

Andres Jaan Tack 2010-07-03 17:00:49

Oh, hell. I got excited about replacement policies and you explicitly weren't asking about that. I'll see if I can research a better answer.

Andres Jaan Tack 2010-07-03 17:05:43

There we go, and I learned something.

Andres Jaan Tack 2010-07-03 17:44:09

Reading from your other question, and working forward:

When the DBMS must bring a page from disk, it will involve at least one system call. At this point most DBMSs place the page into their own buffer. (The page also ends up in the OS's buffer, but that's unimportant.)

So, we have one system call. However, we can avoid any further system calls. This is possible because the DBMS is caching pages in its own memory space. The first thing the DBMS will do when it decides it needs a page is check and see if it has it in its cache. If it does, it retrieves it from there without ever invoking a system call.
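The check-the-cache-first behaviour can be sketched as follows. This is an illustrative toy, not how any particular DBMS implements it: the 4096-byte page size, the LRU policy, and the names `BufferPool`, `get_page`, and `demo` are all assumptions, and `os.pread` requires a Unix-like platform.

```python
import os
import tempfile
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size for this sketch


class BufferPool:
    """Tiny user-space page cache with LRU eviction (illustrative only)."""

    def __init__(self, fd, capacity):
        self.fd = fd
        self.capacity = capacity
        self.pages = OrderedDict()  # page number -> page bytes
        self.syscalls = 0           # number of pread() calls issued

    def get_page(self, page_no):
        page = self.pages.get(page_no)
        if page is not None:
            # Cache hit: served from the process's own memory, no system call.
            self.pages.move_to_end(page_no)
            return page
        # Cache miss: one system call copies the page from the kernel.
        self.syscalls += 1
        page = os.pread(self.fd, PAGE_SIZE, page_no * PAGE_SIZE)
        self.pages[page_no] = page
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)  # evict the least recently used page
        return page


def demo():
    """Request six pages, only two of them distinct, and count system calls."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"".join(bytes([i]) * PAGE_SIZE for i in range(4)))
        path = f.name
    fd = os.open(path, os.O_RDONLY)
    try:
        pool = BufferPool(fd, capacity=2)
        for page_no in [0, 0, 0, 1, 1, 0]:
            pool.get_page(page_no)
        return pool.syscalls
    finally:
        os.close(fd)
        os.unlink(path)
```

Here `demo()` issues six page requests but only two system calls. Had every request gone through the OS buffer pool instead, each one would have cost a system call plus a copy, even though the data was already in RAM.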

The DBMS is free to expire pages in its cache in whatever way is most beneficial for its IO needs. The OS's cache is expired in a more general way since the OS has other things to worry about. One example of this is that a DBMS will typically use a great deal of memory to cache pages as it knows that disk IO is one of the most expensive things it can do. The OS won't do this as it has to balance the cost of disk IO against having memory for other applications to use.
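As one illustration of why a DBMS might want its own expiry policy, consider a scan that loops repeatedly over a table slightly larger than the cache. The simulation below is a sketch with made-up names (`count_misses`) and a toy workload, and it involves no real I/O. It compares a generic least-recently-used policy against evicting the most recently used page.

```python
from collections import OrderedDict


def count_misses(accesses, capacity, evict_most_recent):
    """Count cache misses for a sequence of page accesses.

    evict_most_recent=False gives LRU; True gives MRU.
    """
    cache = OrderedDict()  # ordered from least to most recently used
    misses = 0
    for page in accesses:
        if page in cache:
            cache.move_to_end(page)
            continue
        misses += 1
        if len(cache) >= capacity:
            cache.popitem(last=evict_most_recent)
        cache[page] = True
    return misses


# Three passes over a 4-page table with room for only 3 pages.
looping_scan = [0, 1, 2, 3] * 3
lru_misses = count_misses(looping_scan, capacity=3, evict_most_recent=False)
mru_misses = count_misses(looping_scan, capacity=3, evict_most_recent=True)
```

With this workload LRU misses on all 12 accesses, because it always evicts the page that the scan will need next, while MRU misses only 6 times. A DBMS knows it is running a looping scan; a general-purpose OS cache does not.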

Donnie 2010-07-03 17:55:55

I know this is old, but it came up as unanswered.

Essentially:

The OS uses a separate address space for every process.
Retrieving information from any other address space requires a system call or page fault. **(see below)
The DBMS is a process with its own address space.
The OS buffer pool Stonebraker describes is in the kernel address space.

So ... to get data from the kernel address space to the DBMS's address space, a system call or page fault is unavoidable.

You're correct that accessing data from the OS buffer pool manager is no more expensive than a normal read() call. (In fact, it's done with a normal read call.) However, Stonebraker is not talking about that. He's specifically discussing the caching needs of DBMSes, after the data has been read from the disk and is present in RAM.

In essence, he's saying that the OS's buffer pool cache is too slow for the DBMS to use because it's stored in a different address space. He's suggesting using a local cache in the same process (and therefore same address space), which can give you a significant speedup for applications like DBMSes which hit the cache heavily, because it will eliminate that syscall overhead.

Here's the exact paragraph where he discusses using a local cache in the same process:

However, many DBMSs including INGRES [20] and System R [4] choose to put a DBMS managed buffer pool in user space to reduce overhead. Hence, each of these systems has gone to the trouble of constructing its own buffer pool manager to enhance performance.

One note on the "core-to-core move" in the excerpt you quote above: in a 1981 paper, "core" almost certainly means main memory rather than a CPU core, so the phrase describes copying the block from the kernel's buffer into the caller's buffer. A buffer pool in the DBMS's own address space avoids that copy, along with the system call, on every cache hit.

** BTW, I believe Stonebraker's 1981 paper is actually pre-mmap. He mentions it as future work. "The trend toward providing the file system as a part of shared virtual memory (e.g., Pilot [16]) may provide a solution to this problem."
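For comparison, this is roughly what the page-fault route looks like today with a memory-mapped file. It is a minimal sketch, assuming a non-empty file and a 4096-byte page size; the function name `read_page_via_mmap` is made up for illustration.

```python
import mmap
import os
import tempfile


def read_page_via_mmap(path, page_no, page_size=4096):
    """Return one page of a file by indexing into a read-only memory mapping.

    After the mapping is set up, the data is reached through ordinary memory
    accesses. The kernel fills pages in on demand via page faults rather than
    through an explicit read() call per page.
    """
    with open(path, "rb") as f:
        # Length 0 maps the whole file; mapping an empty file raises ValueError.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            start = page_no * page_size
            return bytes(view[start:start + page_size])
```

The mapping shares the kernel's cached pages with the process instead of copying them on each read, but the DBMS still has no say in which pages the kernel keeps or evicts.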

Drew Thaler 2010-08-20 18:41:25
