Hello all,

One line of background: I'm the developer of Redis, a NoSQL database (http://code.google.com/p/redis). One of the new features I'm implementing is Virtual Memory, because Redis keeps all of its data in memory. Thanks to VM, Redis is able to transfer rarely used objects from memory to disk. There are a number of reasons why this works much better than letting the OS do the work for us by swapping (Redis objects are built of many small objects allocated in non-contiguous places; when serialized to disk by Redis they take 10 times less space compared to the memory pages where they live, and so forth).

Now I have an alpha implementation that works perfectly on Linux, but not so well on Mac OS X Snow Leopard. From time to time, while Redis tries to move a page from memory to disk, the Redis process enters the uninterruptible wait state for minutes. I was unable to debug this, but it happens in a call to either fseeko() or fwrite(). After minutes the call finally returns and Redis continues working without any problem at all: no crash.

The amount of data transferred is very small, something like 256 bytes, so it should not be a matter of a large amount of I/O being performed.

But there is an interesting detail about the swap file that is the target of the write operation. It's a big file (26 gigabytes) created by opening it with fopen() and then enlarging it with ftruncate(). Finally the file is unlink()ed, so that Redis keeps holding a reference to it, but we are sure that when the Redis process exits the OS will really free the swap file.
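For reference, this is roughly the pattern described above as a minimal sketch; the helper name and error handling are mine, not the actual Redis code:

    /* Sketch only: create a large swap file, extend it with ftruncate(),
     * then unlink it so the OS reclaims it when the process exits. */
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    FILE *open_swap_file(const char *path, off_t size) {
        FILE *fp = fopen(path, "w+b");            /* create the backing file */
        if (fp == NULL) return NULL;
        if (ftruncate(fileno(fp), size) == -1) {  /* extend to the full size */
            fclose(fp);
            return NULL;
        }
        unlink(path); /* keep our reference; space is freed when we exit */
        return fp;
    }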

OK, that's all, but I'm here for any further detail you need. And by the way, you can find the actual code in the Redis git repository, but it's not trivial to understand in five minutes given that it's a fairly complex system.

Thank you very much for any help.

A: 

Have you turned off file caching for your file? i.e. fcntl(fd, F_GLOBAL_NOCACHE, 1)
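If it helps, a minimal sketch of that call, assuming the swap file is still accessed through a stdio FILE * (the helper name is mine; F_GLOBAL_NOCACHE, like the per-descriptor F_NOCACHE, is Darwin-specific):

    /* Sketch: ask Mac OS X not to keep this file's data in the buffer cache. */
    #include <fcntl.h>
    #include <stdio.h>

    int disable_file_cache(FILE *fp) {
        int fd = fileno(fp);
        return fcntl(fd, F_GLOBAL_NOCACHE, 1);  /* returns -1 on error */
    }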

ergosys
No, it's a good idea if the OS cache can cache the file when there is free memory in the system. Actually one of the legitimate usages is to trade CPU cycles for memory, as data stored in VM is much smaller but slower to access. So in theory it should be a normal file, but if you think this could be the problem I can actually try it. I'll report back my findings. Thanks for the answer.
antirez
A: 

As Linus said once on the Git mailing list:

"I realize that OS X people have a hard time accepting it, but OS X filesystems are generally total and utter crap - even more so than Windows."

Ggolo
Amusing, but not a helpful answer.
sbooth
A: 

Have you tried debugging with DTrace and/or Instruments (Apple's experimental DTrace front-end)?

Exploring Leopard with DTrace

Debugging Chrome on OS X

delano
I tried dtruss in order to see the calls, without much success: no hints about why it takes so long. Probably it's some blocking thing the OS is doing, like materializing part of the file on disk after the ftruncate? I'll try more, and thanks for the links and the answer.
antirez
+6  A: 

As I understand it, HFS+ has very poor support for sparse files. So it may be that your write is triggering a file expansion that is initializing/materializing a large fraction of the file.

For example, I know mmap'ing a new large empty file and then writing at a few random locations produces a very large file on disk with HFS+. It's quite annoying since mmap and sparse files are an extremely convenient way of working with data, and virtually every other platform/filesystem out there handles this gracefully.

Is the swap file written to linearly? Meaning we either replace an existing block or write a new block at the end and increment a free space pointer? If so, perhaps doing more frequent smaller ftruncate calls to expand the file would result in shorter pauses.
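A rough sketch of what growing the file in smaller steps could look like; the step size and helper name are illustrative assumptions, not anything from Redis:

    /* Sketch only: extend the swap file in fixed-size increments instead of
     * one huge ftruncate() up front, so any materialization cost on HFS+ is
     * spread across many short pauses. */
    #include <sys/types.h>
    #include <unistd.h>

    #define SWAP_GROW_STEP (64 * 1024 * 1024)  /* e.g. 64 MB per expansion */

    int grow_swap_file(int fd, off_t current_size, off_t needed) {
        off_t new_size = current_size;
        while (new_size < needed)
            new_size += SWAP_GROW_STEP;
        return ftruncate(fd, new_size);        /* -1 on failure */
    }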

As an aside, I'm curious why redis VM doesn't use mmap and then just move blocks around in an attempt to concentrate hot blocks into hot pages.

Jason Watkins
Hello Jason. Yes, this was my idea too: that for some reason, after the ftruncate() and after a few writes, at some point the HFS+ implementation decides it's time to materialize a huge part of the file. The pages are allocated incrementally; I use an algorithm similar to the one in the Linux kernel. I allocate incrementally up to a given number of pages, then return to the start of the file from time to time searching for free contiguous blocks. So incremental ftruncate()s are a good idea AFAIK. I thought about it but avoided it so that "out of space" can be reported at startup when the disk is too full for the whole file.
antirez
I wonder, does ftruncate() actually reserve the file space even on systems that support sparse files? Also: I've heard that Apple has started work on a new filesystem, not derived from HFS. Until they do, OS X will never be usable for servers, and will be annoying for developers deploying to Linux/Solaris/etc.
Jason Watkins
After trying with a smaller file, the bug disappeared. So I think your answer is right: after ftruncate, the first writes are probably materializing the file. Given that everybody runs Redis on Linux in production this is not a big problem, but it's better to know :) Thanks
antirez
+1  A: 

antirez, I'm not sure I'll be much help since my Apple experience is limited to the Apple ][, but I'll give it a shot.

First thing is a question. I would have thought that, for virtual memory, speed of operation would be a more important measure than disk space (especially for a NoSQL DB where speed is the whole point, otherwise you'd be using SQL, no?). But, if your swap file is 26G, maybe not :-)

Some things to try (if possible).

  1. Try to actually isolate the problem to the seek or write. I have a hard time believing a seek could take that long since, at worst, it should be a buffer pointer change. Still, I didn't write OSX so I can't be sure.
  2. Try adjusting the size of the swap file to see if that's what is causing the problem.
  3. Do you ever dynamically expand the swap file (as opposed to pre-allocation)? If you do, that may be what is causing the problem.
  4. Do you always write as low in the file as you can? Creating a 26G file may not actually fill it with data, but if you create it and then write to the last byte, the OS may have to zero out all the bytes before it (if the initialization was deferred).
  5. What happens if you just pre-allocate the entire file (write to every byte) and don't unlink it? In other words, leave the file there between runs of your program (creating it if it doesn't already exist, of course). Then in your startup code for Redis, just initialize the file (pointers and such). This may get rid of any problems like those in point 4 above (see the sketch after this list).
  6. Ask on the various BSD sites as well. I'm not sure how much Apple changed under the covers but OSX is just BSD at the lowest level (Pax ducks for cover).
  7. Also consider asking on the Apple sites (if you haven't already done so).
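As a hedged sketch of point 5, assuming a stdio-based pre-allocation pass; the helper name, block size, and reuse check are my own illustration, not paxdiablo's or Redis's code:

    /* Sketch: create the swap file once by writing real zero bytes to every
     * position, and reuse it on later runs if it already exists. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/stat.h>

    int ensure_swap_file(const char *path, long long size) {
        struct stat st;
        if (stat(path, &st) == 0 && st.st_size >= size)
            return 0;                       /* left over from a previous run */

        FILE *fp = fopen(path, "wb");
        if (fp == NULL) return -1;

        char block[4096];
        memset(block, 0, sizeof(block));
        for (long long written = 0; written < size; written += sizeof(block)) {
            if (fwrite(block, 1, sizeof(block), fp) != sizeof(block)) {
                fclose(fp);
                return -1;
            }
        }
        fclose(fp);
        return 0;                           /* every byte is now backed on disk */
    }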

Well, that's my small contribution, hopefully it'll help. Good luck with your project.

paxdiablo
Hello, your comment is great! Thank you very much for it. About the size: indeed the whole point is the speed, but there are many datasets where just 5% of the whole dataset is usually actively used, so a big swap file can be handy sometimes. In Redis the user is able to configure both the swap file size (the page size and the number of pages, actually) and the amount of RAM that Redis can use, so it's a matter of tuning the system very well for your dataset. Btw: 1) good idea. 2) indeed, this may confirm whether it is actual file allocation time. 3) it's hard to recover from out of space but...
antirez
antirez, re "5) startup time could be too large": I was suggesting you do this *once* the first time you run the program and leave the swap file there in between runs. That way, subsequent runs won't have to create the file. They'd still have to initialize it but hopefully that would be a case of just writing a few pointer-type values or zero counts into the start of it.
paxdiablo
That way, you always have a swap file when your program starts - there's no possibility that the OS will lazy-create bits of the file as you're using it. You still need code to expand the swap if you run out of space, but that is the case no matter what. The swap maintains its largest size ever achieved (if you want to reduce it, just delete the file outside Redis so that it's recreated when you run, or have an option to let Redis re-create it on startup whether or not the swap file exists).
paxdiablo
Oh, got it, this is also viable indeed. The swap file does not need any initialization at all, as the page table is kept in memory for performance. So actually Redis could just check if the file is there. Also the current approach is lame as it creates the swap file in /tmp. The user will want it created on a fast drive, like an SSD or something alike. So... actually a "named" swap file is better: vm-swap-file <...>. Thanks for the good idea!
antirez