I'm writing a database-style program in C (it will store and operate on about 500,000 records). I'm going to be running it in a memory-constrained environment (a VPS), so I don't want memory usage to balloon. I won't be handling huge amounts of data, perhaps up to 200MB in total, but I want the memory footprint to stay in the region of 30MB (I'm pulling these numbers out of the air).

My instinct is to do my own page handling (real databases do this), but I've received advice saying that I should just allocate it all and let the OS do the VM paging for me. My numbers will never rise above this order of magnitude. Which is the better choice in this case?

Assuming the second choice, at what point would it be sensible for a program to do its own paging? Obviously RDBMSes that handle gigabytes must do this, but there must be a point along the scale at which the question is worth asking.
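
For concreteness, my reading of the second option is something like mapping the whole data file and letting the kernel page it in and out on demand. A rough sketch (the file name and record layout are placeholders I made up):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Placeholder record layout. */
    struct record {
        unsigned id;
        double   value;
    };

    int main(void)
    {
        int fd = open("records.dat", O_RDWR);   /* placeholder data file */
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        /* Map the whole file; the kernel pages it in and out as needed. */
        struct record *recs = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
        if (recs == MAP_FAILED) { perror("mmap"); return 1; }

        size_t nrecords = st.st_size / sizeof *recs;
        printf("mapped %zu records\n", nrecords);

        munmap(recs, st.st_size);
        close(fd);
        return 0;
    }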

Thanks!

+9  A: 

Use malloc until it's running. Then, and only then, start profiling. If you run into the same performance issues as the proprietary and mainstream "real databases", you will naturally begin to perform cache/page/alignment optimizations. These things can easily be slotted in after you have a working database, and they are orthogonal to having one.
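
For example, a minimal version of "just use malloc" might look like this (the record layout and sizes are made up for illustration, not taken from the question's actual schema):

    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical record layout -- adjust to the real schema. */
    struct record {
        unsigned id;
        char     name[48];
        double   value;
    };

    int main(void)
    {
        size_t nrecords = 500000;   /* the figure from the question */
        struct record *table = calloc(nrecords, sizeof *table);
        if (table == NULL) {
            perror("calloc");
            return EXIT_FAILURE;
        }

        /* Work on the records directly; no custom paging layer. */
        table[0].id = 1;
        snprintf(table[0].name, sizeof table[0].name, "example");
        table[0].value = 3.14;

        printf("record 0: id=%u name=%s value=%.2f\n",
               table[0].id, table[0].name, table[0].value);

        free(table);
        return EXIT_SUCCESS;
    }

With that made-up 64-byte record, 500,000 records come to roughly 30MB, which is already in the region the question asks for; only profiling will show whether anything fancier is needed.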

Matt Joiner
The only sensible answer. "Premature optimization is the root of all evil."
Alexandre C.
Everybody likes that line. Here I tend to agree with @Matt Joiner, but there is a time to start out with your own paging. If he knew he was going to work with large data sets, then it really might make sense to start with paging rather than waste time on a naive solution.
BobbyShaftoe
Nahhh, there's no guarantee that your first "implementation" will even be faster than the "default". You might as well start at the bottom, which may turn out to be the top, or most of the way up already, for all you know, until you begin profiling and tweaking a working solution. That said, there's absolutely no reason you can't design it carefully, keeping performance and algorithmic optimizations in mind from the get-go. But I wouldn't let that incur a development-time cost until a working solution can provide a control for comparison.
Matt Joiner
+3  A: 

The database management systems that perform their own paging also benefit from huge research efforts invested in making sure their paging algorithms function well under varying system and load conditions. Unless you have a similar set of resources at your disposal, I'd recommend against taking that approach.

The OS paging system you already have at your disposal has benefited from the tuning efforts of many people.

There are, however, some things you can do to tune your OS to favor database-type access (large sequential I/O operations) over the typical desktop tuning (a mix of sequential and random I/O).
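
For example, POSIX lets a program hint its access pattern to the kernel so that the existing paging and read-ahead machinery works in its favor. A sketch (the file name is a placeholder, and the advice calls are purely advisory, so they may be no-ops on some systems):

    #define _POSIX_C_SOURCE 200112L
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("records.dat", O_RDONLY);   /* placeholder data file */
        if (fd < 0) { perror("open"); return 1; }

        /* Tell the kernel to expect large sequential reads so it can
           read ahead aggressively and drop pages behind us. */
        posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        void *base = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }

        /* Same hint for the mapping itself. */
        posix_madvise(base, st.st_size, POSIX_MADV_SEQUENTIAL);

        /* ... scan the records here ... */

        munmap(base, st.st_size);
        close(fd);
        return 0;
    }

The kernel still makes all the paging decisions; you are only telling it what kind of workload to expect.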

In short, if you are a one-man team or a small team, you should probably make use of the existing machinery rather than trying to roll your own in that particular area.

Amardeep
Well, at this point you don't need a full team of researchers at your disposal to write a basic paging system. Sure, there have been many papers written, but that's true of almost anything in CS. It's not really that bad to get something basic going; it's just a good bit more work if you didn't really need it in the first place.
BobbyShaftoe