Hi guys, could someone help me understand how hard drive seeking works?

I have a small binary database file for which read performance is absolutely essential. If I need to skip a few bytes in the file, is it quicker to use seek() or to read() and then discard the unwanted data?

If the average seek time of a hard drive is 10 ms and the read speed is 300 MB/s, I calculated that it's quicker to read() than to seek() for anything smaller than 3 MB (10 ms × 300 MB/s = 3 MB). Is that true? Is there an overhead when performing a new seek that reading an existing stream doesn't have?

Which do you think would be a more suitable file structure for an index?

Entry1:Value:PointerIntoData
Entry2:Value:PointerIntoData
Entry3:Value:PointerIntoData
Data, Data, Data

Or

Entry1:Value:Data
Entry2:Value:Data
Entry3:Value:Data

When reading an entry, if the value is not correct it will be ignored. So when streaming the file, is it quicker to:
1. use seek() to skip over an entry that is not required,
2. read an entry that is not needed and then discard the data, or
3. use the first structure and, when an entry is required, seek() into a data repository at the end?

Entry is 4 bytes, value is 8 bytes & data is 12KB
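Roughly, in C terms, the two record layouts I have in mind would look something like this (field names are made up, and the GCC-style packed attribute is only there so the sizes stay at 4 + 8 + 8 and 4 + 8 + 12288 bytes):

    #include <stdint.h>

    /* Layout 1: compact index records up front, 12 KB data blobs stored after them. */
    struct index_record {
        uint32_t entry;        /* 4-byte entry id */
        uint64_t value;        /* 8-byte value used for filtering */
        uint64_t data_offset;  /* byte offset of this entry's 12 KB blob */
    } __attribute__((packed)); /* 4 + 8 + 8 = 20 bytes on disk */

    /* Layout 2: each record carries its 12 KB of data inline. */
    struct inline_record {
        uint32_t entry;
        uint64_t value;
        uint8_t  data[12 * 1024];
    } __attribute__((packed)); /* 4 + 8 + 12288 bytes on disk */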

Cheers

+3  A: 

How "absolutely essential" is seek access? Have you tested your application with a non-optimal solution yet? During that testing, did you benchmark to determine where the real bottlenecks are? If you haven't, you'll be surprised by the results.

Next, try different methods and compare the running times. Test under different system loads (ie, when the system is idle except for your application, and when it is busy).
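As a starting point, a minimal timing sketch along these lines (POSIX clock_gettime; the candidate function is just a placeholder to be replaced with each approach you want to compare) is usually enough:

    #include <stdio.h>
    #include <time.h>

    /* Placeholder workload: replace with the seek-based or read-based variant. */
    static void candidate(void) {
        for (volatile int i = 0; i < 1000000; i++)
            ;
    }

    /* Time one run of fn with a monotonic clock and return seconds. */
    static double elapsed_seconds(void (*fn)(void)) {
        struct timespec start, end;
        clock_gettime(CLOCK_MONOTONIC, &start);
        fn();
        clock_gettime(CLOCK_MONOTONIC, &end);
        return (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
    }

    int main(void) {
        printf("candidate took %.6f s\n", elapsed_seconds(candidate));
        return 0;
    }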

Consider that optimizations based on your current hard drive may become invalid when a new, faster hard drive with different internal optimizations throws your work out the window.

lacqui
No, I haven't tested the program yet; I am still looking into different file structures. Every millisecond counts, and I'm interested in the theoretical maximum. So do you think I need a working test environment to find out? The hard drive may be under load from another process. Thanks
If, as you claim, every millisecond counts, try reading the database into memory. You say it's small (you quote 3M), so that should easily fit into your system memory. However, you still have to determine if the speed is a real or imagined requirement; ie, WHY do you need the speed?
lacqui
Very seldom, and only with pathological configurations, have I seen hardware characteristics be useful for optimizing software performance except in the very short term. And never until after thorough testing. Hardware changes too quickly for that kind of tuning to move up the list of "things to try".
le dorfier
+1  A: 

A sequential read is always faster than one that requires a head seek (not position seek). Typical hard drive perf for sequential read is 50-60 MB/sec, seeking drops that down to a worst-case ~0.4 MB/sec. Once the drive heads are positioned, you essentially get the data in the cylinder for free. The file system cache takes advantage of that by pre-reading sectors from a cylinder.

However, you have no control over the placement of your data on disk cylinders, nor can you guess the drive geometry. Note that throughput can get significantly worse over time as the volume gets fragmented. You'll need to look for performance gains by caching data in memory. At that point, you worry about locality of reference.

Hans Passant
What is the difference between a head seek and a position seek? Within a file, you can't presume that the cylinders will always be adjacent (ext3)? The data is split into 32 MB chunks which are read individually, but the number of chunks means they can't all be cached in memory at once.
@unknown, you are confused between the hard disk's seek mechanism and the system call. In practice, you will probably do better calling seek as you won't incur as much memory overhead from calling reads. But this depends on the specifics of your application.
BobbyShaftoe
@Bobby - You're right, I am confused. So the seek() system call does not always move the head? Only when movement to another cylinder is required?
When a read stream has to change cylinder, does it incur the same penalty as a seek() system call to that same data? Thanks
That's a yes and a yes.
Hans Passant
+4  A: 

All the seek system call does is change the position in the file where the next read will happen. It does not move the drive head. Drive heads move when data is read or written, and you don't have direct control over what the OS will do next.

Reading lots of data you aren't going to need does have an impact, because all data that is read needs space in OS buffers and causes older data to be discarded. So using seek over big files will disturb the filesystem cache less.
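As a sketch, assuming a POSIX file descriptor positioned at the start of an unwanted 12 KB block, the two ways of skipping it could look like this (constants and names are illustrative):

    #include <sys/types.h>
    #include <unistd.h>

    #define DATA_SIZE (12 * 1024)   /* size of one unwanted data block */

    /* 1. Just move the file offset; no data is copied into user space. */
    static int skip_with_seek(int fd) {
        return lseek(fd, DATA_SIZE, SEEK_CUR) == (off_t)-1 ? -1 : 0;
    }

    /* 2. Read the bytes and throw them away; this fills OS and user buffers. */
    static int skip_with_read(int fd) {
        char scratch[4096];
        size_t remaining = DATA_SIZE;
        while (remaining > 0) {
            size_t chunk = remaining < sizeof scratch ? remaining : sizeof scratch;
            ssize_t n = read(fd, scratch, chunk);
            if (n <= 0)
                return -1;          /* error or unexpected end of file */
            remaining -= (size_t)n;
        }
        return 0;
    }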


Everything I write below assumes you cannot fit the whole database in memory. If you can, just do that. Read everything, and try to append new and changed data at the end of the file. Don't worry about wasted space; just do some compacting once in a while.
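A minimal sketch of reading the whole file into one buffer, assuming plain POSIX calls and with most error handling trimmed:

    #include <fcntl.h>
    #include <stdlib.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Read the whole file at `path` into one heap buffer; returns NULL on error. */
    static char *load_whole_file(const char *path, size_t *out_size) {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return NULL;

        struct stat st;
        char *buf = NULL;
        if (fstat(fd, &st) == 0)
            buf = malloc((size_t)st.st_size);

        size_t done = 0;
        while (buf && done < (size_t)st.st_size) {
            ssize_t n = read(fd, buf + done, (size_t)st.st_size - done);
            if (n <= 0) {           /* error or unexpected end of file */
                free(buf);
                buf = NULL;
                break;
            }
            done += (size_t)n;
        }

        close(fd);
        if (buf && out_size)
            *out_size = done;
        return buf;
    }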


If your database is too big:

Data is read from and written to the physical drive in blocks (or pages). Similarly, the basic unit of disk IO in your OS is the page. If the OS caches data from disk, it also does so in whole pages. So worrying about whether you should move forward a few bytes using seek or read makes little sense. If you want to make it fast, you need to take into account how disk IO really works.
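For example, on a POSIX system you can ask for the page size and see which page a given byte offset falls into; skipping forward a few bytes almost always stays inside a page that has been read anyway (illustrative sketch):

    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    int main(void) {
        long page = sysconf(_SC_PAGESIZE);            /* commonly 4096 bytes */
        off_t offset = 12345;                          /* some byte position in the file */
        off_t page_start = offset - (offset % page);   /* page that byte belongs to */
        printf("page size %ld, byte %lld is in the page starting at byte %lld\n",
               page, (long long)offset, (long long)page_start);
        return 0;
    }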

First, as already mentioned by nobugz: locality of reference. If the data you use in each operation is located close together in the file, your OS will need to read or write fewer pages. On the other hand, if your data is spread out, many pages will need to be read or written, which will always be slow.

As to the data structure for the index: typically, indexes are organized as B-trees. That's a data structure designed specifically for efficient searching of large quantities of data stored in paged storage and accessed with paged reads and writes.
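Purely as an illustration (this is not the layout of any particular engine), an on-disk B-tree node is usually sized so that it fits in one page, something like:

    #include <stdint.h>

    #define MAX_KEYS 200   /* chosen so one node fits comfortably in a 4 KB page */

    struct btree_node {
        uint16_t num_keys;               /* how many keys are currently in use */
        uint16_t is_leaf;                /* 1 = leaf node, 0 = internal node */
        uint32_t padding;                /* keeps the arrays 8-byte aligned */
        uint64_t keys[MAX_KEYS];         /* e.g. the 8-byte "value" field as the key */
        uint64_t children[MAX_KEYS + 1]; /* child page numbers, or data offsets in a leaf */
    };
    /* 8 + 200*8 + 201*8 = 3216 bytes, i.e. well inside one 4 KB page. */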

And both strategies for organizing data are used in practice. For example, MS SQL Server by default stores data the first way: data is stored separately, and indexes only contain data from the indexed columns plus the physical addresses of the data rows in the files. But if you define a clustered index, then all data will be stored within that index, and all other indexes will point to the data via the clustered index key instead of a physical address. The first way is simpler, but the second may be much more effective if you often scan ranges of data based on the clustered index.

Tomek Szpakowicz
A: 

You could always map the file into memory and then access it through pointers and such. That should usually make your accesses simpler and faster.
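A minimal sketch of that approach using POSIX mmap, with a hypothetical index.db file name and abbreviated error handling:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("index.db", O_RDONLY);          /* hypothetical file name */
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        /* Map the whole file read-only into the address space. */
        void *base = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }

        /* The file contents can now be read as ordinary memory. */
        const unsigned char *bytes = base;
        printf("first byte: %d\n", bytes[0]);

        munmap(base, (size_t)st.st_size);
        close(fd);
        return 0;
    }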