I want to get a better understanding of how disk reads work for a simple ls command and for a cat * command on a particular folder.

As I understand it, disk reads are the "slowest" operations for a server (or any machine), and a webapp I have in mind will be making ls and cat * calls on a certain folder very frequently.

What are "ball park" estimates of the disk reads involved for an "ls" and for a "cat *" for the following number of entries?

    Entries     Disk reads for ls     Disk reads for cat *
    200
    2,000
    20,000
    200,000

Each file is just a single line of text.

A:

Tricky to answer - which is probably why this question spent so long getting any answer at all.

In part, the answer will depend on the file system - different file systems will give different answers. However, doing 'ls' requires reading the pages that hold the directory entries, plus reading the pages that hold the inodes identified in the directory. How many pages that is - and therefore how many disk reads - depends on the page size and on the directory size. If you think in terms of 6-8 bytes of overhead per file name, you won't be too far off. If the names are about 12 characters each, then you have about 20 bytes per file, and if your pages are 4096 bytes (4KB), then you have about 200 files per directory page.
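
To get a feel for the arithmetic, here is a back-of-the-envelope sketch in C. The 20 bytes per entry and the 4 KB page size are the assumptions from the paragraph above, not properties of any real file system:

    #include <stdio.h>

    int main(void)
    {
        const int bytes_per_entry  = 20;    /* ~12-char name + 6-8 bytes overhead (assumed) */
        const int page_size        = 4096;  /* assumed 4 KB pages */
        const int entries_per_page = page_size / bytes_per_entry;  /* about 204 */

        const int counts[] = { 200, 2000, 20000, 200000 };
        for (int i = 0; i < 4; i++) {
            /* round up: a partly filled page still costs one read */
            int pages = (counts[i] + entries_per_page - 1) / entries_per_page;
            printf("%7d files -> ~%d directory page(s)\n", counts[i], pages);
        }
        return 0;
    }

That gives roughly 1, 10, 99, and 981 directory pages for the four sizes in the question.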

If you just list names and not other attributes with 'ls', you are done. If you list attributes (size, etc.), then the inodes have to be read too. I'm not sure how big a modern inode is; once upon a couple of decades ago, on a primitive file system, it was 64 bytes each, and it may have grown since then. There will be a number of inodes per page, but you can't be sure that the inodes you need are contiguous (adjacent to each other on disk). In the worst case, you might need to read another page for each separate file, but that is pretty unlikely in practice. Fortunately, the kernel is pretty good about caching disk pages, so it is unlikely to have to reread a page. It is impossible for us to make a good guess at the density of the relevant inode entries; it might be, perhaps, 4 inodes per page, but any estimate from 1 to 64 might be plausible. Hence, you might have to read 50 inode pages for a directory containing 200 files.
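
To see why listing attributes drags in the inodes, here is a minimal sketch of the two-phase pattern behind 'ls -l': readdir() walks the directory pages, and stat() touches each entry's inode. The '.' is just a placeholder directory, and error handling is kept to a minimum:

    #include <stdio.h>
    #include <dirent.h>
    #include <sys/stat.h>

    int main(void)
    {
        DIR *dir = opendir(".");                    /* placeholder: current directory */
        if (dir == NULL)
            return 1;

        struct dirent *entry;
        while ((entry = readdir(dir)) != NULL) {    /* phase 1: read directory pages */
            struct stat sb;
            if (stat(entry->d_name, &sb) == 0)      /* phase 2: read the inode (unless cached) */
                printf("%10lld  %s\n", (long long)sb.st_size, entry->d_name);
        }
        closedir(dir);
        return 0;
    }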

When it comes to running 'cat' on the files, the system has to locate the inode for each file, just as with 'ls'; it then has to read the data for the file. Unless the data is stored in the inode itself (I think that is/was possible in some file systems with biggish inodes and small enough file bodies), you have to read at least one page per file - unless partial pages for small files are bunched together on one page (again, I seem to remember that could happen in some file systems).
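
The per-file work implied by 'cat *' looks roughly like this - one inode lookup via open(), then at least one data-page read even for a one-line file. Again a sketch with minimal error handling, not a replacement for cat:

    #include <fcntl.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        char buf[4096];
        for (int i = 1; i < argc; i++) {
            int fd = open(argv[i], O_RDONLY);            /* inode lookup */
            if (fd < 0)
                continue;
            ssize_t n;
            while ((n = read(fd, buf, sizeof buf)) > 0)  /* data page(s) */
                write(STDOUT_FILENO, buf, n);
            close(fd);
        }
        return 0;
    }

Run it as './a.out *' to mimic the shell expanding * for cat.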

So, for a 200-file directory:

  • Plain ls: 1 page (all the directory entries fit in one page)
  • ls -l: 51 pages (1 directory page + 50 inode pages)
  • cat *: 251 pages (those 51, plus one data page per file)
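
Extending that arithmetic to all four directory sizes from the question - still under the same shaky assumptions (about 204 directory entries per 4 KB page, 4 inodes per page, one data page per one-line file):

    #include <stdio.h>

    int main(void)
    {
        const int counts[] = { 200, 2000, 20000, 200000 };
        for (int i = 0; i < 4; i++) {
            int n = counts[i];
            int dir_pages   = (n + 203) / 204;   /* plain ls: directory entries only */
            int inode_pages = (n + 3) / 4;       /* assumed 4 inodes per page */
            int data_pages  = n;                 /* one page per one-line file */
            printf("%7d files: ls ~%d, ls -l ~%d, cat * ~%d pages\n",
                   n, dir_pages, dir_pages + inode_pages,
                   dir_pages + inode_pages + data_pages);
        }
        return 0;
    }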

I'm not sure I'd trust the numbers very far - but you can see the sort of data that is necessary to improve the estimates.

Jonathan Leffler
wow - very nicely explained - and good enough for me! Any inputs on what I can use to determine (a) disk seeks or (b) size of the "pages" you mention above?
JD_ED
@JD_ED: (a) disk seeks - depends on the layout of the pages on disk and the order in which they are read, plus scheduling, plus ... very complex. (b) should be available in a header for the kernel - perhaps <sys/param.h>, or it might depend on the file system in use, in which case, you may have to look in a header specific to the file system you're using.
Jonathan Leffler
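
For (b), one portable starting point is to ask the system at run time rather than hunting through headers. This sketch uses sysconf() for the VM page size and statvfs() for the file system's block size; the two need not match, and '.' is a placeholder path:

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/statvfs.h>

    int main(void)
    {
        printf("VM page size: %ld bytes\n", sysconf(_SC_PAGESIZE));

        struct statvfs vfs;
        if (statvfs(".", &vfs) == 0)    /* placeholder: the file system holding '.' */
            printf("file system block size: %lu bytes\n",
                   (unsigned long)vfs.f_frsize);
        return 0;
    }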