ansaurus

Question

Millions of small graphics files and how to overcome slow file system access on XP

Answer 1

A:

You could try an SSD....

http://www.crucial.com/promo/index.aspx?prog=ssd

UpTheCreek 2009-10-28 16:03:41

Answer 2

A:

For XP try this tip:
How to Turn Off Image Preview Thumbnail and Disable Windows Picture and Fax Viewer in Windows XP

On Windows Server 2008 :
Preview thumbnails in Windows Explorer

lsalamon 2009-10-28 16:04:49

I don't think this is the issue, as his problem persists even when listing files from the command line.

snicker 2009-10-28 16:21:10

really, but my suggestion may help just as a configurable option that aids in performance.

lsalamon 2009-10-28 16:50:03

Answer 3

+1 A:

Use more folders and limit the number of entries in any given folder. The time to enumerate the number of entries in a directory goes up (exponentially? I'm not sure about that) with the number of entries, and if you have millions of small files in the same directory, even doing something like dir folder_with_millions_of_files can take minutes. Switching to another FS or OS will not solve the problem---Linux has the same behavior, last time I checked.

Find a way to group the images into subfolders of no more than a few hundred files each. Make the directory tree as deep as it needs to be in order to support this.

JSBangs 2009-10-28 16:07:39

Thanks for the insight into Linux - I was wondering about that. Part of my pain is caused by the fact that the GMapCreator program only outputs to one folder at a time; I have no way to make it use a folder heirarchy. Your answer got me wondering if I could perhaps setup another process which moves the files out into folders as you suggest.

Elliveny 2009-10-28 16:26:03

Answer 4

A:

The solution is most likely to restrict the number of files per directory.

I had a very similar problem with financial data held in ~200,000 flat files. We solved it by storing the files in directories based on their name. e.g.

gbp97m.xls

was stored in

g/b/p97m.xls

This works fine provided your files are named appropriately (we had a spread of characters to work with). So the resulting tree of directories and files wasn't optimal in terms of distribution, but it worked well enough to reduced each directory to 100s of files and free the disk bottleneck.

Brian Agnew 2009-10-28 16:17:12

Downvoted why ?

Brian Agnew 2009-10-28 16:30:36

Answer 5

+2 A:

There are several things you could/should do

Disable automatic NTFS short file name generation (google it)
Or restrict file names to use 8.3 pattern (e.g. i0000001.jpg, ...)
In any case try making the first six characters of the filename as unique/different as possible
If you use the same folder over and (say adding file, removing file, readding files, ...)
- Use contig to keep the index file of the directory as less fragmented as possible (check this for explanation)
- Especially when removing many files consider using the folder remove trick to reduce the direcotry index file size
As already posted consider splitting up the files in multiple directories.

.e.g. instead of

directory/abc.jpg
directory/acc.jpg
directory/acd.jpg
directory/adc.jpg
directory/aec.jpg

use

directory/b/c/abc.jpg
directory/c/c/acc.jpg
directory/c/d/acd.jpg
directory/d/c/adc.jpg
directory/e/c/aec.jpg

jitter 2009-10-28 16:19:29

Oh my.. I am amazed how much difference that disabling the 'automatic NTFS short file name generation' suggestion has made - a complete world of difference! Thanks for this. My data is all in the UK and so the filenames all begin with the same 9 or 10 letters, based upon the Google maps naming requirements, and so the short filename generation was obviously a HUGE overhead.

Elliveny 2009-10-28 17:22:39

Did you consider to accept the answer?

jitter 2009-11-16 17:13:50

Yes I did - sorry it took me a while to get there! Thanks for your very useful checklist.

Elliveny 2010-06-04 10:22:00

Answer 6

A:

One solution is to implement haystacks. This is what Facebook does for photos, as the meta-data and random-reads required to fetch a file is quite high, and offers no value for a data store.

Haystack presents a generic HTTP-based object store containing needles that map to stored opaque objects. Storing photos as needles in the haystack eliminates the metadata overhead by aggregating hundreds of thousands of images in a single haystack store file. This keeps the metadata overhead very small and allows us to store each needle’s location in the store file in an in-memory index. This allows retrieval of an image’s data in a minimal number of I/O operations, eliminating all unnecessary metadata overhead.

brianegge 2009-11-02 04:36:10

Nice! I implemented something similar to Haystack. Is Haystack Open Source? http://github.com/fictorial/logstore

z8000 2010-01-04 02:17:37

ansaurus

tags:

views:

answers:

Millions of small graphics files and how to overcome slow file system access on XP

related questions