I have millions of audio files, each named after a GUID (http://en.wikipedia.org/wiki/Globally%5FUnique%5FIdentifier). How can I store these files in the file system so that I can add more files efficiently and find a particular file quickly? The layout should also scale in the future.

Each file name is a GUID, so every name is unique.

For example:

[1] 63f4c070-0ab2-102d-adcb-0015f22e2e5c

[2] ba7cd610-f268-102c-b5ac-0013d4a7a2d6

[3] d03cf036-0ab2-102d-adcb-0015f22e2e5c

[4] d3655a36-0ab3-102d-adcb-0015f22e2e5c

Please share your views.

PS: I have already read <http://stackoverflow.com/questions/446358/storing-a-large-number-of-images>. I need a specific data structure, algorithm, or scheme that will also scale in the future.

EDIT1: There are around 1-2 million files, and the file system is ext3 (CentOS).

Thanks,

Naveen

+1  A: 

I would try to keep the number of files in each directory manageable. The easiest way to do this is to name each subdirectory after the first 2-3 characters of the GUID.

cletus
+9  A: 

That's very easy: build a folder tree based on parts of the GUID value.

For example, make 256 folders, one for each possible value of the first byte, and store in each folder only the files whose GUID starts with that byte. If that still leaves too many files in one folder, do the same inside each folder for the second byte of the GUID. Add more levels if needed. Searching for a file will be very fast, because the path can be computed directly from the GUID.

By choosing how many bytes to use for each level, you can tune the tree structure to your scenario.
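
A minimal Python sketch of that idea, in case it helps. The function name `shard_path`, its parameters, and the `/data/audio` root are my own assumptions for illustration, not anything from the question. It lower-cases the directory names, since ext3 is case-sensitive, and keeps the file name as given.

```python
import os

def shard_path(root, guid, levels=2, chars_per_level=2):
    """Build the storage path for a GUID-named file.

    Two hex characters correspond to one byte of the GUID, so the
    defaults give a two-level tree with up to 256 folders per level.
    """
    # Drop the dashes so slicing always yields hex digits; lower-case
    # them so "7A" and "7a" end up in the same folder (an assumption).
    hex_digits = guid.replace("-", "").lower()
    parts = [
        hex_digits[i * chars_per_level:(i + 1) * chars_per_level]
        for i in range(levels)
    ]
    return os.path.join(root, *parts, guid)

# On a POSIX system this prints:
# /data/audio/63/f4/63f4c070-0ab2-102d-adcb-0015f22e2e5c
print(shard_path("/data/audio", "63f4c070-0ab2-102d-adcb-0015f22e2e5c"))
```

Changing `levels` or `chars_per_level` changes the shape of the tree, so files already stored would have to be moved if you change them later.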

sharptooth
If performance is critical, it'd be a good idea to benchmark different numbers of files in each directory.
Mark Bessey
If you have a two-level, 256-ary directory structure (such that file 1 is stored in `63/63f4/63f4c070-...`), then with 2 million files you'll get about 30 in each leaf directory - which should perform quite well and scale moderately well.
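
The arithmetic behind that estimate can be checked with a few lines of Python. This is only a rough sketch: `files_per_leaf` is a made-up helper, and it assumes the leading bytes of the GUIDs are spread roughly evenly, which is worth verifying against real file names.

```python
def files_per_leaf(total_files, levels, fanout=256):
    """Average number of files in each leaf directory of the tree."""
    return total_files / fanout ** levels

for levels in (1, 2, 3):
    print(levels, round(files_per_leaf(2_000_000, levels), 1))
# 1 7812.5
# 2 30.5
# 3 0.1
```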
caf
@Sharptooth: Could you please explain with an example, so that I get a clearer picture?
Naveen
@Naveen: Let's assume you use two levels, one byte for each. For any GUID, you create a folder on the top level and another one inside that folder. So for 7A09BF85-9E98-44ea-9AB5-A13953E88C3D you create the 7A and 7A/09 folders and put the file into the 7A/09 folder. To search for 7A09BF85-9E98-44ea-9AB5-A13953E88C3D, you check whether the file 7A/09/7A09BF85-9E98-44ea-9AB5-A13953E88C3D exists.
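
In Python, the store and lookup steps for this two-level layout could look roughly like the sketch below. The helper names (`target_path`, `store`, `exists`) are hypothetical, and the GUID's case is used as-is, so on a case-sensitive file system you would probably want to normalize it first.

```python
import os
import shutil
import tempfile

def target_path(root, guid):
    # Two levels, one byte (two hex characters) each: 7A/09/<guid>
    return os.path.join(root, guid[0:2], guid[2:4], guid)

def store(root, guid, source_file):
    """Copy source_file into the tree, creating folders as needed."""
    dest = target_path(root, guid)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    shutil.copyfile(source_file, dest)
    return dest

def exists(root, guid):
    """A lookup is a single path check; no directory scan is needed."""
    return os.path.isfile(target_path(root, guid))

if __name__ == "__main__":
    # Small demonstration in a throwaway directory.
    with tempfile.TemporaryDirectory() as root:
        src = os.path.join(root, "incoming.tmp")
        with open(src, "wb") as f:
            f.write(b"audio bytes")
        guid = "7A09BF85-9E98-44ea-9AB5-A13953E88C3D"
        print(store(root, guid, src))
        print(exists(root, guid))
```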
sharptooth
Thank You Sharptooth :-)
Naveen
A: 

Naveen - two questions:

  1. What orders of magnitude are we talking about? When you say "millions" do you mean 1-5 million, 5-50 million, 50-500 million, or 500-999 million? It matters.

  2. What file system? *nix? Win32? Win64? Other?

Rip Rowan
@Rip: There are around 1-2 million files, and the file system is ext3 (CentOS).
Naveen
A: 

Sorting the audio files into separate subdirectories may actually be slower than keeping them in one large directory if dir_index is enabled on the ext3 volume. (dir_index: "Use hashed b-trees to speed up lookups in large directories.")

This command sets the dir_index feature (here on /dev/sda1): `tune2fs -O dir_index /dev/sda1`

sambowry