views:

1176

answers:

12

I have a process that's going to initially generate 3-4 million PDF files, and continue at the rate of 80K/day. They'll be pretty small (50K) each, but what I'm worried about is how to manage the total mass of files I'm generating for easy lookup. Some details:

  1. I'll have some other steps to run once a file has been generated, and there will be a few servers participating, so I'll need to watch for files as they're generated.
  2. Once generated, the files will be available though a lookup process I've written. Essentially, I'll need to pull them based on an order number, which is unique per file.
  3. At any time, an existing order number may be resubmitted, and the generated file will need to overwrite the original copy.

Originally, I had planned to write these files all to a single directory on a NAS, but I realize this might not be a good idea, since there are millions of them and Windows might not handle a million-file-lookup very gracefully. I'm looking for some advice:

  1. Is a single folder okay? The files will never be listed - they'll only be retrieved with a System.IO.File call using a filename I've already determined.
  2. If I do use a single folder, can I watch for new files with a System.IO.FileSystemWatcher, or will it start to become sluggish with that many files?
  3. Should they be stored as BLOBs in a SQL Server database instead? Since I'll need to retrieve them by a reference value, maybe this makes more sense.

Thank you for your thoughts!

+4  A: 

To answer your questions:

  1. I wouldn't store them in a single folder, since chances are that at some point you will want to look at the actual files on the disk rather than go through some other tool.
    Instead, why not store them in separate directories, split into batches of 1000? Possibly using the ID as a key.
  2. That many files will probably flood the FileSystemWatcher, so some will be lost. I've used this in the past, and past a certain point (a few hundred files) I've found it starts to miss some. Possibly use a different directory for incoming files, and then process it every so often. This can then trigger a process to update the original.
  3. I wouldn't store the documents in a database, but I would definitely store metadata in a database.
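The batches-of-1000 idea from point 1 can be sketched like this (Python used for illustration since the arithmetic is the same in any language; the base path and naming are made up, and the asker's real environment is .NET):

```python
import os

def order_path(base, order_no, batch_size=1000):
    """Map an order number to a subdirectory holding at most batch_size files."""
    batch = order_no // batch_size  # e.g. order 123456 falls in batch 123
    return os.path.join(base, str(batch), f"{order_no}.pdf")

# Resubmitting the same order number yields the same path,
# so the regenerated file naturally overwrites the old one.
print(order_path("/nas/pdfs", 123456))  # /nas/pdfs/123/123456.pdf
```

Because the path is computed, lookups never need to list the directory.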
Bravax
#2 was one thing I was afraid of - the files won't be generated too quickly (my generation process will only handle 10-20 at a time, and I expect to get a new file about once/second), but I still thought that the watcher might misbehave at some point.
rwmnau
If you have SQL Server 2008, look into #3 and storing everything in the database, with the file as a FILESTREAM. This way you can ensure the DB never gets out of sync with the file system. http://technet.microsoft.com/en-us/library/bb933993.aspx
sixlettervariables
A: 

Question:

Why do these documents need to be generated and stored as PDFs?

If they can be generated, why not just keep the data in the database and generate them on the fly when required? This means you can search the actual data that's required for searching anyway, and not have the files on disk. This way you can also update the PDF template when required, without needing to regenerate anything.

Mark Redman
They're contracts with electronic signatures, and legal has told us we can't generate them on the fly, so that's out :( Even besides that, we'd like the system to be as responsive as possible, and retrieving a copy is hopefully quicker than making one.
rwmnau
A: 

1) This goes totally contrary to what I typically preach, but you may want to store them in a SQL database since they are truly small files. SQL Server would also allow you to quickly and easily find the files you need without the crazy disk thrashing normally associated with enumerating such a large directory. Also, storing the files in SQL (which I am generally against) would greatly ease the backup/restore process.

2) Store them all in directories and either index them with the Windows Indexing Service (shivers) or create your own index in SQL Server containing the file name and full path. I would suggest storing them in separate directories, with only a few tens of thousands of files each. Perhaps you could use the order year as the folder name?

Regardless of how they're stored - do not scan the directory to find the files. You will definitely need to have an index of some sort.
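That index can be as simple as one table mapping the order number to the file's full path. A minimal sketch (Python, with an in-memory SQLite database standing in for SQL Server; the table, column names, and paths are invented):

```python
import sqlite3

# In-memory SQLite stands in for SQL Server; the schema is hypothetical.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE pdf_index (order_no INTEGER PRIMARY KEY, path TEXT NOT NULL)")

def register(order_no, path):
    # INSERT OR REPLACE covers resubmitted orders: the new path replaces the old row.
    db.execute("INSERT OR REPLACE INTO pdf_index VALUES (?, ?)", (order_no, path))

def lookup(order_no):
    row = db.execute("SELECT path FROM pdf_index WHERE order_no = ?",
                     (order_no,)).fetchone()
    return row[0] if row else None

register(42, r"\\nas\pdfs\2009\42.pdf")
print(lookup(42))
```

Retrieval is then a single indexed primary-key lookup, no directory enumeration involved.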

Hope this helps!

NTDLS
I am going to keep the meta-data in a database for sure - I won't be scanning the folder, since that gave me some shivers too. I just wanted to see if there are any processes that are known to choke with folders that large.
rwmnau
+1  A: 

Determine some logical ordering of subdirectories and store them in blocks of no more than 512 or so files per folder.

Do not store the files in a database. Databases are for data, file servers are for files. Store them on a file server, but store the path and retrieval information in a database.

Beth
I know that in the days of DOS, 512 was a "magic number" for files in directories. Does it still apply?
Philip Kelley
I found this answer - http://stackoverflow.com/questions/671260/tips-for-managing-a-large-number-of-files/671273#671273 - is that what you're advocating?
rwmnau
MS's SQL Server may not be the best database to store files in, however, Oracle's SecureFiles in 11g works as advertised and very well. I'd imagine MS will pick up this sort of tech soon on their end, nullifying the previous mantra of "no files in a database".
sixlettervariables
Ok, turns out I was right...SQL Server 2008 contains FILESTREAM, which I haven't used, but would be a worthwhile approach. You could ensure that the NTFS files and the DB never get out of sync.
sixlettervariables
I don't know that 512 is a limit on your existing system, but you can play with that with a 1-character file (or a million of them) and see what happens. I wasn't advocating keeping the same number of files in each directory, but if all you index them with is a sequential order #, then that's best. Certainly, if you can get all the files in one folder, go with it, but test it first. And don't store files in your database. Store files on a file server. Store paths to files on a database server. But again, YMMV, so test it out.
Beth
FILESTREAM stores files on a file server. Once you get to a point where your application is used frequently, the disconnect between the file server and DB become obnoxious to manage. The performance in Oracle's SecureFiles (and I'm extrapolating that FILESTREAM from MS is the same) is only slightly less than outside the DB. Moreover, managing the record which contains your file manages the file itself. Deleting the record deletes the file atomically. Always a good thing. 2-3 years ago I'd have said "no files in databases," but the data now points the other way.
sixlettervariables
Also, consider using an imaging server. You can have one of FileNet's sales guys come out and talk to you about it. Their business is handling massive amounts of files and images. Probably a more expensive solution than your problem can justify, but it's good to understand the option.
Beth
+3  A: 

I'd group the files in specific subfolders, and try to organize them (the subfolders) in some business-logic way. Perhaps all files made during a given day? During a six-hour period of each day? Or every # of files - I'd say a few thousand max. (There's probably an ideal number out there; hopefully someone will post it.)

Do the files ever age out and get deleted? If so, sort and file by deletable chunk. If not, can I be your hardware vendor?

There's arguments on both sides of storing files in a database.

  • On the one hand you get enhanced security, 'cause it's more awkward to pull the files from the DB; on the other hand, you get potentially poorer performance, 'cause it's more awkward to pull the files from the DB.
  • In the DB, you don't have to worry about how many files per folder, sector, NAS cluster, whatever--that's the DB's problem, and probably they've got a good implementation for this. On the flip side, it'll be harder to manage/review the data, as it'd be a bazillion blobs in a single table, and, well, yuck. (You could partition the table based on the afore-mentioned business-logic, which would make deletion or archiving infinitely easier to perform. That, or maybe partitioned views, since table partitioning has a limit of 1000 partitions.)
  • SQL Server 2008 has the FileStream data type; I don't know much about it, might be worth looking into.

A last point to worry about is keeping the data "aligned". If the DB stores the info on the file along with the path/name to the file, and the file gets moved, you could get totally hosed.
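One way to reduce that risk is to make the file write and the index update a single logical operation, undoing the file write if the index update fails. A rough sketch (Python, with SQLite standing in for the real database; all names here are invented):

```python
import os
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE docs (order_no INTEGER PRIMARY KEY, path TEXT NOT NULL)")

def store_pdf(order_no, data, base):
    """Write the file first, then record it; remove the file if indexing fails."""
    path = os.path.join(base, f"{order_no}.pdf")
    with open(path, "wb") as f:
        f.write(data)
    try:
        with db:  # transaction: commits on success, rolls back on error
            db.execute("INSERT OR REPLACE INTO docs VALUES (?, ?)", (order_no, path))
    except sqlite3.Error:
        os.remove(path)  # don't leave an unindexed orphan on disk
        raise
    return path
```

It's not a true distributed transaction (a crash between the two steps can still desync them), which is exactly the gap FILESTREAM is meant to close.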

Philip Kelley
A: 

My file database contains over 4 million folders, with many files in each folder.

I just tossed all the folders in one directory. NTFS can handle this without any issue, and advanced tools like robocopy can help when you need to move them.

Just make sure you can index the files without a scan. I did this by putting my index in a MySQL database.

So to get a file, I search the MySQL database on some metadata and get an index. Then I use this index to read the file directly. It has scaled well for me so far. But do note that you will be turning everything into random access, and hence random reads/writes. This is poor performance for an HDD, but fortunately an SSD helps a lot.

Also, I wouldn't toss the files into the MySQL database. You won't be able to do network reads without a client that understands MySQL. Right now I can access any file over the network from any program, because I can just use its network URL.

Pyrolistical
+2  A: 

1) A simple folder may be acceptably fast with a separate index, but since it's trivial to put the files in subdirectories (which also gives you the ability to browse them), just do that.
Now you have to figure out your naming convention. Normally I'd suggest a hash to get an even distribution of IDs, but since you're generating so much it probably makes sense to use the values you've already got. If you've got an order number, do you have a timestamp too? If so, just prefix the order number with a timestamp.

Just be aware that if you're using order ids you may experience http://en.wikipedia.org/wiki/Benford%27s_law
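The timestamp-prefix idea might look like this (Python; the exact format is just one plausible choice, not something from the question):

```python
from datetime import datetime

def pdf_name(order_no, ts=None):
    """Prefix the order number with a timestamp so names sort chronologically
    and group naturally by time period rather than by order-ID leading digits."""
    ts = ts or datetime.now()
    return f"{ts:%Y%m%d-%H%M%S}-{order_no}.pdf"

print(pdf_name(123456, datetime(2009, 7, 1, 12, 0, 0)))  # 20090701-120000-123456.pdf
```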

+1 for Benford's law - that's fascinating.
rwmnau
Benford's law, my interpretation: there's more starts than ends. Or, more broadly, more beginnings than endings. Pretty true, I think.
Beth
+2  A: 

You can easily organize files into multiple folders without having to do this by business logic, or order-per-day, which is especially nice if that kind of ordering would be 'clumpy' (many hits in one folder, few in others).

The easiest way to do this is to create a unique hash for the file name, so that maybe you get something like this:

sf394fgr90rtfofrpo98tx.pdf

Then break this up into two-character blocks, and you will get this:

sf/39/4f/gr/90/rt/fo/fr/po/98/tx.pdf

As you can see, it gives you a deep directory tree that you can easily navigate.

With a good hash function, this will be very evenly distributed, and you will never get more than 1296 entries per directory. If you ever get a collision (which should be extremely rare), just add a number to the end: tx.pdf, tx_1.pdf, tx_2.pdf. Again, collisions on such large hashes should be extremely rare, so that the kind of clumping you get because of this are a non-issue.

You said that the documents are digitally signed, so you probably have the hash you need right there in form of the signature string.
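A sketch of the split (Python; SHA-1 is used purely for illustration, and note that with a hex digest each two-character level allows at most 256 entries per directory, rather than the 1296 a base-36 alphabet gives):

```python
import hashlib

def hashed_path(doc_id, levels=5, width=2):
    """Split a hex digest into fixed-width blocks to form an evenly spread tree."""
    digest = hashlib.sha1(doc_id.encode()).hexdigest()
    dirs = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return "/".join(dirs) + "/" + digest[levels * width:] + ".pdf"

# Deterministic: a lookup by order number recomputes exactly the same path,
# so no directory listing is ever needed.
print(hashed_path("order-123456"))
```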

+1  A: 

You need to test it. All of these solutions depend on the underlying filesystem. Some filesystems can handle huge directories, some can't. Some file systems index their directories, some don't (these two points are not necessarily related).

Breaking things up into a tree of directories has a reasonable chance to be performant, simply because, in the end, the individual directories tend to have few overall entries. That works for most any filesystem, simply because even a "stupid" one that is doing a linear directory search for your file can search a couple hundred entries reasonably quickly.

If the filesystem indexes its directories (with, say, a btree, or by simply sorting them internally, which is effectively the same thing in this context), then the directory sizes are less important, although some tools may complain (loading a Windows Explorer window with 4M files, who knows what will happen).

So, I would research your planned operating system and the filesystem options, and test it and see which works best for you.
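A throwaway harness for that kind of test might look like this (Python; not a rigorous benchmark, just a way to compare directory sizes on your actual filesystem and hardware):

```python
import os
import tempfile
import time

def avg_open_time(n_files, probes=100):
    """Create n_files empty files in one directory and time direct-path opens."""
    d = tempfile.mkdtemp()
    for i in range(n_files):
        open(os.path.join(d, f"{i}.pdf"), "wb").close()
    start = time.perf_counter()
    for i in range(probes):
        with open(os.path.join(d, f"{i % n_files}.pdf"), "rb"):
            pass
    return (time.perf_counter() - start) / probes

# Compare, e.g., avg_open_time(1_000) against avg_open_time(100_000)
# on the target NAS to see where your filesystem starts to degrade.
```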

Will Hartung
+1  A: 

Why not consider storing all those files, once converted into PDF, in the DB (as blobs)? Advantages:

  1. I believe you won't have to deal directly with OS I/O, and can leave everything up to the DB.
  2. No need for hashed naming.
  3. Easy to back up and maintain.
ZeroCool
A: 

I think, like so many others have said, you should use sub-folders, but in a way that lets you find the data through code. For example, if datetime works, use that. From reading what you said, it would appear that there is some form of hierarchical structure to the reports (daily, weekly, daily X report, hourly Y report, etc.). I would look at the structure of when and why the reports are generated and build my directories up that way.

mcauthorn
A: 

I stored .bmp files in SQL many years ago as blobs. You then query using a name field that is associated with the blob, and you can index that name field - even with a clustered index, if you don't insert into the middle of the name sequence too often. That was back in the SQL 7 days, but the newer versions of SQL are even more optimized for this kind of thing.

I admit I did not have to store millions of blobs but at a conceptual level I would give that a try. As you have found, NTFS does not handle millions of files very well.

schraj