views: 281
answers: 6

What is the fastest way (algorithm) to generate 500,000 static HTML files from a DB?

And is it good practice to put all these files in a single folder, or to create a hierarchy for them?

We want to handle about 6,000,000 concurrent hits, so static files look like a good solution for that. The source DB will be a simple flat table without JOINs.

We want to generate these files from a single table containing 500k records. The file names will be the first field of the table, and each HTML file will contain a <table> to display the data, about 900 bytes.

A: 

There is a limit on files per directory (at least in Linux, around 32k items), so no, I wouldn't think it's smart to do that.

NTFS has a limit of 4,294,967,295 files per volume.

Ólafur Waage
OK, what if we use NTFS?
Ammroff
That's not "Linux", that's ext3. Linux supports other file systems that don't have that limit (XFS, for example).
andri
I mean using NTFS under Windows 2003 or 2008.
Ammroff
NTFS may have a high theoretical limit, but in practice it bogs way, way down if you get more than a few thousand files in a single directory.
Joe White
Is that a limitation of NTFS or a limitation of Explorer.exe though?
Michael Stum
@Michael Not sure; I found this other link that has more detail: http://technet.microsoft.com/en-us/library/cc781134%28WS.10%29.aspx
Ólafur Waage
+3  A: 

Even if your file system can "cope" with 500,000 files in a single directory, it's unlikely to be able to perform well. Even if it can perform well, it's likely to be hard for humans to manage those files.

I'd definitely put them in a hierarchy.

As for the fastest way to generate them - you've asked for an algorithm, but without stating what you want it to do. There are any number of technologies you might want to use - whichever you're most comfortable with is probably the best bet - and any number of ways of approaching the task, depending on what it really consists of.

Jon Skeet
We want to generate these files from a single table containing 500k records. The file names will be the first field of the table. Each HTML file will contain a <table> to display the data, about 900 bytes.
Ammroff
So what sort of "algorithm" are you thinking of? It seems like a pretty simple batch job to me. Read data, write HTML, read data, write HTML...
Jon Skeet
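
A minimal sketch of that batch loop in Python, assuming a SQLite source; the table and column names (pages, name, col1, col2) are made up, and a tiny in-memory table stands in for the real 500k-row one:

    import html
    import os
    import sqlite3

    # Stand-in for the real flat source table.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pages (name TEXT PRIMARY KEY, col1 TEXT, col2 TEXT)")
    conn.executemany("INSERT INTO pages VALUES (?, ?, ?)",
                     [("onefile", "a", "b"), ("anotherfile", "c", "d")])

    # Read data, write HTML, read data, write HTML...
    for name, *fields in conn.execute("SELECT * FROM pages"):
        cells = "".join(f"<td>{html.escape(str(v))}</td>" for v in fields)
        page = f"<html><body><table><tr>{cells}</tr></table></body></html>"
        # Shard by the first letter of the name so no directory gets huge.
        subdir = os.path.join("out", name[0])
        os.makedirs(subdir, exist_ok=True)
        with open(os.path.join(subdir, name + ".html"), "w") as f:
            f.write(page)

From there it's a straight sequential read-and-write; for 500k files of ~900 bytes each, the disk writes, not the query, are likely to dominate.
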
+1  A: 

A hierarchy would be best for performance, because many applications (Windows Explorer, for instance) will loop through all the files in a single directory, and that will make those applications slow.

The fastest way to extract them would be to write a small C program using the database's C client headers and fwrite(), etc.

F.Y.I.

NTFS can hold 4,294,967,295 files: http://en.wikipedia.org/wiki/NTFS
ext3 can hold VolumeByteSize/2^13 files: http://en.wikipedia.org/wiki/Ext3#cite_note-0

Robert
+1  A: 

Why not just store the generated HTML in the database? It seems like you'll effectively be treating the file system as a database anyway - At least if you store the HTML in a database you can rely on the DBMS to optimise lookup performance (e.g. by caching recently queried HTML) and you can add indices and analyse query performance. Otherwise you'll just be hammering the file system instead; i.e. moving the problem elsewhere.

Also, I would suggest taking a step back and seeing where the bottleneck currently lies. Storing HTML (presentation-layer data) is not an elegant solution - if the real problem is query performance, perhaps consider introducing denormalised tables into your schema containing intermediate results, from which you can quickly generate HTML.
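
A rough sketch of the store-it-in-the-database idea in Python (SQLite stands in for whatever DBMS is in use, and all names here are hypothetical):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    # The PRIMARY KEY gives an index, so fetching a page is a single
    # indexed lookup instead of a file-system traversal.
    conn.execute("CREATE TABLE rendered (name TEXT PRIMARY KEY, html TEXT NOT NULL)")

    # In reality these rows would be generated from the 500k-record table.
    conn.executemany("INSERT INTO rendered VALUES (?, ?)",
                     [("onefile", "<table><tr><td>data</td></tr></table>")])
    conn.commit()

    # Serving a page:
    page = conn.execute("SELECT html FROM rendered WHERE name = ?",
                        ("onefile",)).fetchone()[0]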

Adamski
Aren't most filesystem drivers tuned for caching files? When all you have is an RDBMS, all your problems look like a query. http://google.com/search?q=filesystem+tuning
Roger Pate
+1  A: 

If I were to do this, I'd store the generated files in a hierarchy based on the file name (IFF the file names are sufficiently well distributed), so "onefile.html" gets stored as "o/n/e/onefile.html" and "anotherfile.html" as "a/n/o/anotherfile.html". Three levels of storage isn't necessarily enough; you may require four. Also, chunking the path names one character at a time may not give the best distribution; you may be better off using two or three characters per level, depending on how your distribution looks.
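
A small sketch of that path derivation in Python; the function name is made up, and depth/width are the tuning knobs described above:

    import os

    def shard_path(filename, depth=3, width=1):
        """Derive nested directories from the file name itself, e.g.
        shard_path("onefile.html") -> "o/n/e/onefile.html"."""
        stem = filename.split(".", 1)[0]
        levels = [stem[i * width:(i + 1) * width] for i in range(depth)]
        return os.path.join(*levels, filename)

    print(shard_path("onefile.html"))               # o/n/e/onefile.html
    print(shard_path("anotherfile.html", width=2))  # an/ot/he/anotherfile.html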

I've used similar storage schemes for received faxes for an electronic fax service in the past (using longer and longer prefixes of the destination fax number as pathname components).

I guess the reason you're looking at generating the flat files is to amortise the cost of generating the HTML?

Vatine
A: 

500,000 entries, each approximately 1 KB in size? So we are talking about 500 MB of data. If possible I would simply put the whole thing on a ramdisk if you need filesystem capabilities, or keep it in memory as an ordered structure (hashtable, array of some sort) if you don't. Is there a specific reason why you don't store the results in a temporary database table (SQLite)?
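
For the keep-it-in-memory variant, a sketch with hypothetical names; at ~500 MB the whole data set fits in a Python dict, so each request becomes an O(1) lookup with no disk I/O:

    import sqlite3

    # Stand-in for the real source table.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pages (name TEXT PRIMARY KEY, html TEXT)")
    conn.executemany("INSERT INTO pages VALUES (?, ?)",
                     [("onefile", "<table><tr><td>data</td></tr></table>")])

    # Load everything once at startup, then serve from memory.
    cache = dict(conn.execute("SELECT name, html FROM pages"))
    page = cache.get("onefile")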

merkuro
I think the file system can handle a huge number of concurrent users, about 6,000,000 hits, so the hashtable idea is great. But what if we let users request just the files they want, without any processing overhead? The solution would be based on using JavaScript to request these files from user input (without any server-side code) to maintain speed, performance and reliability.
Ammroff
If you have enough RAM, the system will cache the files in it anyway, so using a ramdisk won't be any faster. It will just mean that if you have to reboot for some reason, you have to regenerate the data, which might mean serious downtime.
rjmunro