My web application stores product information in XML files on disk, on the web server. This works fine for a few products, but I worry that large numbers of files may cause problems.

So let's say I'm going to have 20,000 products; that would mean 20,000 XML files inside a directory. I'm not familiar with web server disk storage infrastructure: would so many files cause problems such as a significant drop in access speed and/or excessive disk fragmentation? Do storage servers even fragment? Is fragmentation an issue I need to worry about on servers?

I would prefer to keep my XML files individual, because then I can serve them directly as static content over HTTP, which gives me much faster access and caching. The alternative would be to create one big binary data file, store each product's XML inside it, and use a server-side script to extract individual XMLs from that big data file. (Yes, I know I could just save them in a database, but that is not the case I'm interested in.)
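
To make the alternative concrete, here's a rough sketch of the pack-file idea; the function names and the serialized-array index format are just illustrative choices, not an established approach:

    <?php
    // Sketch: concatenate all product XML files into one data file and keep
    // a small index of byte offsets and lengths for each product id.
    function pack_products(array $xmlFiles, $dataFile, $indexFile)
    {
        $index = array();
        $out = fopen($dataFile, 'wb');
        foreach ($xmlFiles as $id => $path) {
            $xml = file_get_contents($path);
            $index[$id] = array(ftell($out), strlen($xml)); // offset, length
            fwrite($out, $xml);
        }
        fclose($out);
        file_put_contents($indexFile, serialize($index));
    }

    function extract_product($id, $dataFile, $indexFile)
    {
        $index = unserialize(file_get_contents($indexFile));
        if (!isset($index[$id])) {
            return null;
        }
        list($offset, $length) = $index[$id];
        $in = fopen($dataFile, 'rb');
        fseek($in, $offset);
        $xml = fread($in, $length);
        fclose($in);
        return $xml;
    }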

A: 

What size are the files? How many hits/second? What's the relative popularity of each file? How many disks? How much RAM? Are you using RAID?

Basically - it depends.

nfm
This is not even an attempt at an answer; try giving explanations for different scenarios.
Russ Bradberry
My point is that there _is_ no answer: this is a hugely subjective question, dependent on so many factors that you won't get meaningful results without experimenting with several approaches.
nfm
A: 

Take a look at the Berkeley DB XML database system. You can keep your native XML while gaining all the ACID benefits of a DB.

Keep in mind that disk I/O is going to be among the most expensive operations.

Link: http://www.oracle.com/database/berkeley-db/xml/index.html

Shaun
Thanks for the suggestion. This is pretty much what I was referring to when I said storing the files in one big file. The Oracle solution is not really useful for me, though; my code must not have any server requirements beyond a standard PHP install. I would be interested if there's any pure-PHP solution that can achieve similar goals to the Oracle XML DB. Otherwise I'll probably have to write my own lame version.
A: 

It's a good idea to limit the number of files or directories in any particular directory.

One strategy, if you've got a unique identifier for each XML file, is to create a folder structure based on that identifier.

e.g.

product 000123 is stored in:

products\00\01\23\product.xml

and product 019384 is stored in:

products\01\93\84\product.xml

That'll reduce the number of items in any particular folder to at most 100, which is fairly reasonable.
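
A small PHP helper along these lines could derive the path; the function name and the six-digit padding are just illustrative (and I've used forward slashes where the examples above use backslashes):

    <?php
    // Build the sharded path for a product id,
    // e.g. 123 -> products/00/01/23/product.xml
    function product_path($id, $baseDir = 'products')
    {
        $padded = str_pad((string) $id, 6, '0', STR_PAD_LEFT); // '000123'
        $parts  = str_split($padded, 2);                       // '00', '01', '23'
        return $baseDir . '/' . implode('/', $parts) . '/product.xml';
    }

    echo product_path(123);   // products/00/01/23/product.xml
    echo product_path(19384); // products/01/93/84/product.xml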

Hope that helps.

davewasthere
A: 

If you get to the point where you have that many products, then I highly recommend using a database system of some sort. If your main concern is caching, there are plenty of caching methods out there that will provide static-like performance for database-driven systems. Plus, if your company is at the point where it has 20,000 products to manage, then managing a database table is the least of its worries ;)
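
As a rough sketch of that kind of caching (the cache location and the ten-minute TTL here are arbitrary placeholders), a database-driven page can be written out to a static file and served from there until it expires:

    <?php
    // Sketch: write-through file caching in front of a database lookup.
    // sys_get_temp_dir() and the 600-second TTL are arbitrary choices.
    function cached_product_xml($id, $loadFromDb, $ttl = 600)
    {
        $cacheFile = sys_get_temp_dir() . '/product_' . (int) $id . '.xml';

        if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $ttl) {
            return file_get_contents($cacheFile); // cache hit: static-file speed
        }

        $xml = $loadFromDb($id); // cache miss: query the database
        file_put_contents($cacheFile, $xml);
        return $xml;
    }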

Russ Bradberry
I'm starting to see that the best solution would be to just use a database from the start, since caching is not really a major issue.
Yeah, and you should probably do it while you have a fairly small number of products; that way your import overhead remains small.
Russ Bradberry
A: 

A database is the way to go. If you don't want external dependencies, you could go with SQLite. It is built into PHP and enabled by default in current versions of PHP.

The underlying datastore is typically a single file.
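
For instance, with PDO's SQLite driver (the database file name and table schema here are just for illustration):

    <?php
    // Minimal SQLite usage through PDO; products.db and the table layout
    // are illustrative only.
    $db = new PDO('sqlite:products.db');
    $db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    $db->exec('CREATE TABLE IF NOT EXISTS products (id INTEGER PRIMARY KEY, xml TEXT)');

    // Store one product's XML
    $stmt = $db->prepare('INSERT OR REPLACE INTO products (id, xml) VALUES (?, ?)');
    $stmt->execute(array(123, '<product><name>Example</name></product>'));

    // Read it back
    $stmt = $db->prepare('SELECT xml FROM products WHERE id = ?');
    $stmt->execute(array(123));
    echo $stmt->fetchColumn();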

Craig