If I have a static database consisting of folders and files, would access and manipulation be faster than SQL server type databases, considering this would be used in a CGI script?

When working with files and folders, what are the tricks to better performance?

+5  A: 

Depends on what your information is and what your access patterns and scale are. Two of the biggest benefits of a relational database are:

  1. Caching. Unless you're very clever, you can't write a cache as good as that of a DB server.

  2. Optimizer.

However, for certain specialized applications, neither of these two benefits materializes when compared to a files+folders data store, so the answer is a resounding "depends".

As for files/folders, the tricks are:

  • Cache the contents of frequently requested files
  • Have small directories (files in deeply nested small directories are much faster to access than in a flatter structure, due to the time it takes to read the contents of a big directory).
  • There are other, more advanced optimizations (slicing data across disks, placing it in different areas of a disk or on different partitions, etc.) - but if you need THAT level, you are better off with a database in the first place.
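
The first two tricks can be sketched roughly as follows. This is only an illustration: the helper names `sharded_path` and `read_cached`, the MD5-based layout, and the two-level nesting are assumptions made for the example, not something prescribed above.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use File::Spec;

# Map a record key to root/ab/cd/<digest> so no single directory gets huge.
# The two-level, two-hex-character layout is an arbitrary illustrative choice.
sub sharded_path {
    my ($root, $key) = @_;
    my $digest = md5_hex($key);
    return File::Spec->catfile(
        $root, substr($digest, 0, 2), substr($digest, 2, 2), $digest,
    );
}

# Per-process cache of file contents. Under plain CGI every request is a new
# process, so this only pays off for repeated reads within one request.
my %cache;

sub read_cached {
    my ($path) = @_;
    return $cache{$path} //= do {
        open my $fh, '<', $path or die "Cannot open $path: $!";
        local $/;    # slurp mode: read the whole file at once
        <$fh>;
    };
}
```

Because the path is computed directly from the key, a lookup never has to list a directory, and each directory level holds at most 256 subdirectories.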
DVK
+1 However, there are some CGI-specific issues to address here.
Sinan Ünür
I have to disagree with a lot of what you've written: 1) Caching on a DB Server has to be generic. If you write your own given specific application knowledge - you should be able to thump it hands down. 2) Optimiser - Again the optimiser has to be generic; with specific application knowledge you can code significantly more efficient access paths, you could also utilise structures not available within typical RDBMS indexing options. 3) Large directories are only slower if you have to 'search' for files; if you have full path to a file, you won't need to "read the contents of a big directory".
Craig Young
@Sinan - OK, color me needing coffee jolt. What do you allude to by "CGI-specific" issues as far as DB vs files?
DVK
@Craig - I don't know what his usage patterns are, or even what his data is. So your points may or may not be valid - it depends. But does your custom file structure know about placing the most used data in faster areas of the disk? Are you an expert in writing good caches? This is why I said "it depends" - without knowing the details of his app, I'm not prepared to judge one way or the other how easy it is to write a custom file-based structure for his needs that will beat the DB.
DVK
The real benefit of DB over files comes down to: I need an index, so do I write: a) "CREATE INDEX xxx ON Table(Col1, Col2)" or b) Define the index structure in code, write code to read underlying data and populate the index structures, write code that decides when and how to access the indexes created, write code to deal with error scenarios such as the index wasn't created or has been deleted.
Craig Young
@DVK: It doesn't matter whether you know what his usage patterns are. The point is that **he** knows (or should), and his application specific knowledge makes it possible to write better caching. You don't have to be "very clever" to write good caching when you have 'home-field advantage'. The question pertained to general performance difference: A poorly implemented DB will perform just as badly as a poorly implemented file based solution. But a well implemented file based solution will outperform a well implemented DB solution (but at a much higher cost in dev time).
Craig Young
@DVK CGI scripts incur a significant start-up penalty that partially depends on the number of Perl modules and C libraries that are loaded in. If the location in the filesystem of the information queried or to be stored can be very easily deduced from CGI parameters, it is going to be hard to beat a filesystem based approach.
Sinan Ünür
@Sinan - I am spoiled by mod_perl - it eliminates the library load and compilation penalty for the startup speed.
DVK
...Oh, and if your database access layer libraries are smart enough to use a common pool of DB connections, you will also avoid the cost of opening a DB connection to boot (our company's libraries do that).
DVK
@DVK I know the advantages of `mod_perl`. However, the question specifically states a **`CGI`** environment.
Sinan Ünür
A: 

It depends on the profile of the data and what logic you are going to be using to access it. If you simply need to save and fetch named nodes then a filesystem-based database may be faster and more efficient. (You could also have a look at Berkeley DB for that purpose.) If you need to do index-based searches, and especially if you need to join different sets of data based on keys, then an SQL database is your best bet.

I would just go with whatever solution seems the most natural for your application.
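
For the "save and fetch named nodes" case, a minimal sketch with Berkeley DB might look like this. It assumes the `DB_File` module is available in your Perl build; the helper name `open_store`, the file name, and the key layout are made up for illustration.

```perl
use strict;
use warnings;
use DB_File;                       # Perl interface to Berkeley DB 1.x
use Fcntl qw(O_RDWR O_CREAT);
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical helper: open (or create) a Berkeley DB hash file and return
# a reference to a hash tied to it.
sub open_store {
    my ($file) = @_;
    my %store;
    tie %store, 'DB_File', $file, O_RDWR | O_CREAT, 0644, $DB_HASH
        or die "Cannot open $file: $!";
    return \%store;
}

# A temporary directory keeps the demo self-contained; a real script would
# use a fixed path instead.
my $file  = File::Spec->catfile(tempdir(CLEANUP => 1), 'nodes.db');
my $store = open_store($file);
$store->{'books/salinger'} = 'The Catcher in the Rye';   # save a named node
print "$store->{'books/salinger'}\n";                     # fetch it by name
untie %$store;                                            # flush and close
```

There is no server to connect to, so the per-request cost in a CGI script is essentially one file open.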

Nate C-K
+4  A: 

As a general rule, databases are slower than files.

If you require indexing of your files, a hard-coded access path on customised indexing structures will always have the potential to be faster if you do it correctly.

But 'performance' is not the goal when choosing a database over a file-based solution.

You should ask yourself whether your system needs any of the benefits that a database would provide. If so, then the small performance overhead is quite acceptable.

So:

  1. Do you need to deal with multiple users and concurrent updates? (Well; you did say it's static.)
  2. Do you need flexibility in order to easily query the data from a variety of angles?
  3. Do you have multiple users, and could gain from making use of an existing security model?

Basically, the question is more about which would be easier to develop. The performance difference between the two is not worth wasting dev time on.

Craig Young
+1 for considering the security angle which I forgot about
DVK
I would add that the performance benefit only exists if you know what you're doing. Creating a good and fast indexing scheme is not easy. Databases have had several years to fine tune their algorithms even if they are data generic. Most people I've known who try to beat a database with flat files fail at it. But there are some who succeed for the rare case that you need it.
mpeters
+1  A: 

As others have pointed out: it depends!

If you really need to find out which is going to be more performant for your purposes, you may want to generate some sample data to store in each format and then run some benchmarks. The Benchmark.pm module comes with Perl, and makes it fairly simple to do a side-by-side comparison with something like this:

use strict;
use warnings;
use Benchmark qw(:all);

my $count = 1000;  # Some large-ish number of trials is recommended.

# Run each sub $count times and print a side-by-side comparison table.
cmpthese($count, {
    'File System' => sub {
        # ...your filesystem code...
    },
    'Database'    => sub {
        # ...your database code...
    },
});

You can type perldoc Benchmark to get more complete documentation.

John Hyland
A: 

As others have said, it depends: on the size and nature of the data and the operations you're planning to run on it.

Particularly for a CGI script, you're going to incur a performance hit for connecting to a database server on every page view. However, if you create a naive file-based approach, you could easily create worse performance problems ;-)

As well as a Berkeley DB File solution you could also consider using SQLite. This creates a SQL interface to a database stored in a local file. You can access it with DBI and SQL but there's no server, configuration or network protocol. This could allow easier migration if a database server is necessary in the future (example: if you decide to have multiple front-end servers, but need to share state).

Without knowing any details, I'd suggest using a SQLite/DBI solution then reviewing the performance. This will give flexibility with a reasonably simple start up and decent performance.
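
A minimal sketch of what that SQLite/DBI starting point could look like is below. It assumes `DBI` and `DBD::SQLite` are installed from CPAN; the `open_db` helper and the `pages` table with its columns are hypothetical names chosen for the example.

```perl
use strict;
use warnings;
use DBI;    # assumes DBI and DBD::SQLite are installed from CPAN

# Hypothetical helper: open a SQLite database and make sure the table exists.
sub open_db {
    my ($file) = @_;
    my $dbh = DBI->connect("dbi:SQLite:dbname=$file", '', '',
        { RaiseError => 1, AutoCommit => 1 });
    $dbh->do('CREATE TABLE IF NOT EXISTS pages (name TEXT PRIMARY KEY, body TEXT)');
    return $dbh;
}

# ':memory:' keeps the demo self-contained; a CGI script would pass a real
# file path so the data persists between requests.
my $dbh = open_db(':memory:');
$dbh->do('INSERT OR REPLACE INTO pages (name, body) VALUES (?, ?)',
    undef, 'home', 'Welcome');
my ($body) = $dbh->selectrow_array(
    'SELECT body FROM pages WHERE name = ?', undef, 'home');
print "$body\n";
```

If a database server becomes necessary later, most of the change would be in the connection string, although any SQLite-specific SQL (such as `INSERT OR REPLACE`) would need adjusting.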

FalseVinylShrub
+12  A: 

I'll add to the "it depends" crowd.

This is the kind of question that has no generic answer but is heavily dependent on the situation at hand. I even recently moved some data from a SQL database to a flat file system because the overhead of the DB, combined with some DB connection reliability issues, made using flat files a better choice.

Some questions I would ask myself when making the choice include:

  1. How am I consuming the data? For example will I just be reading from the beginning to the end rows in the order entered? Or will I be searching for rows that match multiple criteria?

  2. How often will I be accessing the data during one program execution? Will I go once to get all books with Salinger as the author or will I go several times to get several different authors? Will I go more than once for several different criteria?

  3. How will I be adding data? Can I just append a row to the end and that's perfect for my retrieval or will it need to be resorted?

  4. How logical will the code look in six months? I emphasize this because I think this is too often forgotten in designing things (not just code; this hobby horse is actually from my days as a Navy mechanic cursing mechanical engineers). In six months, when I have to maintain your code (or you do, after working on another project), which way of storing and retrieving data will make more sense? If going from flat files to a DB results in a 1% efficiency improvement but adds a week of figuring things out when you have to update the code, have you really improved things?

HerbN
Excellent questions! (In your answer ^^)
Craig Young
Amen. Very good. +1.
DVK
Nice answer, HerbN. Emphasize #4, please. DBI calls will look familiar a year from now. A custom wheel, not so much.
converter42
+1  A: 

To quickly access files, depending on what you are doing, an mmap can be very handy. I just wrote about this in the Effective Perl blog as "Memory-map files instead of slurping them".
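
As a rough illustration only, and not necessarily the approach taken in that blog post, one way to read a file through mmap without extra modules is PerlIO's built-in `:mmap` layer. The `count_matches` helper below is a hypothetical name for the example.

```perl
use strict;
use warnings;

# Count the lines that match a pattern, reading through PerlIO's :mmap layer
# so the file is mapped into memory rather than copied through read buffers.
sub count_matches {
    my ($path, $pattern) = @_;
    open my $fh, '<:mmap', $path or die "Cannot open $path: $!";
    my $count = 0;
    while (my $line = <$fh>) {
        $count++ if $line =~ $pattern;
    }
    close $fh;
    return $count;
}
```

A call such as `count_matches('books.txt', qr/Salinger/)` would scan the file line by line. Whether this beats ordinary buffered reads depends on the platform and the file size, so it is worth benchmarking.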

However, I expect that a database server would be much faster. It's difficult to say what would be faster for you when we have no idea what you are doing, what sort of data you need to access, and so on.

brian d foy
A: 

From my little bit of experience, server-based databases (even those served on the local machine) tend to have very slow throughput compared to local filesystems. However, this depends on several things, one of which is asymptotic complexity. When you compare scanning a big list of files against using a database with an index to look up an item, the database wins.

My little bit of experience is with PostgreSQL. I had a table with three million rows, and I went to update a mere 8,000 records. It took 8 seconds.

As for the quote "Premature optimization is the root of all evil.", I would take that with a grain of salt. If you write your application using a database, then find it to be slow, it might take a tremendous amount of time to switch to a filesystem-based approach or something else (e.g. SQLite). I would say your best bet is to create a very simple prototype of your workload, and test it with both approaches. I believe it is important to know which is faster in this case.

Joey Adams