views:

382

answers:

8

The company I work for is trying to switch a product that uses flat file format to a database format. We're handling pretty big files of data (ie: 25GB/file) and they get updated really quick. We need to run queries that randomly access the data, as well as in a contiguous way. I am trying to convince them of the advantages of using a database, but some of my colleagues seem reluctant to this. So I was wondering if you guys can help me out here with some reasons or links to posts of why we should use databases, or at least clarify why flat files are better (if they are).

A: 

SQL ad hoc query abilities are enough of a reason for me. With a good schema and indexing on the tables, this is fast and effective and will have good performance.

Oded
+8  A: 
  1. Databases can handle querying tasks, so you don't have to walk over file manually. Databases can handle very complicated queries.
  2. Databases can handle indexing tasks, so if tasks like get record with id = x can be VERY fast
  3. Databases can handle multiprocess/multithread access.
  4. Databases can handle access from network
  5. Databases can watch for data integrity
  6. Databases can update data easily (see 1) )
  7. Databases are reliable
  8. Databases can handle transactions and concurrent access
  9. Databases + ORMs let you manipulate data in very programmer friendly.
Andrey
+1  A: 

Don't build it if you can buy it.

I heard this quote recently, and it really seems fitting as a guide line. Ask yourself this... How much time was spent working on the file handling portion of your app? I suspect a fair amount of time was spent optimizing this code for performance. If you had been using a relational database all along, you would have spent considerably less time handling this portion of your application. You would have had more time for the true "business" aspect of your app.

G Mastros
Actually, the whole application are just a couple of weird bash scripts ... the whole system is a one man show moving files around. Sad, I know ...
hyperboreean
Cool, but last I checked the best databases are free.
Rook
Alas, the converse is equally true. A better saying is "Buy good solutions that are tailored to your needs if they exist, otherwise build it"
DA
+1  A: 

Databases all the way.

However, if you still have a need for storing files, don't have the capacity to take on a new RDBMS (like Oracle, SQLServer, etc), than look into XML.

XML is a structure file format which offers you the ability to store things as a file but give you query power over the file and data within it. XML Files are easier to read than flat files and can be easily transformed applying an XSLT for even better human-readability. XML is also a great way to transport data around if you must.

I strongly suggest a DB, but if you can't go that route, XML is an ok second.

Scott Root
But Oracle and SQL Server cost money, why pay for something when its better for free? MySQL all the way.
Rook
If they have a 25gb CSV file, this could easily double in size (if not more so) with XML tags for rows and columns. Just saying significant bloat is a consideration when moving from flat files to XML.
Binary Worrier
@Scott Root: I personally tend to dislike XML because I see it as a heavy method of passing data around.
hyperboreean
Instead of Oracle or SQL Server, you could also use PostgreSQL. Very powerfull and XML and csv are also possible as in- output. Plain XML will be very slow, way too much overhead.
Frank Heikens
A: 

What about a non-relational (NoSQL) database such as Amazon's SimpleDB, Tokio Cabinet, etc? I've heard that Google, Facebook, LinkedIn are using these to store their huge datasets.

Can you tell us if your data is structured, if your schema is fixed, if you need easy replicability, if access times are important, etc?

Victor P
We're looking into that as well ... first we need to make sure that we're all on the same page. Though, if you need to run some complex reports, I am not sure how nosql handles this.
hyperboreean
A: 

what types of files is not mentioned. If they're media files, go ahead with flat files. u probably just need a DB for tags and some way to associate the "external BLOBs" to the records in the DB. but if full text search is something u need, there's no other way to go but migrate to a full DB.

another thing, your filesystem might provide the ceiling as far as number of physical files are concerned.

stillstanding
A: 

They're faster; unless you're loading the entire flat file into memory, a database will allow faster access in almost all cases.

They're safer; databases are easier to safely backup; they have mechanisms to check for file corruption, which flat files do not. Once corruption in your flat file migrates to your backups, you're done, and you might not even know it yet.

They have more features; databases can allow many users to read/write at the same time.

They're much less complex to work with, once they're setup.

Dean J
+3  A: 

This is an answer I've already given some time ago:

It depends entirely on the domain-specific application needs. A lot of times direct text file/binary files access can be extremely fast, efficient, as well as providing you all the file access capabilities of your OS's file system.

Furthermore, your programming language most likely already has a built-in module (or is easy to make one) for specific parsing.

If what you need is many appends (INSERTS?) and sequential/few access little/no concurrency, files are the way to go.

On the other hand, when your requirements for concurrency, non-sequential reading/writing, atomicity, atomic permissions, your data is relational by the nature etc., you will be better off with a relational or OO database.

There is a lot that can be accomplished with SQLite3, which is extremely light (under 300kb), ACID compliant, written in C/C++, and highly ubiquitous (if it isn't already included in your programming language -for example Python-, there is surely one available). It can be useful even on db files as big as 1GB, possible more.

If your requirements where bigger, there wouldn't even be a discussion, go for a full-blown RDBMS.

As you say in a comment that "the system" is merely a bunch of scripts, then you should take a look at pgbash.

voyager