I am looking for a book or other undeniably authoritative source that gives reasons for when you should choose a database vs. when you should choose other storage methods - most notably, a hardcoded table of values in a program.


The rest of this message is just supporting info.

Below are some unofficial lists of reasons to use one or the other, which I have gleaned from other Stack Overflow questions, other internet sources, my own experience, etc. But none are as authoritative as a book. For those wondering, I'm trying to strengthen a case I have against a certain implementation that works fine for a mainframe platform in our company; but we share their code, and their use of a database for searching a table (which could easily fit in a program's literal storage and quickly be binary-searched) is causing us hundreds of hours of extra setup and troubleshooting effort on the Windows platform. (It's a big system, deployed to distributed users, and this pattern appears many times within it.)
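
To make the pattern concrete, here is a minimal sketch (in Python rather than the COBOL in question, with invented keys, values, and names) of what "a hardcoded table of values, binary-searched in the program" means:

    import bisect

    # Minimal illustration of a hardcoded, sorted lookup table searched inside
    # the program itself. The keys, values, and names below are invented for
    # this example; the real data lives in a COBOL module.
    RATE_KEYS = (100, 105, 110, 120, 135, 150)                    # sorted keys
    RATE_VALUES = (0.0125, 0.0150, 0.0175, 0.0210, 0.0260, 0.0320)

    def lookup_rate(key):
        """Binary-search the hardcoded table; return None if the key is absent."""
        i = bisect.bisect_left(RATE_KEYS, key)
        if i < len(RATE_KEYS) and RATE_KEYS[i] == key:
            return RATE_VALUES[i]
        return None

    print(lookup_rate(120))   # 0.021
    print(lookup_rate(999))   # None

No database connection, no network hop, and no client-side setup is involved, which is exactly the trade-off being argued here.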

WHY USE A DATABASE?

if you need...

  • Random Read / Transparent search optimization <= this is the only 'database benefit' this system uses
  • Advanced / varied / customizable Searching and sorting capabilities
  • Transaction/rollback
  • Locks, semaphores
  • Concurrency control / Shared users
  • Security
  • 1-many/m-m is easier
  • Easy modification
  • Scalability
  • Load Balancing
  • Random updates / inserts / deletes
  • Advanced query
  • Administrative control of design, etc.
  • SQL / learning curve
  • Debugging / Logging
  • Centralized / Live Backup capabilities
  • Cached queries / develop & cache execution plans
  • Interleaved update/read
  • Referential integrity, avoid redundant/missing/corrupt/out-of-sync data
  • Reporting (from an OLAP or OLTP db) / turnkey generation tools
    [Disadvantages:]
  • Important to get right the first time - professional design - but only b/c it's meant to last
  • s/w & h/w cost
  • Usually accessed over a network, so speed is an issue (even with the best design, or a local install, a separate process still requires marshalling / network layers / inter-process communication)
  • Indexes and query processing can stand in the way of simple processing (vs. a flat file)

WHY USE A FLAT FILE?

If you only need...

  • Sequential Row processing only
  • Limited usage append only (no reading, no master key/update)
  • Only Update the record you're reading (fixed length recs only)
  • Too big to fit into memory
  • If Local disk / read-ahead network connection
  • Portability / small system
  • Email / cut & paste / storage as a document by a novice - simple format
  • Low design learning curve but high cost later

WHY USE IN-MEMORY/TABLE

(tables, arrays, etc.):
Pretty much all of these apply to our project's usage of the data.
If you need...

  • Processing a single db/ff record that was imported
  • Known size of data
  • Static data if hardcoding the table
  • Narrow, unchanging use (e.g., one program or proc)
    • includes a class that will be shared, but encapsulates its data manipulation
  • Extreme speed needed / high transaction frequency
  • Random access - but search is dependent on implementation

I'm not looking for other suggestions on how to make this work, e.g., SOA, WCF, this-or-that database; a table hardcoded into the program is a clear slam-dunk solution. Unfortunately none of the decision makers on this matter in my organization are in a position to see that. I'm just looking for good, authoritative documentation so I don't have to spend days and days writing up the reasons, documenting support calls, etc.

Although I am also kind of looking for items for these lists that I might have missed ;-)

+2  A: 

The question is biased: it starts from the idea that a database is something different from a flat file manager. Database Management Systems are there to manage files, all kinds of files.

For example, using Oracle Database Server you can do all the things you list under "FLAT FILE" or "IN-MEMORY/TABLE" very well. With Oracle Database you can keep all your data in memory if you wish, or even tell Oracle to process SQL queries on data coming from any file on the file system. Oracle Database is just an example; you could do the same with many DBMSs.

The question you should ask is why you need a tier other than the database tier to manage data. Is it for integration purposes? For presentation purposes? And in those cases, is it pertinent to have local data files that are not managed by your DBMS? For what reason? Performance? Scalability? Security? Availability? Testability?

I think you can find lots of books telling you that you should do this or that, but none will be authoritative, as they will contradict one another.

In the case you describe (COBOL software getting data from many sources), if you want your managers to accept your "in-memory" or "flat file" solution, you should prove it's a better solution than getting the data from multiple sources over a network.

Take the following timings as a reference:

http://norvig.com/21-days.html#answers

Approximate timing for various operations on a typical PC:

execute typical instruction          1/1,000,000,000 sec = 1 nanosec
fetch from L1 cache memory           0.5 nanosec
branch misprediction                 5 nanosec
fetch from L2 cache memory           7 nanosec
Mutex lock/unlock                    25 nanosec
fetch from main memory               100 nanosec
send 2K bytes over 1Gbps network     20,000 nanosec
read 1MB sequentially from memory    250,000 nanosec
fetch from new disk location (seek)  8,000,000 nanosec
read 1MB sequentially from disk      20,000,000 nanosec
send packet US to Europe and back    150 milliseconds = 150,000,000 nanosec

Those numbers don't have to reflect your context exactly; only their values relative to each other matter. With those numbers, and a few hypotheses, you can compute the following (a rough sketch of such an estimate appears after the list):

  • time to do the functionality you want having data in-memory
  • time to do the functionality you want having data locally on a flat file
  • time to do the functionality you want having data distributed on many nodes on a network.
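
For example, a rough back-of-the-envelope estimate might look like this sketch; the workload assumptions (10,000 lookups per calculation, one disk seek or one 2 KB network round trip per lookup) are invented purely for illustration and should be replaced with numbers measured on your own system:

    # Rough estimate using the latency figures quoted above. The workload
    # assumptions below are invented for illustration only.
    NS_MAIN_MEMORY_FETCH = 100          # fetch from main memory
    NS_DISK_SEEK = 8_000_000            # fetch from new disk location (seek)
    NS_NETWORK_2KB = 20_000             # send 2K bytes over 1 Gbps network

    LOOKUPS_PER_CALC = 10_000           # assumed number of table lookups

    estimates_ns = {
        "in-memory table":  LOOKUPS_PER_CALC * NS_MAIN_MEMORY_FETCH,
        "local flat file":  LOOKUPS_PER_CALC * NS_DISK_SEEK,     # worst case: one seek per lookup
        "over the network": LOOKUPS_PER_CALC * NS_NETWORK_2KB,   # one round trip per lookup
    }

    for label, ns in estimates_ns.items():
        print(f"{label}: {ns / 1_000_000:,.0f} ms")

    # in-memory table: 1 ms
    # local flat file: 80,000 ms
    # over the network: 200 ms

The exact figures matter much less than the orders of magnitude separating the three options.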

With a few numbers for each case, you don't have "authoritative sources" (authority is relative) to support your choice; you have proof that your choice is better than the other choices. And it is other people's job to prove your reasoning wrong.

Jérôme Radix
Jerome, I totally agree with some points; unfortunately it's not about any of those things - it's about portability - we're running COBOL code on multiple platforms to perform financial/investment value projections (yes, based on the systems and personnel to which our shop is committed, this was the best decision by FAR - strange as it sounds, I'll admit). This is a business logic module that gets all its OTHER values (and db usage) from its respective presentation layers, and it therefore requires, unnecessarily, a complete DB connection and its overhead/maintenance on all platforms - big $$.
FastAl
In the case you describe, if you want your managers to accept your "in-memory" or "flat file" solution, you should prove it's a better solution than getting the data from multiple sources over a network. I have updated my answer with information on that point.
Jérôme Radix
Jerome - wow, great info! You will be happy to know we already DID refactor some of this system to use in-memory - for the PC only, and for 1 out of several products only (but people were wanting to switch it back to DB? I don't think they are anymore) ... Compute time for the worst case went from 2 min to 10 sec. Average case went from 30 sec to 5 sec. WOW!!! (I was bummed, I'd hoped for 10x >:-) ... As this calc was part of a user interface, even 30 sec was obviously untenable, and it is repeated throughout the use of the program too. So a coworker and I had to drop everything for 2 weeks..
FastAl
A: 

I have to agree with other opinions here. Software engineering typically doesn't work like other engineering fields: problems do not lend themselves to a specific solution or set of solutions, and more often than not, sources that state otherwise are easily proven wrong. It may be true that in many or most situations approach X is better, but then your task is to show that your situation is included in "many" or "most", and in a neutral environment, that's not always easy.

In a biased environment, one where there is an existing approach that inertia favors, you're not arguing for a specific implementation as much as against a particular set of opinions, and that is many times more difficult. Even demonstrating that the current approach is more painful or costly or whatever can fall on deaf ears; if that is the case, even if it were possible to show that there is a different way that is better in 100% of the business cases that can be presented, whoever sponsored the existing approach is likely to shoot yours down out of spite.

Dave DuPlantis
So sad, and so right... Of course I know some other engineers and I'm not sure they'd agree with your distinction ;-)
FastAl
A: 

As already pointed out, there's no definitive answer from an Engineering standpoint.

Be flexible! Implement both with some isolation layer. And measure the actual results in your case.
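
A minimal sketch of such an isolation layer, written in Python with hypothetical names purely for illustration (the real system is COBOL): callers depend on one small lookup interface, and the backing store - hardcoded table or database - can be swapped and measured without touching the business logic.

    from abc import ABC, abstractmethod
    from typing import Optional

    class RateSource(ABC):
        """Hypothetical isolation layer: business logic sees only this interface."""
        @abstractmethod
        def lookup(self, key: int) -> Optional[float]:
            ...

    class InMemoryRateSource(RateSource):
        """Backed by a hardcoded table of constants."""
        def __init__(self, table: dict):
            self._table = table

        def lookup(self, key: int) -> Optional[float]:
            return self._table.get(key)

    class DatabaseRateSource(RateSource):
        """Backed by a database, e.g. an sqlite3 connection (illustrative only)."""
        def __init__(self, connection):
            self._conn = connection

        def lookup(self, key: int) -> Optional[float]:
            row = self._conn.execute(
                "SELECT rate FROM rates WHERE rate_key = ?", (key,)).fetchone()
            return row[0] if row else None

    def project_value(principal: float, key: int, source: RateSource) -> float:
        """Business logic depends only on RateSource, so either backend can be benchmarked."""
        rate = source.lookup(key)
        if rate is None:
            raise KeyError(key)
        return principal * (1 + rate)

Usage would be something like project_value(1000.0, 120, InMemoryRateSource({120: 0.021})), and the same call can be timed against a DatabaseRateSource to measure the actual results in your case.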

Besides, it looks like there are some strong constraints on the values you're using, if "a hardcoded table of values in a [COBOL] program" is an actual solution. So, they're essentially a big bunch of constants?

pascal
As far as an isolation layer goes - well, the system is not only already written and engineered, it's mature. But the way things are sucks so badly it merits refactoring. Yes, it is data that never changes; it is a bunch of constants. They were as difficult to come up with as the code that uses them, and one never changes without the other (the algorithms are tightly intertwined). And every other reason you could think of NOT to use a database!! Except for not wanting to implement a binary search, and wanting the program to fit in the memory constraints of an '80s mainframe.
FastAl
+5  A: 

From Fundamentals of Relational Database Management Systems:

1.5 Objectives of DBMS

The main objectives of database management system are data availability, data integrity, data security, and data independence.

1.5.1 Data Availability

Data availability refers to the fact that the data are made available to wide variety of users in a meaningful format at reasonable cost so that the users can easily access the data.

1.5.2 Data Integrity

Data integrity refers to the correctness of the data in the database. In other words, the data available in the database is reliable data.

1.5.3 Data Security

Data security refers to the fact that only authorized users can access the data. Data security can be enforced by passwords. If two separate users are accessing a particular data at the same time, the DBMS must not allow them to make conflicting changes.

1.5.4 Data Independence

DBMS allows the user to store, update, and retrieve data in an efficient manner. DBMS provides an “abstract view” of how the data is stored in the database. In order to store the information efficiently, complex data structures are used to represent the data. The system hides certain details of how the data are stored and maintained.

[snip]

1.9 Drawbacks of File-Based System

The limitations of the file-based approach are duplication of data, data dependence, incompatible file formats, and separation and isolation of data.

1.9.1 Duplication of Data

Duplication of data means the same data being stored more than once. This can also be termed data redundancy. Data redundancy is a problem in the file-based approach due to the decentralized approach. The main drawbacks of duplication of data are:

  • Duplication of data leads to wastage of storage space. If the storage space is wasted it will have a direct impact on cost. The cost will increase.
  • Duplication of data can lead to loss of data integrity; the data are no longer consistent. Assume that the employee detail is stored both in the department and in the main office. Now the employee changes his contact address. The changed address is stored in the department alone and not in the main office. If some important information has to be sent to his contact address from the main office then that information will be lost. This is due to the lack of a centralized approach.

1.9.2 Data Dependence

Data dependence means the application program depends on the data. If some modifications have to be made in the data, then the application program has to be rewritten. If the application program is independent of the storage structure of the data, then it is termed as data independence. Data independence is generally preferred as it is more flexible. But in file-based system there is program-data dependence.

1.9.3 Incompatible File Formats

As file-based system lacks program data independence, the structure of the file depends on the application programming language. For example, the structure of the file generated by FORTRAN program may be different from the structure of a file generated by “C” program. The incompatibility of such files makes them difficult to process jointly.

1.9.4 Separation and Isolation of Data

In file-based approach, data are isolated in separate files. Hence it is difficult to access data. The application programmer must synchronize the processing of two files to ensure that the correct data are extracted. This difficulty is more if data has to be retrieved from more than two files.

The drawbacks of the conventional file-based approach are summarized below:

  1. We have to store the information in a secondary memory such as a disk. If the volume of information is large, it will occupy more memory space.
  2. We have to depend on the addressing facilities of the system. If the database is very large, then it is difficult to address the whole set of records.
  3. For each query, for example the address of the student and the list of electives that the student has chosen, we have to write separate programs.
  4. While writing several programs, lot of variables will be declared and it will occupy some space.
  5. It is difficult to ensure the integrity and consistency of the data when more than one program accesses some file and changes the data.
  6. In case of a system crash, it becomes hard to bring back the data to a consistent state.
  7. “Data redundancy” occurs when identical data are distributed over various files.
  8. Data distributed in various files may be in different formats hence it is difficult to share data among different application (Data Isolation).

[snip]

1.16 Situations where DBMS is not Necessary

It is also necessary to specify situations where it is not necessary to use a DBMS. If traditional file processing system is working well, and if it takes more money and time to design a database, it is better not to go for the DBMS. Moreover if only one person maintains the data and that person is not skilled in designing a database as well as not comfortable in using the DBMS then it is not advisable to go for DBMS. DBMS is undesirable under following situations:

  • DBMS is undesirable if the application is simple, well-defined, and not expected to change.
  • Runtime overheads are not feasible because of real-time requirements.
  • Multiple accesses to data are not required.

Compared with file systems, databases have some disadvantages:

  1. High cost of DBMS which includes:
    • Higher hardware costs
    • Higher programming costs
    • High conversion costs
  2. Slower processing of some applications
  3. Increased vulnerability
  4. More difficult recovery

This suggests, then, that when any of these criteria are of primary importance, a DBMS has been designed specifically to address them. This does not mean that no other system can achieve your goals; in fact, the objectives of other competing technologies may overlap. But knowing what problem the database solves, you will have to use your judgment to decide if it is right for the task.

I would encourage you to read the remainder of that chapter, as it discusses some of these topics in more detail.

Robert
Wow, EXACTLY what I was looking for! Buying the book NOW!!! I can run down all three sections and for just about each point argue in favor of the table-in-program, for db, 'don't need it, doesn't apply, overkill...' and for file disadvantages, 'not an issue, static data, code always needs to change with it, not an issue, that's OK, etc.' etc. THANK YOU!
FastAl