views:

1215

answers:

19

I really enjoy database design and the whole concept of managing data semantically and all the logic that comes with it.

My knowledge level when it comes to databases is however (I would guess) quite basic - I can model data relationships correctly with ER diagrams, connection tables, handling many-to-many, one-to-many etc etc. I'm experienced when it comes to programming in general, and I figure my database knowledge is like knowing the basics of object oriented programming, i.e how to model a car-class, inheriting from the vehicle class, containing wheel objects and so on.

Now I would like to further my knowledge about relational databases so that I may confidently say to an employer that I can handle the subject on a professional level.

All I can handle right now is probably my movie database in the back end of my personal website, which probably would collapse if I was Amazon and had to store millions of movies. So there's the subject of scalability right? I'm sure there's a pretty "standard" array of subjects/concepts within database design that you simply must understand and be able to apply in real life if you're going to work with databases on a professional level.

So, I would be very grateful if any database gurus in the field could namedrop some areas, concepts, case studies or anything that would be beneficial to study to get really good at databases. I'm sure there's a vast science lurking here, and I want it.

Thanks in advance!

+8  A: 

The standard text in the field is "An Introduction to Database Systems", by C. J. Date.

I have twenty years C experience; I read it, thought it excellent and I wrote a relational database because of it (a proper one, not this SQL malarky!).

Blank Xavier
Out of interest: What's a "proper" relational database?
Tomalak
One that is fully and truely relational. SQL isn't.
Blank Xavier
Oh, I see. Is your project public, can I download it somewhere?
Tomalak
Thanks. I'm intrigued. Hope the book is as archetypical as you make it sound.
sharkin
Having read Date's "Database in Depth", I too am implementing a proper relational database.
Pete Kirkham
In what way is SQL not "truly relational", just out of interest? [I just love assertions without arguments to back them up :-) ].
paxdiablo
@Tomalak - sorry, no. I wrote it to write it; it works, but there's a lot of work going from "it works" to "you can make it public".
Blank Xavier
@Pax - that's an excellent and absolutely vital question! fortunately, a guy called C.J.Date wrote a book *all* about this - you should read it! :-)
Blank Xavier
@Blank Xavier: I thought so. This leads me to my other question - since SQL doesn't cut if for you, am I safe to assume that you devised a whole new query language as well?
Tomalak
I've read it, @Blank, but I agree with Codd if you're talking about the NULL issue in particular. There's are situations when you don't know everything about a row and no easier way I know of to do it other than with NULLs. If it's something else, let us know. Not arguing, just genuinely interested.
paxdiablo
My Date is in a box in the shed, I shall have to dig it out and have another read.
paxdiablo
@Pax, perhaps you want to start a new question to get answers on this.
bignose
@Pax just to give one example: SQL treats the order of columns as significant. Relational theory says you shouldn't.
JoeCool
+1  A: 

Well it's always good designing examples... See if there's anyone you know who needs a database for something. But studying VLDB (Very Large DataBase) techniques might be useful depending on the industry you're interested in.

Ian
+1  A: 

I believe optimization on existing databases could be interesting to dwelve into. I.e. why you should denormalize tables.

Some basic relation algebra is useful knowledge and closely related to set theory.

Spoike
+4  A: 

A whole other area is dimensional modelling and data warehousing.

I had been working with relational modelling for years, and then I read The Data Warehouse Toolkit and received an entirely new view of how it could be used.

Andrew Shepherd
+1  A: 

A very common scenario is having to map ugly databases to an entity model which is not necessary reflected directly in the structure of the DB. Working out which way is best to model the data in your domain can be tricky.

Full text search and XML are subjects that seem to be coming up more and more.

I have no experience with it but I know DB2 (of which there is a trial version) has some crazy new features)

Have fun :-)

Karl
+2  A: 

Get more dirt from C. J. Date's Database In Depth: Relational Theory for Practitioners if his An Introduction to Database Systems isn't grubby enough for you.

Seriously, these two books will give you a great deal more knowledge about RDBMSes, in a great deal less space, than many other professional database workers possess. Database in Depth, in particular, looks at how to think about databases relationally even when the language doesn't support it, and how to trick SQL into being a close-to-relational language.

bignose
A: 

I highly recommend to start with www.dbdebunk.com. It has a lot of practical stuff in oppose to theory. The site is a little outdated, but still useful. Even commercial content isn't too expensive, if you really like to become database professional.

Thevs
A: 

i would advise narrowing your scope a little. pick an sql server and become expert at it ... for example get mysql, learn the difference between the storage types, the replication types, etc. implement replication in a couple different ways. get a large dataset and try to optimize queries. do some pivots and optimize your indices for that. investigate backup strategies. see how to increase performance in replication and backups when you have a 10gb database that consistently adds 100,000 transactions every day. write software to insert the records and scripts to do replication and backups.

its hard to become an effective dba with no real experience when you try to cover all sql servers. just focus on one ... i'd suggest mysql or mssql, but whatever floats your boat.

-don

Don Dickinson
+3  A: 

I'll volunteer a list of areas that you might want to consider as aspects of programming with databases. I would not claim that you need to be expert at all of them, or even most of them, in order to be able to program using a DBMS, nor even to program a DBMS. However, they are all topics that are of some relevance at some times - in no particular order:

  • Query language design
  • Query optimization
  • Query rewriting
  • Data types
  • Storage organization
  • Transaction management
  • Communications protocols
  • Encryption
  • Authentication and identification
  • Schema design
  • Replication
  • Backup and restore
  • Two-phase commits
  • Optimistic concurrency control
  • Locking and pessimistic concurrency control
  • Authorization
  • Label-based access control
  • Set theory
  • Relational theory
  • Distributed query
  • Boolean logic
  • User-defined types and functions
  • Catalog management
  • Buffer management
  • Sorting
  • Internationalization (I18N), Localization (L10N), Globalization (G11N)
  • Quantifiers
  • Auditing
  • Triggers
  • Stored procedures

I make no claims of completeness or minimality, either.

Jonathan Leffler
I see no points on optimization of the database itself, are there different kinds? which ones are the typical forms of storage/design optimization?
sharkin
It depends on what you mean by 'optimization of the database'. I did mention 'storage organization', which your second sentence suggest may be what you are thinking of. There are a number of different storage organizations. One of the basic, but very effective, ones is using B-Tree or B*-Tree organizations; you will also find hash widely used. There are more specialized organizations too, such as index-only tables, or bit-map indexes, or R-Tree indexes (for spatial data). As a general purpose structure, the B-Tree variants take quite a lot of beating.
Jonathan Leffler
Ok. Many thanks!
sharkin
+2  A: 

being just a student of Databases I can only speak from my limited scope but I can suggest two sites that may help...

http://database-programmer.blogspot.com/2008/09/comprehensive-table-of-contents.html

This is Kenneth Downs site, he goes from the very basics if SQL and delves into more complex subjects. The man has written a framework around DB's after all.

Another one is High Scalability...

http://highscalability.com/

They get into every realm of DB's.

Hope this helps.

Randin
+1  A: 

It depends on what you want to do with your databases, how does your data look, what are your work flows, how many servers, clients and databases you have to work with ...

So let's pretend that you, like me, have to deal with several databases, not to big (< 100 GB each) and you have many clients with many different needs that make you develop many custom solutions, like producing custom reports or exports. That makes you more of a programmer than a DBA. And what you need is productivity, before performance.

The best solution I came up with, in that situation, is getting rid of SQL, as much as possible. You can achieve that by using some kind of ORM, either home made or existing ones, and thus trading SQL scripting for object programming. Doing that I do in minutes what would takes hours with SQL.

dugres
+1  A: 

I think set + relational algebra is a thing that most database users know little about but would do well to learn. When you appreciate the logic involved behind mapping one relation to another you start to see more clearly why things like normalization are good, why NULLs are best avoided if possible, etc. You also start to see the flaws in SQL compared to purer relational query languages, where features impose limitations on the paradigm due to performance reasons, etc.

Kylotan
+1  A: 

Disclaimer: not an expert in database design.

Some of the performance issues can be handled either by:

  1. denormalizing your database, so to reduce the number of tables to join
  2. adding indexes
  3. filtering should be done so that you first remove the largest of the non matching data, then you cherry pick the next condition on the reduced set. It's better to go from 100 values -> 10 matching first condition -> 1 matching first and second condition than 100 values -> 80 matching second condition -> 1 matching first and second condition. Seems trivial, but it's important to keep in mind.
  4. divide et impera is the motto for scalability. If you have something that scales in a non-linear way, say O(N^2) it makes sense to keep N as low as you can, and you should partition your data set into smaller sets, assuming they are independent and you can work out the partitioning. An example of this is sharding, typically used to keep databases of users in large social websites. (NB: an example, I would not implement it this way) Instead of having a huge database with all the users, they have 26 servers (one for each letter of the alphabet), then they put all the nicknames with the same first letter in the same server. This has the following advantages:

    a. you balance the load on different machines
    b. if one machine crashes, you make the site unaccessible only to a subset of your users, not to all of them
    c. you preselect the search with a highly discriminating criterium (the first letter), then perform the second search (the username)
    d. you reduce the number of entries each database has.

Stefano Borini
+1  A: 

There is only one rigorous technique for conceptually modeling a relational database schema that I know of (and I've spent a lot of time looking). It's confusingly named "Object-Role Modeling". Here are a couple references.

http://www.agilemodeling.com/artifacts/ormDiagram.htm

http://www.tdan.com/view-articles/5033

http://en.wikipedia.org/wiki/Object_role_modeling

http://en.wikipedia.org/wiki/NORMA

and here's a plugin for Visual Studio

le dorfier
+1  A: 

Don't forget representing hierarchy and/or graphs in databases. It can be a pain & there's no right answer.

The standard techniques (for hierarchy at least) have been mentioned in these SO posts:

edit: There's also spatial database applications for GIS use, where you have data structures and/or indices based on point locations using R-trees and the like. Using these is a little bit different than the regular non-spatial database features.

Jason S
+1  A: 

To my mind there are three "tracks" with database skills: Developer, DBA and Architect. From a development perspective you want to focus on development, understand Architect and pick up as much DBA stuff as you need along the way.

As a developer the key thing (to my mind) would be to get your SQL to a really good standard. As an interviewer if I'm looking for a developer I don't care if you can design databases as much as how you can write queries. Assuming you know about your basic CRUD commands, do you know about:

Stored Procedures (not just how to use them but when and why)
Views (ditto, including materialised views)
Triggers (insert, update, delete, how and why)
Cursors (especially impact on performance)
Referential integrity
Transactions
Indexes
Adding defaults, constraints and identities to tables
Complex use of group by and having
Functions especially:
- Date and time manipulation
- String manipulation
- Handling nulls

You should be able to pull any data you need from your database using SQL alone, you should never need to manipulate or parse it in any way using your procedural code (you may choose to but it will be a choice rather than you didn't know how to do it with SQL).

As a developer the one booking I'd look at is Joe Celko's SQL for Smarties. Lots of SQL to do things you may never have really thought about being able to do in SQL.

One of the best ways to learn this stuff is, tedious as it seems, writing reports (management information). I've seen so many people moan about writing reports being tedious and then do it really, really badly (and not just because they didn't try). Reports tend to be close to pure SQL so you have to really get to know the tools at hand and a complex report really exposes those who know SQL from those who don't. People also tend not to want to wait too long for them so performance is key too.

Look at your current database and come up with a bunch of really really awkward things someone might actually want to know. Think marketing, trends, most and least popular. Then try and combine a bunch of them into one query.

In terms of performance I'd also be trying to get inside how the query optimizer works, how it makes decisions about when to use an index and when to table scan, when indexes will help and when they'll hinder.

A good developer doesn't just write good queries, they write quick, maintainable queries. To really get to grips with this you'll need to play around with a database with a dozen (or more tables) containing, ideally, millions of rows. That's when you start seeing queries you thought were fine dragging their heels.

The architect/designer stuff others have covered pretty well. All I'd say on the subject is that for every database that has to be designed there are hundreds of queries that need to be written for it. You might want to consider that proportional break down of work when you're upskilling and make sure your querying is really up to scratch.

In terms of links it depends on the platform - all this stuff tends to be platform specific. But then that's what google is for.

Not I suspect entirely what you want but worth knowing as a lot of people who think they know SQL really really don't...

Jon Hopkins
A: 

Well, to be frankly a database is just a way of storing and accessing data. Pretty much what a file system does too.

A parallel from LDAP is that it is a protocol and by this not a definition of what you can do with it and how it should be implemented, the same could be said about SQL.

So if you want to know more about databases you are actually saying you want to know more about the SQL protocol and/or how to store and fetch data.

You might be interested in searching around for what a 'B-Tree' is and how it is used. Another thing worth looking up is EAV (Entity-Attribute-Value) and why schema's are so important for it.

With that knowledge you could actually role out your own DB and while doing that appreciating what RDBM's already have done for you.

If you want a more practical approach look at the documentation that the open-source PostgreSQL provides, perhaps starting with this.

Martin P. Hellwig
A: 

You could start by reading one of the (almost recent) review papers that focuses on the foundations and trends in Databases: The anatomy of databases

amit-agrawal