Storing graphs in fully-normalized relational databases

views:

answers:

+3 Q:

Storing graphs in fully-normalized relational databases

I've nearly killed myself, multiple times, trying to find the perfect, flexible schema for storing many different types of objects with a wide variety of links between them in a relational database. But models such as EAV and things like polymorphic relationships just aren't true to relational databases.

After learning about NoSQL graph databases, I realized I had been trying to fit a square peg in a round hole. (I look forward to trying it out sometime but can't right now.) Looking for a fresh approach, I found myself asking a new question: Why am I not using my code to dynamically create this flexibility?

If you have n different entities in n different tables, why not let your code generate n(n+1)/2 "link" tables and the queries between them? Would this not result in a true graph in a normalized schema?

I'm sure this has been done before, but I've never seen it. What am I missing?

Edit:

Examples of application:

Storing printed works and their bibliographic data (there could be many fields which might link not only to strings but whole objects). In the library world, there are no simple (and relational) schemas which can store data "losslessly" without extremely complex, manual schemas.
Keeping track of physical objects, including owner, maintainer, customer, etc. as well as all relevant properties for each specific type of object (which can vary wildly). These objects have many relationships, of different types, amongst themselves.

Edit 2:

Piet asks, "What problem are you trying to solve?"

Andrew says, "I would hate to be the guy who has to understand or maintain it after it's gone into production."

Perhaps I should clarify:

The problem is that maintaining a schema which constantly grows (or simply needs to be large to fully describe your data) is nearly impossible. Why are we doing this manual and predictable task of creating tables for links (or "edges", in the graph context)?

Does the fact that we traditionally do this by hand limit their complexity and power?

In a highly interlinked database, there will always be exponentially more edges than vertices. Why not focus on creating proper, normalized verticles (tables) and let our code maintain the edges?

+2 A:

This depends wholly on the definition of your graph.

The only "true" way to store a graph, in a relation database or otherwise, is a simple adjacency list (or one of its variants). Everything else is a derivative, specialization, or optimization of this technique, and depends on knowledge of the problem domain.

The method you describe in your question is essentially de- or re-normalizing this universal adjacency list into number of "typed" adjacency lists (or link tables), which may or may not be more appropriate, depending on your problem.

I'm sure this has been done before, but I've never seen it. What am I missing?

You're probably not missing anything: it's actually extremely rare to need to store a general graph like this. What problem are you trying to solve?

Addendum

In a highly interlinked database, there will always be exponentially more edges than vertices. Why not focus on creating proper, normalized verticles (tables) and let our code maintain the edges?

I think this is much more common than you might think. I'm mainly familiar with Python, but all the major ORMs / RDBMS toolkits available for it (SQLAlchemy, Django, SQLObject, ...) support automatic maintenance of many-to-many link tables as a standard feature.

Piet Delport 2010-10-16 22:19:44

Examples of applications added. It's not meant to be general; the problem is that even specific problems do not fit into a neat schema. I'm just questioning the need of human-generated tables. Does the fact that we traditionally make them by hand limit their complexity and power?

Tim 2010-10-16 22:35:27

Tim: I'm not sure there's any tradition of manually generating tables. It is very common to generate SQL schema definitions using ORM or similar tools. For example, if you use Django (a very widely used web application framework for Python), all of this is automatic, including the implicit generation and use of link tables for [many-to-many relationships](http://docs.djangoproject.com/en/1.2/topics/db/models/#many-to-many-relationships).

Piet Delport 2010-10-16 22:50:42

You are very correct. But even in Rails, I was searching for something like this. For example, with the bibliography example, there might be 10 different types of publications with a table for each, with a one-to-many relationship to authors for each. I'm not sure how else it could be done without losing some normalization. At work, if we had a normalized database, there would be around 100 entity tables, and potentially hundreds of link tables. Our current definition of SQL "automation" would seem to break down at that level.

Tim 2010-10-16 23:12:40

Tim: but what is the problem? If the 10 different types have publications have things in common, you can extract one parent table containing things like the author link, and keep 10 child tables for the subtype-specific things. If they *don't* have anything in common, then by definition you've reached the limit of re-use, and have to maintain their specifics separately.

Piet Delport 2010-10-16 23:29:21

I agree with your comment but wonder how you scale from that example to having 20 top-level classes with 10 subclasses each. How does a website like Amazon or IMDB design their database? I would imagine Amazon would have to have hundreds of tables, but I suppose given their resources, managing it wouldn't be hard. (By the way, I found something that might be very close to what I'm thinking of: http://www.andromeda-project.org/)

Tim 2010-10-17 00:39:15

The same way you scale anything else. Again, if the 20 top-level classes have commonalities, you factor it out; if they're unique, you can't do any better. Regarding Amazon, they're not even *remotely* small enough to be able to use any existing RDBMS: they're at a scale where you research and build your own storage infrastructure from scratch, such as S3 and [Dynamo](http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html). I'm not sure about IMDb, but as a subsidiary of Amazon, they likely use those too.

Piet Delport 2010-10-17 04:26:34

What do you find interesting about Andromeda? Unless i'm missing something, it seems pretty limited and cruddy compared to the standard offerings in Python, Ruby, or similar.

Piet Delport 2010-10-17 04:43:00

+3 A:

Your idea would certainly create a completely flexible schema that can represent any kind of object graph. I would hate to be the guy who has to understand or maintain it after it's gone into production.

One benefit in a well designed data schema is the constraints. I'm not just refering to the physical column constraints you can define, but the constraints imposed by the overall structure. There are a fixed set of explicit relationships, and this provides well defined paths to follow.

In your scenario, there would always be a large number of paths from one entity to another. How would somebody know which path was the "right" path. The "right" path will simply be "the set of relationships the developer chose to populate".

Imagine a database that has these relationships.

Customer <===> Invoice <===> InvoiceLineItem <====> Product

If I'm looking at this, and somebody asks me: "Give me a list of customers and for each customer a list of product's they've bought", I would know how to write the query.

But, if this was a graph where everything pointed to everything else, how will I know which path is the "right" path. Will it be the "Customer_Product" relationship, the "Customer_Invoice_Line_Item" to "Customer_Product", or "Customer_Invoice" to "Invoice_Product", or "Customer" to "Invoice" to "Invoice_Line_Item" to "SomeOtherTableIHaven'tEvenLookedAtYet" to "Product"? The answer can be "It should be obvious", but it is very common for something to be obvious to one developer only.

Andrew Shepherd 2010-10-16 22:29:14

Edited OP for clarification. I like constraints, but I often need more than I could possibly maintain by hand. For sanity's sake, either normalization or power is sacrificed. It should be easy to define the constraints, but why not generate and access the tables programmatically?

Tim 2010-10-16 22:53:43

+2 A:

why not let your code generate n(n+1)/2 "link" tables and the queries between them?

Any time I see anything in Computer Science where the answer comes out to be "about n-squared", I immediately think that the answer is wrong. :-)

But more realistically, when "n" gets to be a moderate size, the number of link-tables gets to be enormous, really, really quick. So much so that you can't say that this methodology could represent a general-purpose solution, IMO.

But here's my real objection -- your proposed methodology isn't a viable engineering solution. Engineering is all about making tradeoffs, and this method trades a LOT for generality's sake. For example, here's what you lose by using your method over a tried-and-true "traditional" database design:

You lose the ability to have a discoverable schema -- the number of tables gets out of hand so quickly, anyone looking at your table design can't know what the relationships are.
Almost no kind of data integrity can be enforced by the database other than the most basic referential kind -- all code which uses the database must be careful not to break the rules, or you have data corruption.
You end up potentially having a very large number of tables which model relationships that don't really exist in your business domain. When you use a "link" table, you are essentially modeling a many-to-many relationship, which may or may not exist in the real world.
You potentially lose enormous amounts of speed, and incur a very large penalty in terms of storage used. It's far more efficient to model 1:N relationships by referring to the "parent" entity in the "child" entity directly.

Dave Markle 2010-10-16 23:04:06

Very true--but this "tried-and-true" traditional database design is precisely what I am shooting for. But it gets nearly impossible when you need to store so much data! Imagine if you were told to take all the "data tables" on Wikipedia (or the CIA Factbook) and store it (normalized!) using traditional design. How would one approach this problem? (I really am curious and have no idea)

Tim 2010-10-16 23:21:13

I'm sure Wikipedia has something simple, akin to a table of articles... They don't try to create a fully-connected semantic representation of the meaning of these data for analysis. I'm sure there's work being done on this, but I doubt they use an RDBMS for it... Maybe you should ask these guys what they're doing: http://www.npr.org/templates/story/story.php?storyId=103545843 :)

Dave Markle 2010-10-16 23:29:06

@Dave - I was under the impression Wikipedia's data tables are EAV or something similar. And I can only wonder how IBM managed that. However, in my utopia there would exist only one relational database containing all the world's data in a single, truly normalized schema ;)

Tim 2010-10-17 00:44:23

You can see Wikipedia's schema here: [Manual:Database layout](http://www.mediawiki.org/wiki/Manual:Database_layout). Pages, revisions, and article texts each have their own single tables.

Piet Delport 2010-10-17 04:53:40

ansaurus

tags:

views:

answers:

Storing graphs in fully-normalized relational databases

Addendum

related questions