views:

109

answers:

1

I have a set of records, stored as XML files, where the XML files are arranged in a tree structure. For each child record, elements or attributes which are not explicitly stated are assumed to be inherited from the parent record. This is easy to model in a database, with a self-referential foreign key, e.g.

Tree Structure


  Foo
  /  \
Bar1 Bar2

Database Table


 id | parent_id | name  | attribute 1 | attribute 2
 1  | null      | Foo   | 0.4         | "tastes like chicken"
 2  | 1         | Bar 1 | null        | "doesn't everything"
 3  | 1         | Bar 2 | 0.2         | null

(In production I would use an additional column(s) to store the tree structure, but this is omitted for the sake of simplicity.)

In this system, Bar1 would inherit a value of 0.4 for attribute 1, and Bar2 would inherit a value of "tastes like chicken" for attribute 2.

As I am far from a guru of any stripe, I have several questions about how best to work with system.

  1. As I need to be able to export from the database in the original format, I can't simply pre-calculate and cache the "missing" records. Or can I? Note that caching would have be somewhat intelligent, as an update may change values in Foo that should propogate down to Bar1 and Bar2.

  2. Are there standard database tools or SQL parameters that could "build" the Bar records correctly in one (or relatively few) queries? I know about common table expressions, and it seems like this is halfway to a solution, but this would return the tree of rows, and then further processing would be required to fill out all attributes appropriately.

  3. Are their other tips and trips for working with this type of data and database structure? Maybe this is a silly question, but I don't have any formal programming education, so many concepts are quite new to me.

Note that creating separate tables for the various branches is not a possibility, as the tree can have arbitrary depth. Anticipated tree size is ~ 3000 child records, with a tree 3 levels deep, and thousands of trees to be stored.

I am working with Django, if that matters, but am semi-comfortable working with raw SQL queries (e.g. I build a Django app to manipulate directed acyclic graphs, which uses mostly raw SQL because of the complicated grouped by and having clauses).

+1  A: 

You did not specify your database, but looking at your link to the directed acyclic graphs I saw that you use PostgreSQL. If you can settle with the newest version 8.4 you can try the following (that should answer your 2nd question):

Using the table

create table Tree(
  id serial primary key check(id > 0),
  parent int references Tree(id),
  name varchar(100) not null unique check(length(name)>0),
  attr1 int,
  attr2 varchar(15),
  constraint charlength check(id > parent)
);

filled with this data

insert into Tree (name, parent, attr1, attr2)
          values ('Foo', null, 5, 'high'), -- will have id 1
                 ('Bar1', 1, null, 'low'), -- will have id 2
                 ('Bar2', 1, null, null),
                 ('Bar3', 2, null, null);

the query

with recursive parents(id, name, attr1, attr2, parent, level, who) as (
  select id, name, attr1, attr2, parent, 1, id
    from Tree t
    where t.name in ('Bar3', 'Bar2')
  union all
  select t.id, p.name, coalesce(p.attr1, t.attr1), coalesce(p.attr2, t.attr2), t.parent, p.level+1, p.who
    from parents p, Tree t
    where t.id = p.parent
  ) select distinct on (who) who, name, attr1, attr2
      from parents
      order by who, level desc;

yields (with my ruby program wrapping the queries):

{"attr1"=>"5", "name"=>"Bar2", "attr2"=>"high", "who"=>"3"}
{"attr1"=>"5", "name"=>"Bar3", "attr2"=>"low", "who"=>"4"}

About your 1st question I'm not sure what you mean. But maybe you can recover the XML from its presentation in the database?!? For your 3rd question I hope to be able to post this weekend an alternative for storing trees. Otherwise I'd say that your ideas look good to me.

jug
Thanks jug. The combination of common table expressions and coalesce is perfect.
Chris Mutel
My favorite way for storing trees is to encode the position in the tree by a unicode string (utf-8), the parent being the string obtained by dropping the last unicode character (in PostgreSQL: substring(treepos from 1 for char_length(treepos)-1)). If you want to know more about it, please tell me.
jug