views:

375

answers:

3

I'm thinking about table layout for a Hibernate-managed class hierarchy, and certainly the table per subclass technique strikes me as the most appropriate in a general sense. However, thinking through the logic I have some concerns about its performance especially as the number of subclasses scale.

To give a very brief (and classic) example, let's say you have the following classes:

public abstract class Animal {
   int pkey;
   String name;
}

public class Dog extends Animal {
   long numSlippersChewed; // int is not large enough...
}

public class Cat extends Animal {
   short miceCaught; // ... but here int is far bigger than required :-)
}

(I'm eliding getters and setters and Hibernate mappings etc., just assume they're the basic obvious case).

The database tables for these entities make sense, you get nice denormalisation and so on. However, what query does Hibernate do in order to pull out an individual animal? I can think of at least two cases where this might happen:

  1. Some other entity having a one-to-one (or one-to-many) mapping, such as a pet field of a Human class. This would store the pkey, so when Hibernate fetches a Human object it will need to fetch the corresponding Animal object too. When given the pkey of the animal, what query(/ies) will Hibernate use to extract and unmarshall the actual Animal data, given that it could reside in the Cat or Dog tables?
  2. HQL such as from Animal where name='Rex' (let's assume names are unique). This is similar to the above in that it lets you identify a row in the superclass table but you don't know which subclass table to inspect for further details. Does HQL even let you issue a query from an abstract class? (Using subclass specific stuff works nicely though, e.g. from Cat where miceCaught > 5).

I can think of two ways that this could be done in SQL and neither seems pretty. One is to run an exists query on each subclass table for the given pkey and then load from the table that returned a hit. Alternatively Hibernate could perform some horrible union query joining in all the tables - essentially simulating the table-per-hierarchy scheme in that the result set would include attributes for all possible subclasses with the individual selects from the subclass tables returning null for the irrelevant arguments. This latter case would probably even need to add a synthetic discriminator column so that Hibernate could know which subclass table actually returned the row and thus what Java class they should be parsed into.


Things get hairier too if you have subtypes of concrete types:

public class Greyhound extends Dog {
   float lifetimeRacingWinnings;
}

Now for a given animal pkey, there may be valid rows in the Dog and Greyhound tables, meaning that my first approach of manually checking the class that corresponds to a pkey gets a lot tougher.

The reason I'm so concerned is that I will be wanting to use this approach on a class hierarchy with about 70 classes with a maximum nesting chain of 4-5 levels, so performing a union query on all of that is likely to have horrible performance. Does Hibernate have any tricks up its sleeve to keep this relatively performant? Or is loading a reference to one of these classes by pkey going to take a long time?

+3  A: 

You'll find that Hibernate writes the query for an unknown animal type with a series of LEFT JOIN statements, one per subclass. So the query will slow as the number of subclasses increases, and will attempt to return an ever wider result set. So you are correct, it doesn't scale well with large class hierarchies.

With HQL, yes you can query the subclass directly, and access its properties. That will then be rendered with a single INNER JOIN.

I haven't tried this with multiple levels of inheritance. If the above hasn't put you off yet, suggest you try it and see - you can turn on SQL debug output to see what is being sent to the database, or simply profile your database.

David M
Thanks - I considered testing myself but for some reason didn't think it would be worthwhile without a realistic body of data for performance testing, and I didn't want to *start* developing this way with my concerns. However you're right that a dummy test case will at least let me see the techniques used, and reduce this from a Hibernate performance question to a SQL performance question, which should be much easier to reason about.
Andrzej Doyle
Referring to your third paragraph: as the SQL I posted shows, the multiple levels of inheritance don't seem to change much. So far as I can tell, the only difference is the order of the clauses in the `case` statement used to generate a synthetic discriminator. So certainly from a performance perspective this should be identical. (Of course there will be more tables to join when concretely instantiating a subclass but that doesn't concern me too much).
Andrzej Doyle
Marking this as accepted since on reflection it clearly answers the question I asked, even though it doesn't provide the magic bullet I was hoping for.
Andrzej Doyle
Thanks. Good question, and great to get some post question feedback from real research!
David M
+1  A: 

After David M's helpful answer I decided to throw together a skeleton test.

I created an abstract superclass, ADTestA, and 25 concrete subclasses in a three-level hierarchy (I expect you can guess their names). Each class had a single integer field with a name corresponding to its letter - so for example, class ADTestG has a single int field g in addition to the b field it inherits from its immediate parent ADTestB, and the pkey and a fields from the top-level abstract superclass.

Issuing the HQL query from ADTestA where pkey=1 resulted in the following SQL:

select adtesta0_.pkey as pkey0_, adtesta0_.a as a0_, adtesta0_1_.b as b1_,
       adtesta0_2_.c as c2_, adtesta0_3_.d as d3_, adtesta0_4_.e as e4_,
       adtesta0_5_.f as f5_, adtesta0_6_.g as g6_, adtesta0_7_.h as h7_,
       adtesta0_8_.i as i8_, adtesta0_9_.j as j9_, adtesta0_10_.k as k10_,
       adtesta0_11_.l as l11_, adtesta0_12_.m as m12_, adtesta0_13_.n as n13_,
       adtesta0_14_.o as o14_, adtesta0_15_.p as p15_, adtesta0_16_.q as q16_,
       adtesta0_17_.r as r17_, adtesta0_18_.s as s18_, adtesta0_19_.t as t19_,
       adtesta0_20_.u as u20_, adtesta0_21_.v as v21_, adtesta0_22_.w as w22_,
       adtesta0_23_.x as x23_, adtesta0_24_.y as y24_, adtesta0_25_.z as z25_,
       case
           when adtesta0_6_.pkey is not null then 6
           when adtesta0_7_.pkey is not null then 7
           when adtesta0_8_.pkey is not null then 8
           when adtesta0_9_.pkey is not null then 9
           when adtesta0_10_.pkey is not null then 10
           when adtesta0_11_.pkey is not null then 11
           when adtesta0_12_.pkey is not null then 12
           when adtesta0_13_.pkey is not null then 13
           when adtesta0_14_.pkey is not null then 14
           when adtesta0_15_.pkey is not null then 15
           when adtesta0_16_.pkey is not null then 16
           when adtesta0_17_.pkey is not null then 17
           when adtesta0_18_.pkey is not null then 18
           when adtesta0_19_.pkey is not null then 19
           when adtesta0_20_.pkey is not null then 20
           when adtesta0_21_.pkey is not null then 21
           when adtesta0_22_.pkey is not null then 22
           when adtesta0_23_.pkey is not null then 23
           when adtesta0_24_.pkey is not null then 24
           when adtesta0_25_.pkey is not null then 25
           when adtesta0_1_.pkey is not null then 1
           when adtesta0_2_.pkey is not null then 2
           when adtesta0_3_.pkey is not null then 3
           when adtesta0_4_.pkey is not null then 4
           when adtesta0_5_.pkey is not null then 5
           when adtesta0_.pkey is not null then 0
       end as clazz_
from ADTestA adtesta0_
           left outer join ADTestB adtesta0_1_ on adtesta0_.pkey=adtesta0_1_.pkey
           left outer join ADTestC adtesta0_2_ on adtesta0_.pkey=adtesta0_2_.pkey
           left outer join ADTestD adtesta0_3_ on adtesta0_.pkey=adtesta0_3_.pkey
           left outer join ADTestE adtesta0_4_ on adtesta0_.pkey=adtesta0_4_.pkey
           left outer join ADTestF adtesta0_5_ on adtesta0_.pkey=adtesta0_5_.pkey
           left outer join ADTestG adtesta0_6_ on adtesta0_.pkey=adtesta0_6_.pkey
           left outer join ADTestH adtesta0_7_ on adtesta0_.pkey=adtesta0_7_.pkey
           left outer join ADTestI adtesta0_8_ on adtesta0_.pkey=adtesta0_8_.pkey
           left outer join ADTestJ adtesta0_9_ on adtesta0_.pkey=adtesta0_9_.pkey
           left outer join ADTestK adtesta0_10_ on adtesta0_.pkey=adtesta0_10_.pkey
           left outer join ADTestL adtesta0_11_ on adtesta0_.pkey=adtesta0_11_.pkey
           left outer join ADTestM adtesta0_12_ on adtesta0_.pkey=adtesta0_12_.pkey
           left outer join ADTestN adtesta0_13_ on adtesta0_.pkey=adtesta0_13_.pkey
           left outer join ADTestO adtesta0_14_ on adtesta0_.pkey=adtesta0_14_.pkey
           left outer join ADTestP adtesta0_15_ on adtesta0_.pkey=adtesta0_15_.pkey
           left outer join ADTestQ adtesta0_16_ on adtesta0_.pkey=adtesta0_16_.pkey
           left outer join ADTestR adtesta0_17_ on adtesta0_.pkey=adtesta0_17_.pkey
           left outer join ADTestS adtesta0_18_ on adtesta0_.pkey=adtesta0_18_.pkey
           left outer join ADTestT adtesta0_19_ on adtesta0_.pkey=adtesta0_19_.pkey
           left outer join ADTestU adtesta0_20_ on adtesta0_.pkey=adtesta0_20_.pkey
           left outer join ADTestV adtesta0_21_ on adtesta0_.pkey=adtesta0_21_.pkey
           left outer join ADTestW adtesta0_22_ on adtesta0_.pkey=adtesta0_22_.pkey
           left outer join ADTestX adtesta0_23_ on adtesta0_.pkey=adtesta0_23_.pkey
           left outer join ADTestY adtesta0_24_ on adtesta0_.pkey=adtesta0_24_.pkey
           left outer join ADTestZ adtesta0_25_ on adtesta0_.pkey=adtesta0_25_.pkey
 where adtesta0_.pkey=1

That's not very pretty, and does correspond to the effective simulation of the per-hierarchy table that I hoped could be avoided.

So it looks like these kind of queries are going to be very expensive. I'll have a think about how often they would be needed (compared, say, to knowing that I want an instance of ADTestP and asking for one of those right off the bat which only joins in the required parent tables). I have a feeling however that this will be unavoidable with references from othe entities; in other words, a one-to-one mapping from a field of type ADTestA is always going to involve exactly this kind of lookup.

(On the other hand the alternative strategies aren't shining beacons of hope either; going the table-per-hierarchy route and having literally hundreds of columns in a single table doesn't sound very efficient either...)

Andrzej Doyle
+1 for feeding back - thanks.
David M
A: 

As long as you access your DB only through Hibernate and you either dont have important data or are ready to write a small migration script, you should be able to make the decision about table per subclass / hierarchy quite late in your development process. That's the beauty of an ORM, it abstract the database structure...

On the other hand, I'm a big fan of "prefer composition over inheritance" (http://stackoverflow.com/questions/49002/prefer-composition-over-inheritance) and I am quite dubious that a model with 70 classes over 4-5 levels cannot be simplified ... but I'll let you think for yourself over that one, after all I dont know what probem you are trying to sovle.

Guillaume
Unfortunately the tables will *primarily* be accessed through Hibernate, but the design needs to be at least sane for other non-Java tools to access occasionally. And I may be confusing the relevance of concerns here, but the class hierarchy is being refactored at the same time that a new table is introduced, with DB performance being one of the primary motivators. I agree with the abstraction issue to some extent but as Joel wrote abstractions leak; if it's not possible to do this in a performant way in Hibernate, I ideally need to discover that as early as possible.
Andrzej Doyle
Oh - and I did consider writing a disclaimer about the number of subclasses being legitimate because I thought someone would highlight it as a smell. :-) Continuing the Animal analogy, I'm writing code for something akin to a zoo or vet so we legitimately have an `Animal` > `Vertebrate` > `Mammal` > `Carnivore` > `Canine` > `Dog` type hierarchy, for many different animals (and yes, there's useful behaviour at each level of the hierarchy). The Java code is much nicer and cleaner with this class structure than with the almost-flat one it replaces.
Andrzej Doyle