This is a mental excercise that has been bothering me for a while. What strategy would you use to solve this sort of problem?
Let's consider the following simple database structure. We have directories, obviously a tree of them. Also we have content items, which always reside in some directories.
create table directory (
directoryId integer generated always as identity primary key,
parentId integer default null,
directoryName varchar(100)
);
create table content (
contentId integer generated always as identity primary key,
directory integer references directory(directoryId),
contentTitle varchar(100),
contentText varchar(32000)
);
Now let's assume that our directory tree is massive and the amount of content is massive. The solution must scale well.
The main problem: How to efficiently retrieve all content items that are found from the specified directory and its subdirectories?
The way I see it SQL can not be used to get easily all the directoryIds for a subselect. Am I correct?
One could solve this at application side with simple recursive loop. That might become actually very heavy though and require tricky caching, especially to quarantee reasonable first access times.
One could also perhaps build a materialized query table and add multi-dimensional indexes dynamically for it. Possible but an implementation mess. Too complex.
My far most favorite solution would be probably to add a new table like
create table subdirectories (
directoryId integer,
subdirectoryId integer,
constraint thekey primary key (directoryId,subdirectoryId)
)
and make sure I would always update it manually when directories are being moved/deleted/created. Thus I could always do a select with the directoryId and get all Ids for subdirectories, including as a subselect for more complex queries. I also like the fact that the rdbms is able to optimize the queries well.
What do you guys think?