Is there any way to select a subset from a large set based on a property or predicate in less than O(n) time?

For a simple example, say I have a large set of authors. Each author has a one-to-many relationship with a set of books, and a one-to-one relationship with a city of birth.

Is there a way to efficiently do a query like "get all books by authors who were born in Chicago"? The only way I can think of is to first select all authors from the city (fast with a good index), then iterate through them and accumulate all their books (O(n) where n is the number of authors from Chicago).

I know databases do something like this in certain joins, and Endeca claims to be able to do this "fast" using what they call "Record Relationship Navigation", but I haven't been able to find anything about the actual algorithms used or even their computational complexity.

I'm not particularly concerned with the exact data structure... I'd be jazzed to learn about how to do this in an RDBMS, or a key/value repository, or just about anything.

Also, what about third- or fourth-degree requests of this nature? (Get me all the books written by authors living in cities with immigrant populations greater than 10,000...) Is there a generalized n-degree algorithm, and what are its performance characteristics?

Edit:

I am probably just really dense, but I don't see how the inverted index suggestion helps. For example, say I had the following data:

DATA
1.  Milton        England
2.  Shakespeare   England
3.  Twain         USA

4.  Milton        Paradise Lost
5.  Shakespeare   Hamlet
6.  Shakespeare   Othello
7.  Twain         Tom Sawyer
8.  Twain         Huck Finn

INDEX
"Milton"         (1, 4)
"Shakespeare"    (2, 5, 6)
"Twain"          (3, 7, 8)
"Paridise Lost"  (4)
"Hamlet"         (5)
"Othello"        (6)
"Tom Sawyer"     (7)
"Huck Finn"      (8)
"England"        (1, 2)
"USA"            (3)

Say I did my query on "books by authors from England". Very quickly, in O(1) time via a hashtable, I could get my list of authors from England: (1, 2). But then, for the next step, in order to retrieve the books, I'd have to do ANOTHER O(1) lookup for EACH of the set {1, 2}: 1 -> {4}, 2 -> {5, 6}, and then take the union of the results: {4, 5, 6}.

Or am I missing something? Perhaps you meant I should explicitly store an index entry linking Book to Country. That works for very small data sets, but for a large data set, the number of indexes required to cover every possible combination of queries would grow exponentially.
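In rough Python, the two-step lookup I'm describing would look like this (the postings dict is just a hypothetical stand-in for the INDEX above):

# Hypothetical postings table mirroring the INDEX above:
# each key maps to the set of row ids it appears in.
postings = {
    "England": {1, 2},   # author rows for authors born in England
    1: {4},              # Milton's book rows
    2: {5, 6},           # Shakespeare's book rows
    3: {7, 8},           # Twain's book rows
}

author_ids = postings["England"]                            # one O(1) lookup
book_ids = set().union(*(postings[a] for a in author_ids))  # one lookup per author
print(book_ids)                                             # {4, 5, 6}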

A: 

Inverted Index.

Since this has a loop, I'm sure it fails the O(n) test; but when your result set has n rows, it's impossible to avoid iterating over the result set. The query itself, though, is just two hash lookups.

from collections import defaultdict

country = ["England", "USA"]

author = [("Milton", "England"), ("Shakespeare", "England"), ("Twain", "USA")]

title = [("Milton", "Paradise Lost"),
    ("Shakespeare", "Hamlet"),
    ("Shakespeare", "Othello"),
    ("Twain", "Tom Sawyer"),
    ("Twain", "Huck Finn"),
]

# Each inverted index maps a key to postings lists of row ids,
# bucketed by which table those ids refer to.
inv_country = {}
for i, c in enumerate(country):
    inv_country.setdefault(c, defaultdict(list))
    inv_country[c]['country'].append(i)

inv_author = {}
for i, (a, c) in enumerate(author):
    inv_author.setdefault(a, defaultdict(list))
    inv_author[a]['author'].append(i)
    inv_country[c]['author'].append(i)  # country -> author rows

inv_title = {}
for i, (a, t) in enumerate(title):
    inv_title.setdefault(t, defaultdict(list))
    inv_title[t]['title'].append(i)
    inv_author[a]['title'].append(i)    # author -> title rows

# Books by authors from England: two levels of hash lookups,
# plus the unavoidable iteration over the result set.
for a in inv_country['England']['author']:
    for t in inv_author[author[a][0]]['title']:
        print(title[t])
# ('Milton', 'Paradise Lost')
# ('Shakespeare', 'Hamlet')
# ('Shakespeare', 'Othello')
S.Lott
+1  A: 
SELECT a.*, b.*
   FROM Authors AS a, Books AS b
   WHERE a.author_id = b.author_id
     AND a.birth_city = 'Chicago'
     AND a.birth_state = 'IL';

A good optimizer will process that in less time than it would take to read the whole list of authors and the whole list of books, which is therefore sub-linear time. (If you have another definition of what you mean by sub-linear, speak out.)

Note that the optimizer should be able to choose whichever order of processing the tables is most advantageous, and this applies to N-level sets of queries as well.

Jonathan Leffler
Yes, but what are the algorithms/internal data structures that the optimizer uses? Does it still have to do a linear search *within* the Chicago authors?
levand
It depends on the indexing: if there's an index on birth_city (or on birth_city and birth_state), the optimizer can use it to find the right authors; and yes, it will then do a linear scan, via the index, over just the authors who were born in Chicago.
Jonathan Leffler
+1  A: 

Generally speaking, RDBMSes handle these types of queries very well. Both commercial and open-source database engines have evolved over decades, applying every reasonable computing algorithm available, to do just this task as fast as possible.

I would venture a guess that the only way you would beat an RDBMS on speed is if your data is organized in some specific way and requires specific algorithms. Some RDBMSes let you specify which of the underlying algorithms to use for manipulating data, and with open-source ones you can always rewrite or implement a new algorithm if needed.

However, unless your case is very special, I believe that would be serious overkill. For most cases, putting the data in an RDBMS and manipulating it via SQL should work well enough that you don't have to worry about the underlying algorithms.

Gnudiff
+2  A: 

For joins like this on large data sets, a modern RDBMS will often use an algorithm called a merge join (or "list merge"). Using your example:

  1. Prepare a list, A, of all authors who were born in Chicago, and sort it by author in O(N log N) time.*
  2. Prepare a list, B, of all (author, book name) pairs, and sort it by author in O(M log M) time.*
  3. Place these two lists "side by side", and compare the authors from the "top" (lexicographically minimum) element in each pile.
    • Are they the same? If so:
      • Output the (author, book name) pair from top(B)
      • Remove the top element of the B pile
      • Goto 3.
    • Otherwise, is top(A).author < top(B).author? If so:
      • Remove the top element of the A pile
      • Goto 3.
    • Otherwise, it must be that top(A).author > top(B).author:
      • Remove the top element of the B pile
      • Goto 3.

* (Or with no sorting at all if the table is already sorted by author, or has an index that is.)

The loop removes one item at a time until one of the piles is empty, thus taking O(N + M) steps, where N and M are the sizes of piles A and B respectively. Because the two "piles" are sorted by author, this algorithm discovers every matching pair. It does not require an index (although the presence of indexes may remove the need for one or both sort operations at the start).
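For concreteness, here is a minimal Python sketch of just the merge step, assuming both piles have already been sorted by author (the sample rows are made up for illustration):

# Merge step of a merge join: both inputs must already be sorted by author.
chicago_authors = ["Bellow", "Hansberry", "Sandburg"]    # pile A
books = [("Bellow", "Herzog"),
         ("Hansberry", "A Raisin in the Sun"),
         ("Hemingway", "The Sun Also Rises"),
         ("Sandburg", "Chicago Poems")]                  # pile B

i, j = 0, 0
while i < len(chicago_authors) and j < len(books):
    a = chicago_authors[i]
    b_author, b_title = books[j]
    if a == b_author:
        print((b_author, b_title))  # matching pair: output it
        j += 1                      # pop B; keep A (author may have more books)
    elif a < b_author:
        i += 1                      # pop A: no (more) books for this author
    else:
        j += 1                      # pop B: this book's author isn't in pile A

Each iteration pops exactly one element, so the merge itself is O(N + M).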

Note that the RDBMS may well choose a different algorithm (e.g. the simple one you mentioned) if it estimates that it would be faster to do so. The RDBMS's query analyser generally estimates the costs in terms of disk accesses and CPU time for many thousands of different approaches, possibly taking into account such information as the statistical distributions of values in the relevant tables, and selects the best.

j_random_hacker