ansaurus

Question

Data structure or algorithm for second degree lookups in sub-linear time?

Answer 1

A:

Since this has a loop, I'm sure it fails a strict sub-linear test. However, when your result set has n rows, it's impossible to avoid iterating over the result set. Finding the matching rows, however, takes only hash lookups, not a scan of the tables.

from collections import defaultdict

country = [ "England", "USA" ]

author = [ ("Milton", "England"), ("Shakespeare", "England"), ("Twain", "USA") ]

title = [ ("Milton", "Paradise Lost"), 
    ("Shakespeare", "Hamlet"),
    ("Shakespeare", "Othello"),
    ("Twain","Tom Sawyer"),
    ("Twain","Huck Finn"),
]

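# Inverted index: country name -> row ids, grouped by the table that references it
inv_country = {}
for row_id, c in enumerate(country):
    inv_country.setdefault(c, defaultdict(list))
    inv_country[c]['country'].append(row_id)

# Inverted index: author name -> row ids, grouped by the table that references it
inv_author = {}
for row_id, row in enumerate(author):
    a, c = row
    inv_author.setdefault(a, defaultdict(list))
    inv_author[a]['author'].append(row_id)
    inv_country[c]['author'].append(row_id)

# Inverted index: title -> row ids; also record each title under its author
inv_title = {}
for row_id, row in enumerate(title):
    a, t = row
    inv_title.setdefault(t, defaultdict(list))
    inv_title[t]['title'].append(row_id)
    inv_author[a]['title'].append(row_id)

# Books by authors from England:
# country -> author rows (first degree), then author -> title rows (second degree)
for a_id in inv_country['England']['author']:
    a, c = author[a_id]
    for t_id in inv_author[a]['title']:
        print(title[t_id])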

S.Lott 2009-01-09 02:01:08

Answer 2

+1 A:

SELECT a.*, b.*
   FROM Authors AS a, Books AS b
   WHERE a.author_id = b.author_id
     AND a.birth_city = 'Chicago'
     AND a.birth_state = 'IL';

A good optimizer will process that in less time than it would take to read the whole list of authors and the whole list of books, which makes it sub-linear. (If you mean something else by sub-linear, speak up.)

Note that the optimizer should be able to choose the most advantageous order in which to process the tables. This also applies to N-level sets of queries.

Jonathan Leffler 2009-01-09 02:42:00

Yes, but what are the algorithms/internal data structures that the optimizer uses? Does it still have to do a linear search within the Chicago authors?

levand 2009-01-09 02:51:55

It depends on the indexing. If there's an index on birth_city (or on birth_city and birth_state), the optimizer can use it to find the right authors. And yes, it will then do a linear scan, via the index, of the authors who were born in Chicago.
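
As a rough illustration of that index scan, here is a minimal Python sketch. The sample rows and the names `city_index` and `authors_born_in` are hypothetical, and a sorted list searched with `bisect` only stands in for a real B-tree index. It finds the first Chicago entry in O(log n) and then scans only the matching entries.

```python
import bisect

# Hypothetical rows: (author_id, name, birth_city)
authors = [
    (1, "Author A", "Chicago"),
    (2, "Author B", "Boston"),
    (3, "Author C", "Chicago"),
    (4, "Author D", "Austin"),
]

# Stand-in for an index on birth_city: (birth_city, row position) pairs, kept sorted
city_index = sorted((city, pos) for pos, (_id, _name, city) in enumerate(authors))

def authors_born_in(city):
    """Return the author rows for one city without scanning the whole table."""
    # Binary search for the first index entry for this city: O(log n)
    i = bisect.bisect_left(city_index, (city,))
    matches = []
    # Linear scan, but only over the index entries that match: O(k)
    while i < len(city_index) and city_index[i][0] == city:
        matches.append(authors[city_index[i][1]])
        i += 1
    return matches

print(authors_born_in("Chicago"))
```

The cost is O(log n + k) for k matching authors, so the linear part is proportional to the result, not to the table.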

Jonathan Leffler 2009-01-09 05:33:25

Answer 3

+1 A:

Generally speaking, RDBMSes handle these types of queries very well. Both commercial and open source database engines have evolved over decades, applying every reasonable algorithm to do exactly this task as fast as possible.

I would venture a guess that the only way you would beat an RDBMS in speed is if your data is organized in a specific way and requires specific algorithms. Some RDBMSes let you specify which of the underlying algorithms to use for manipulating data, and with open-source ones, you can always rewrite or implement a new algorithm, if needed.

However, unless your case is very special, I believe that would be serious overkill. For most cases, I would say putting the data in an RDBMS and manipulating it via SQL should work well enough that you don't have to worry about the underlying algorithms.

Gnudiff 2009-01-09 10:27:57

Answer 4

+2 A:

For joins like this on large data sets, a modern RDBMS will often use an algorithm called a list merge (more commonly known as a merge join or sort-merge join). Using your example:

1. Prepare a list, A, of all authors who live in Chicago and sort them by author in O(N log N) time.*
2. Prepare a list, B, of all (author, book name) pairs and sort them by author in O(M log M) time.*
3. Place these two lists "side by side", and compare the authors from the "top" (lexicographically minimum) element in each pile.
   - Are they the same? If so:
     - Output the (author, book name) pair from top(B)
     - Remove the top element of the B pile
     - Goto 3.
   - Otherwise, is top(A).author < top(B).author? If so:
     - Remove the top element of the A pile
     - Goto 3.
   - Otherwise, it must be that top(A).author > top(B).author:
     - Remove the top element of the B pile
     - Goto 3.

* (Or no sorting time at all if the table is already sorted by author, or has an index which is.)

The loop continues removing one item at a time until one of the piles is empty (anything left in the other pile cannot match), thus taking at most O(N + M) steps, where N and M are the sizes of piles A and B respectively. Because the two "piles" are sorted by author, this algorithm will discover every matching pair. It does not require an index (although the presence of indexes may remove the need for one or both sort operations at the start).
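
The steps above can be sketched in Python roughly as follows. This is an illustration, not how any particular RDBMS implements it. The function name `merge_join` is made up for the example, the sample data is borrowed from Answer 1 (authors from England rather than Chicago), and it assumes each author appears only once in pile A.

```python
def merge_join(authors, books):
    """Join a list of unique author names with (author, book name) pairs."""
    a = sorted(authors)                          # pile A: O(N log N)
    b = sorted(books, key=lambda pair: pair[0])  # pile B: O(M log M), sorted by author
    i = j = 0                                    # positions of the "top" of each pile
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j][0]:
            out.append(b[j])   # same author: output the pair, drop top of B
            j += 1
        elif a[i] < b[j][0]:
            i += 1             # top(A) is smaller: drop top of A
        else:
            j += 1             # top(A) is larger: drop top of B
    return out

english_authors = ["Shakespeare", "Milton"]
books = [
    ("Milton", "Paradise Lost"),
    ("Shakespeare", "Hamlet"),
    ("Shakespeare", "Othello"),
    ("Twain", "Tom Sawyer"),
    ("Twain", "Huck Finn"),
]

for pair in merge_join(english_authors, books):
    print(pair)
# ('Milton', 'Paradise Lost')
# ('Shakespeare', 'Hamlet')
# ('Shakespeare', 'Othello')
```

Each pass through the loop advances `i` or `j` by one, which is where the O(N + M) bound on the merge phase comes from.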

Note that the RDBMS may well choose a different algorithm (e.g. the simple one you mentioned) if it estimates that it would be faster to do so. The RDBMS's query analyser generally estimates the costs in terms of disk accesses and CPU time for many thousands of different approaches, possibly taking into account such information as the statistical distributions of values in the relevant tables, and selects the best.

j_random_hacker 2009-01-09 10:57:55
