The database schema is

CREATE TABLE sites
(
    site_id           INTEGER PRIMARY KEY AUTOINCREMENT,
    netloc            TEXT UNIQUE NOT NULL,
    last_visited      REAL DEFAULT 0,
    crawl_rate        REAL DEFAULT 2,
    crawl_frequency   REAL DEFAULT 604800,
    robots_txt        TEXT DEFAULT 0,
    robots_last_fetch REAL DEFAULT 0,
    allow_all         NUMERIC DEFAULT 0,
    disallow_all      NUMERIC DEFAULT 0,
    active            NUMERIC DEFAULT 0                           
 )

CREATE TABLE urls
(
     url_id       INTEGER PRIMARY KEY AUTOINCREMENT,
     site_id      INTEGER REFERENCES sites (site_id) NOT NULL,
     scheme       TEXT NOT NULL,
     path         TEXT NOT NULL,
     last_visited REAL DEFAULT 0,
     UNIQUE (site_id, scheme, path)
 )

As you can probably guess, this is for a web crawler.

I want to get N of the sites that have crawlable urls associated with them, along with all of those urls. A url is crawlable if url.last_visited + site.crawl_frequency < current_time, where current_time comes from Python's time.time() function. What I'm looking for will probably begin with something like:

SELECT s.*, u.* FROM sites s JOIN urls u ON s.site_id = u.site_id ...

Beyond that all I can think is that GROUP BY might have some role to play.
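For reference, this is roughly how I get current_time into the query from Python. The in-memory database, the trimmed-down schema, and the sample row are just for illustration; the important part is binding time.time() as a parameter rather than pasting it into the SQL string:

```python
import sqlite3
import time

# Illustrative setup: a trimmed-down copy of the real schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sites (
    site_id INTEGER PRIMARY KEY AUTOINCREMENT,
    netloc TEXT UNIQUE NOT NULL,
    crawl_frequency REAL DEFAULT 604800
);
CREATE TABLE urls (
    url_id INTEGER PRIMARY KEY AUTOINCREMENT,
    site_id INTEGER REFERENCES sites (site_id) NOT NULL,
    scheme TEXT NOT NULL,
    path TEXT NOT NULL,
    last_visited REAL DEFAULT 0,
    UNIQUE (site_id, scheme, path)
);
""")
conn.execute("INSERT INTO sites (netloc) VALUES ('example.com')")
conn.execute("INSERT INTO urls (site_id, scheme, path) VALUES (1, 'http', '/')")

# Bind time.time() as a named parameter instead of string-formatting it in.
rows = conn.execute(
    """SELECT u.url_id
       FROM urls u JOIN sites s ON u.site_id = s.site_id
       WHERE u.last_visited + s.crawl_frequency < :now""",
    {"now": time.time()},
).fetchall()
print(rows)  # [(1,)] -- the sample url is crawlable (last_visited = 0)
```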

A:

Here is a graceless query. There's probably a more clever way to do this.

SELECT s.*, u.*
FROM sites s JOIN urls u ON s.site_id = u.site_id
WHERE s.site_id IN
    (SELECT DISTINCT uu.site_id
     FROM urls uu INNER JOIN sites ss ON uu.site_id = ss.site_id
     WHERE uu.last_visited + ss.crawl_frequency < current_time
     ORDER BY ss.site_id
     LIMIT n);

The subquery is supposed to return up to n distinct site_ids that have at least one crawlable URL. The ORDER BY column needn't be site_id; in fact, ORDER BY isn't necessary at all. I just threw it in because consistent ordering is nice when playing with a new query.

The enclosing query returns all urls associated with those n distinct sites, each of which has at least one crawlable url. Note that not all returned urls are necessarily crawlable; the only guarantee is that each returned site has at least one crawlable url, and it may have non-crawlable ones too.
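To illustrate that behavior, here is a sketch with an invented toy dataset (table contents, the :now parameter, and the limit are all made up for the demo). Site 1 has one crawlable url and one freshly visited url, and both come back from the enclosing query:

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sites (site_id INTEGER PRIMARY KEY, crawl_frequency REAL);
CREATE TABLE urls (url_id INTEGER PRIMARY KEY, site_id INTEGER,
                   path TEXT, last_visited REAL);
""")
now = time.time()
conn.execute("INSERT INTO sites VALUES (1, 604800)")
conn.execute("INSERT INTO urls VALUES (1, 1, '/stale', 0)")          # crawlable
conn.execute("INSERT INTO urls VALUES (2, 1, '/fresh', ?)", (now,))  # just visited

# Same shape as the query above, with current_time and n bound as parameters.
rows = conn.execute(
    """SELECT u.path
       FROM sites s JOIN urls u ON s.site_id = u.site_id
       WHERE s.site_id IN
           (SELECT DISTINCT uu.site_id
            FROM urls uu INNER JOIN sites ss ON uu.site_id = ss.site_id
            WHERE uu.last_visited + ss.crawl_frequency < :now
            LIMIT :n)""",
    {"now": now, "n": 5},
).fetchall()
print(sorted(rows))  # [('/fresh',), ('/stale',)] -- the fresh url rides along
```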

If only crawlable urls should be returned, the timing condition can be repeated in the enclosing query's WHERE clause. I couldn't tell from the question which behavior was required.
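A sketch of that variant, again with invented sample data: repeating the crawlability test in the enclosing WHERE filters out the freshly visited url that the original query would have returned:

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sites (site_id INTEGER PRIMARY KEY, crawl_frequency REAL);
CREATE TABLE urls (url_id INTEGER PRIMARY KEY, site_id INTEGER,
                   path TEXT, last_visited REAL);
""")
now = time.time()
conn.execute("INSERT INTO sites VALUES (1, 604800)")
conn.execute("INSERT INTO urls VALUES (1, 1, '/stale', 0)")          # crawlable
conn.execute("INSERT INTO urls VALUES (2, 1, '/fresh', ?)", (now,))  # just visited

# The timing condition appears twice: once to pick sites, once to
# restrict the returned urls to crawlable ones only.
rows = conn.execute(
    """SELECT u.path
       FROM sites s JOIN urls u ON s.site_id = u.site_id
       WHERE u.last_visited + s.crawl_frequency < :now
         AND s.site_id IN
             (SELECT DISTINCT uu.site_id
              FROM urls uu INNER JOIN sites ss ON uu.site_id = ss.site_id
              WHERE uu.last_visited + ss.crawl_frequency < :now
              LIMIT :n)""",
    {"now": now, "n": 5},
).fetchall()
print(rows)  # [('/stale',)] -- only the crawlable url survives
```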

P.S. I'm indulging in pedantry now, but the way crawl_frequency is used makes me think it could be called crawl_period or crawl_delay instead.

Dan LaRocque
Thanks, that's close enough to what I want that I can get the rest myself. I'll edit my question later to give the full solution.
aaronasterling