The database schema is:
CREATE TABLE sites
(
    site_id INTEGER PRIMARY KEY AUTOINCREMENT,
    netloc TEXT UNIQUE NOT NULL,
    last_visited REAL DEFAULT 0,
    crawl_rate REAL DEFAULT 2,
    crawl_frequency REAL DEFAULT 604800,
    robots_txt TEXT DEFAULT 0,
    robots_last_fetch REAL DEFAULT 0,
    allow_all NUMERIC DEFAULT 0,
    disallow_all NUMERIC DEFAULT 0,
    active NUMERIC DEFAULT 0
);
CREATE TABLE urls
(
    url_id INTEGER PRIMARY KEY AUTOINCREMENT,
    site_id INTEGER REFERENCES sites (site_id) NOT NULL,
    scheme TEXT NOT NULL,
    path TEXT NOT NULL,
    last_visited REAL DEFAULT 0,
    UNIQUE (site_id, scheme, path)
);
As you can probably guess, this is for a web crawler.
I want to get N sites that have crawlable urls associated with them, along with all of
those urls. A url is crawlable if url.last_visited + site.crawl_frequency < current_time,
where current_time comes from Python's time.time()
function. What I'm looking for will probably begin with something like:
SELECT s.*, u.* FROM sites s JOIN urls u ON s.site_id = u.site_id ...
Beyond that, all I can think of is that GROUP BY
might have a role to play.
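To make the goal concrete, here is a runnable sketch of the kind of query I'm imagining, against an in-memory copy of the schema. The subquery-with-LIMIT shape and the sample data are just my guesses at an approach, not a known-good answer:

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sites
(
    site_id INTEGER PRIMARY KEY AUTOINCREMENT,
    netloc TEXT UNIQUE NOT NULL,
    last_visited REAL DEFAULT 0,
    crawl_rate REAL DEFAULT 2,
    crawl_frequency REAL DEFAULT 604800,
    robots_txt TEXT DEFAULT 0,
    robots_last_fetch REAL DEFAULT 0,
    allow_all NUMERIC DEFAULT 0,
    disallow_all NUMERIC DEFAULT 0,
    active NUMERIC DEFAULT 0
);
CREATE TABLE urls
(
    url_id INTEGER PRIMARY KEY AUTOINCREMENT,
    site_id INTEGER REFERENCES sites (site_id) NOT NULL,
    scheme TEXT NOT NULL,
    path TEXT NOT NULL,
    last_visited REAL DEFAULT 0,
    UNIQUE (site_id, scheme, path)
);
""")

now = time.time()

# Two sample sites: site 1 has a never-visited url (crawlable),
# site 2's url was visited just now (not crawlable for another week).
conn.execute("INSERT INTO sites (netloc) VALUES ('example.com')")
conn.execute("INSERT INTO sites (netloc) VALUES ('fresh.example')")
conn.execute("INSERT INTO urls (site_id, scheme, path) VALUES (1, 'https', '/')")
conn.execute(
    "INSERT INTO urls (site_id, scheme, path, last_visited) VALUES (2, 'https', '/', ?)",
    (now,),
)

# Inner query: pick up to :n site_ids that have at least one crawlable url.
# Outer query: fetch every crawlable url for those sites.
QUERY = """
SELECT s.*, u.*
FROM sites s
JOIN urls u ON u.site_id = s.site_id
WHERE u.last_visited + s.crawl_frequency < :now
  AND s.site_id IN (
      SELECT s2.site_id
      FROM sites s2
      JOIN urls u2 ON u2.site_id = s2.site_id
      WHERE u2.last_visited + s2.crawl_frequency < :now
      GROUP BY s2.site_id
      LIMIT :n
  )
"""

rows = conn.execute(QUERY, {"now": now, "n": 5}).fetchall()
# With the sample data above, only example.com's url comes back.
```

I have no idea whether this is the idiomatic way to cap the result at N sites, or whether the crawlable condition should be hoisted somewhere it's only evaluated once.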