tags:

views: 369

answers: 5

I have a bunch of URLs stored in a table waiting to be scraped by a script. However, many of those URLs are from the same site. I would like to return those URLs in a "site-friendly" order (that is, try to avoid two URLs from the same site in a row) so I won't be accidentally blocked by making too many http requests in a short time.

The database layout is something like this:

create table urls (
    site varchar,       -- holds e.g. www.example.com or stackoverflow.com
    url varchar unique
);
Example result:
SELECT url FROM urls ORDER BY mysterious_round_robin_function(site);

http://www.example.com/some/file
http://stackoverflow.com/questions/ask
http://use.perl.org/
http://www.example.com/some/other/file
http://stackoverflow.com/tags

I thought of something like "ORDER BY site <> @last_site DESC" but I have no idea how to go about writing something like that.

+3  A: 

I think you're overcomplicating this. Why not just use

ORDER BY NewID()

Nissan Fan
Nice! Beautifully simplistic.
Deinumite
Is that similar to RANDOM()? If it is, it's not quite good enough. There are a couple of sites that comprise 10-30% of the table.
hhaamu
@hhaamu: I think that you're underestimating the power of RANDOM :)
wuub
+2  A: 

See this article in my blog for a more detailed explanation of how it works:

With new PostgreSQL 8.4:

SELECT  *
FROM    (
        SELECT  site, url, ROW_NUMBER() OVER (PARTITION BY site ORDER BY url) AS rn
        FROM    urls
        ) q
ORDER BY
        rn, site
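The same ROW_NUMBER() round-robin trick can be tried out with an in-memory SQLite database (window functions need SQLite 3.25+); the table and data below are made up from the question's example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE urls (site TEXT, url TEXT UNIQUE)")
conn.executemany("INSERT INTO urls VALUES (?, ?)", [
    ("www.example.com", "http://www.example.com/some/file"),
    ("www.example.com", "http://www.example.com/some/other/file"),
    ("stackoverflow.com", "http://stackoverflow.com/questions/ask"),
    ("stackoverflow.com", "http://stackoverflow.com/tags"),
    ("use.perl.org", "http://use.perl.org/"),
])

# Number each site's URLs 1..n; ordering by that number first means every
# site's 1st URL comes out before any site's 2nd URL.
ordered = [url for (url,) in conn.execute("""
    SELECT url FROM (
        SELECT site, url,
               ROW_NUMBER() OVER (PARTITION BY site ORDER BY url) AS rn
        FROM urls
    ) ORDER BY rn, site
""")]
print(ordered)
```

No two consecutive URLs share a site, which is exactly the "site-friendly" order the question asks for.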

With older versions:

SELECT  site,
        (
        SELECT  url
        FROM    urls ui
        WHERE   ui.site = sites.site
        ORDER BY
                url
        OFFSET  total
        LIMIT   1
        ) AS url
FROM    ( 
        SELECT  site, generate_series(0, cnt - 1) AS total
        FROM    (
                SELECT  site, COUNT(*) AS cnt
                FROM    urls
                GROUP BY
                        site
                ) s
        ) sites
ORDER BY
        total, site

It can, though, be less efficient.
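The pre-8.4 query above effectively takes "the n-th URL of each site" for n = 0, 1, 2, ...; as a rough illustration, the same interleaving can be sketched in Python with `itertools.zip_longest` (site names and URLs are the question's example data):

```python
from itertools import zip_longest

per_site = {
    "stackoverflow.com": ["http://stackoverflow.com/questions/ask",
                          "http://stackoverflow.com/tags"],
    "use.perl.org": ["http://use.perl.org/"],
    "www.example.com": ["http://www.example.com/some/file",
                        "http://www.example.com/some/other/file"],
}

# zip_longest pairs up the n-th element of every site's list; None pads
# the shorter lists and is filtered out.
ordered = [u for group in zip_longest(*per_site.values())
           for u in group if u is not None]
```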

Quassnoi
The last query *really* needs to be checked for efficiency if your table is large.
Quassnoi
Thanks, it works perfectly! Looks like voodoo to me, but I'll figure it out someday.
hhaamu
I'm still using 8.3, but the table is only ~150 rows so far.
hhaamu
@hhaamu: note that the query time for the last query will grow exponentially. I'd test it on the maximum number of records you are planning to reach.
Quassnoi
@Quassnoi: Will do. Very much doubt there will be 1000 rows. And the query won't be run all that often (once a day or less).
hhaamu
+1  A: 

You are asking for round-robin, but I think a simple

SELECT site, url FROM urls ORDER BY RANDOM()

will do the trick. It should work even if URLs from the same site are clustered in the database.

wuub
A: 

If the URLs don't change very often, you can come up with a somewhat-complicated job that you could run periodically (nightly?) which would assign integers to each record based on the different sites present.

What you can do is write a routine that parses the domain out from a URL (you should be able to find a snippet that does this nearly anywhere).

Then, you create a temporary table that contains each unique domain, plus a number.

Then, for every record in your URLs table, you look up the domain in your temp table, assign that record the number stored there, and add a large number to that temp table's number.

Then for the rest of the day, sort by the number.


Here's an example with the five records you used in your question:

URLs:

http://www.example.com/some/file         NULL
http://www.example.com/some/other/file   NULL
http://stackoverflow.com/questions/ask   NULL
http://stackoverflow.com/tags            NULL
http://use.perl.org/                     NULL

Temp table:

example.com       1
stackoverflow.com 2
perl.org          3

Then for each URL, you look up the value in the temp table, and add 3 to it (because there are 3 distinct sites):

Iteration 1:

URLs:

http://www.example.com/some/file         1
http://www.example.com/some/other/file   NULL
http://stackoverflow.com/questions/ask   NULL
http://stackoverflow.com/tags            NULL
http://use.perl.org/                     NULL

Temp table:

example.com       4
stackoverflow.com 2
perl.org          3

Iteration 2:

URLs:

http://www.example.com/some/file         1
http://www.example.com/some/other/file   4
http://stackoverflow.com/questions/ask   NULL
http://stackoverflow.com/tags            NULL
http://use.perl.org/                     NULL

Temp table:

example.com       7
stackoverflow.com 2
perl.org          3

et cetera until you get to

http://www.example.com/some/file         1
http://www.example.com/some/other/file   4
http://stackoverflow.com/questions/ask   2
http://stackoverflow.com/tags            5
http://use.perl.org/                     3

For a lot of records it's going to be slow, and it will be awkward to maintain under frequent inserts and deletions, but the result is a flawless round-robin ordering.
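The counter scheme above can be sketched in plain Python (the function name and the use of `urlparse` to extract the site are illustrative choices, not part of the original answer):

```python
from urllib.parse import urlparse

def assign_round_robin(urls):
    """Assign each URL a sort key from a per-site counter that is
    bumped by the number of distinct sites, as described above."""
    sites = {urlparse(u).netloc for u in urls}
    step = len(sites)                        # 3 in the worked example
    # Seed each site with a distinct starting number 1..step.
    counter = {site: i + 1 for i, site in enumerate(sorted(sites))}
    order = {}
    for u in urls:
        site = urlparse(u).netloc
        order[u] = counter[site]
        counter[site] += step                # push the site's next URL back
    return sorted(urls, key=order.get)

urls = [
    "http://www.example.com/some/file",
    "http://www.example.com/some/other/file",
    "http://stackoverflow.com/questions/ask",
    "http://stackoverflow.com/tags",
    "http://use.perl.org/",
]
ordered = assign_round_robin(urls)
```

Sites end up evenly spaced in `ordered`, matching the final numbering in the worked example.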

Welbog
A: 

There is a much simpler and faster solution...

  • add a sort_order column of type TEXT
  • add an ON INSERT trigger which sets sort_order to md5( url )
  • index on sort_order
  • grab the rows in (sort_order, primary key) order

It's very fast (the scan is indexed), and rows will come out in a repeatable yet random-looking order.
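The effect of the md5 sort key can be sketched in Python: hashing each URL gives a repeatable pseudo-random key, so sorting by it scatters same-site URLs (the data is the question's example, not part of this answer):

```python
import hashlib

urls = [
    "http://www.example.com/some/file",
    "http://www.example.com/some/other/file",
    "http://stackoverflow.com/questions/ask",
    "http://stackoverflow.com/tags",
    "http://use.perl.org/",
]

# Precompute each URL's hex digest, then sort by it -- the same order
# the indexed sort_order column would give in the database.
keys = {u: hashlib.md5(u.encode()).hexdigest() for u in urls}
ordered = sorted(urls, key=keys.get)
```

Note this only scatters same-site URLs statistically; unlike the round-robin answers, it doesn't guarantee that no two adjacent URLs share a site.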

peufeu