views:

153

answers:

3

I have a table full of items from different sources. Some of the sources might have the same location (in my example, different BBC news feeds would be different sources, but they all come from the BBC). Each item has a "unique" ID which can be used to identify it among others from the same location. This means that items relating to the same news story on a site but published under different feeds will have the same "unique ID", but that's not necessarily globally unique.

The problem is that I want to eliminate duplicates at display time, so that (depending which feeds you're seeing) you only get at most one version of each story, even though two or three of your feeds might contain links to it.

I have a sources table with information about each source, and location_id and location_precedence fields. I then have an items table that contains each item, its unique_id, source_id, and content. Items with the same unique_id and source location_id should appear at most once, with the highest source location_precedence winning.

I would have thought that something like:

SELECT `sources`.`name` AS `source`,
       `items`.`content`,
       `items`.`published`
FROM `items` INNER JOIN `sources`
  ON `items`.`source_id` = `sources`.`id` AND `sources`.`active` = 1
GROUP BY `items`.`unique_id`, `sources`.`location_id`
ORDER BY `sources`.`location_priority` DESC

would do the trick, but that seems to ignore the location priority field. What have I missed?


Example data:

CREATE TABLE IF NOT EXISTS `sources` (
  `id` int(10) unsigned NOT NULL auto_increment,
  `location_id` int(10) unsigned NOT NULL,
  `location_priority` int(11) NOT NULL,
  `active` tinyint(1) unsigned NOT NULL default '1',
  `name` varchar(150) NOT NULL,
  `url` text NOT NULL,
  PRIMARY KEY  (`id`),
  KEY `active` (`active`)
);

INSERT INTO `sources` (`id`, `location_id`, `location_priority`, `active`, `name`, `url`) VALUES
(1, 1, 25, 1, 'BBC News Front Page', 'http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/front_page/rss.xml'),
(2, 1, 10, 1, 'BBC News England', 'http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/england/rss.xml'),
(3, 1, 15, 1, 'BBC Technology News', 'http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/technology/rss.xml'),
(4, 2, 0, 1, 'Slashdot', 'http://rss.slashdot.org/Slashdot/slashdot'),
(5, 3, 0, 1, 'The Daily WTF', 'http://syndication.thedailywtf.com/TheDailyWtf');

CREATE TABLE IF NOT EXISTS `items` (
  `id` bigint(20) unsigned NOT NULL auto_increment,
  `source_id` int(10) unsigned NOT NULL,
  `published` datetime NOT NULL,
  `content` text NOT NULL,
  `unique_id` varchar(255) NOT NULL,
  PRIMARY KEY  (`id`),
  UNIQUE KEY `unique_id` (`unique_id`,`source_id`),
  KEY `published` (`published`),
  KEY `source_id` (`source_id`)
);

INSERT INTO `items` (`id`, `source_id`, `published`, `content`, `unique_id`) VALUES
(1,  1, '2009-12-01 16:25:53', 'Story about Subject One',                     'abc'),
(2,  2, '2009-12-01 16:21:31', 'Subject One in story',                        'abc'),
(3,  3, '2009-12-01 16:17:20', 'Techy goodness',                              'def'),
(4,  2, '2009-12-01 16:05:57', 'Further updates on Foo case',                 'ghi'),
(5,  3, '2009-12-01 15:53:39', 'Foo, Bar and Quux in court battle',           'ghi'),
(6,  2, '2009-12-01 15:52:02', 'Anti-Fubar protests cause disquiet',          'mno'),
(7,  4, '2009-12-01 15:39:00', 'Microsoft Bleh meets lukewarm reception',     'pqr'),
(8,  5, '2009-12-01 15:13:45', 'Ever thought about doing it in VB?',          'pqr'),
(9,  1, '2009-12-01 15:13:15', 'Celebrity has 'new friend'',        'pqr'),
(10, 1, '2009-12-01 15:09:57', 'Microsoft launches Bleh worldwide',           'stu'),
(11, 2, '2009-12-01 14:57:22', 'Microsoft launches Bleh in UK',               'stu'),
(12, 3, '2009-12-01 14:57:22', 'Microsoft launches Bleh',                     'stu'),
(13, 3, '2009-12-01 14:42:15', 'Tech round-up',                               'vwx'),
(14, 2, '2009-12-01 14:36:26', 'Estates 'old news' say government', 'yza'),
(15, 1, '2009-12-01 14:15:21', 'Iranian doctor 'was poisoned'',     'bcd'),
(16, 4, '2009-12-01 14:14:02', 'Apple fans overjoyed by iBlah',               'axf');

Expected content after query:

  • Story about Subject One
  • Techy goodness
  • Foo, Bar and Quux in court battle
  • Anti-Fubar protests cause disquiet
  • Microsoft Bleh meets lukewarm reception
  • Ever thought about doing it in VB?
  • Celebrity has 'new friend'
  • Microsoft launches Bleh worldwide
  • Tech round-up
  • Estates 'old news' say government
  • Iranian doctor 'was poisoned'
  • Apple fans overjoyed by iBlah


I've tried a variation on the solution by Andomar, with some success:

SELECT      s.`name` AS `source`,
            i.`content`,
            i.`published`
FROM        `items` i
INNER JOIN  `sources` s
ON          i.`source_id` = s.`id`
AND         s.`active` = 1
INNER JOIN (
  SELECT `unique_id`, `source_id`, MAX(`location_priority`) AS `prio` 
  FROM `items` i
  INNER JOIN `sources` s ON s.`id` = i.`source_id` AND s.`active` = 1
  GROUP BY `location_id`, `unique_id`
) `filter`
ON          i.`unique_id` = `filter`.`unique_id`
AND         s.`location_priority` = `filter`.`prio`
ORDER BY    i.`published` DESC
LIMIT 50

With AND s.location_priority = filter.prio things almost work as I want. Because an item can come from multiple sources with the same priority, items can be repeated. In this case, an extra GROUP BY i.unique_id on the outer query does the job, and I suppose it doesn't matter which source "wins" if priorities are equal.

I had tried with AND i.source_id = filter.source_id instead, which almost works (i.e. eliminates the extra GROUP BY) but doesn't give results from the right sources. In the example above, it gives me "Further updates on Foo case" (source "BBC News England") rather than "Foo, Bar and Quux in court battle" (source "BBC Technology News". Looking at the results of the inner query, I get:

unique_id: 'ghi'
source_id: 2
prio: 15

Note that the source ID is not correct (expected: 3).

+1  A: 

do a self join to a derived table like

select max(location_priority) from table where ...
Paul Creasey
A: 

What have I missed?

The ORDER BY happens after the GROUP BY has already reduced each group to a single row. Paul gives one resolution.

As for the problem with the query:

SELECT `unique_id`, `source_id`, MAX(`location_priority`) AS `prio` 
FROM `items` i
INNER JOIN `sources` s ON s.`id` = i.`source_id` AND s.`active` = 1
GROUP BY `location_id`, `unique_id`

source_id is neither an aggregate nor grouped. As a result, which value you get is indeterminate.

outis
This won't work: you can't use a non-aggregated column in the HAVING clause. Even if you could, this would hide all stories that have an inactive source with a high priority.
Andomar
@Andormar: in MySQL, you can. The join ensures that highest-priority inactive sources are never under consideration. The real problem is that HAVING apparently filters after GROUP BY has reduced the rows.
outis
@outis: I think you can use them in SELECT, but in `HAVING` they give an `unknown column` error
Andomar
@Andomar: it gave no error when I tried it with MySQL 5.1.41. What verion did you test on?
outis
@Andomar: also, see http://dev.mysql.com/doc/refman/5.0/en/group-by-hidden-columns.html. Unless ONLY_FULL_GROUP_BY mode is enabled, non-aggregate columns can be used in the HAVING clause.
outis
I hadn't thought about the order of operations... to me, it made sense to use ordering within groups! Thanks :-)
Dave
@outis: I'm running 5.1.37, but you're right, must've been a typo. Note that the MySQL doc says the results are indeterminate: "This extension assumes that the nongrouped columns will have the same group-wise values. Otherwise, the result is indeterminate."
Andomar
@Andomar: Yup, because the `HAVING` clause comes into effect after the `GROUP BY`.
outis
@Andomar: looks like we've reached the point where we're saying the same thing in different ways.
outis
+3  A: 

Order by merely orders the rows, it doesn't pick among them.

One of the ways to filter out rows with a lower location_priority is to use an inner join as filter:

SELECT     s.name, i.content, i.published
FROM       items i 
INNER JOIN sources s
ON         i.source_id = s.id
AND        s.active = 1
INNER JOIN (
    SELECT unique_id, max(location_priority) as prio
    FROM items i
    INNER JOIN sources s ON s.id = i.source_id AND s.active = 1
    GROUP BY unique_id) filter
ON         i.unique_id = filter.unique_id
AND        s.location_priority = filter.prio;

An alternative is a where ... in <subquery> clause, for example:

SELECT     s.name, i.content, i.published
FROM       items i 
INNER JOIN sources s
ON         i.source_id = s.id
AND        s.active = 1
WHERE      (i.unique_id, s.location_priority) IN (
    SELECT unique_id, max(location_priority)
    FROM items i
    INNER JOIN sources s ON s.id = i.source_id AND s.active = 1
    GROUP BY unique_id
);

This problem is also known as "Selecting records holding a group-wide maximum." Quassnoi has written a nice article on it.

EDIT: One way to break ties with multiple sources at the same priority is a WHERE clause with a subquery. This example breaks ties on i.id DESC:

SELECT     s.name, i.unique_id, i.content, i.published
FROM       (
           SELECT unique_id, min(location_priority) as prio
           FROM items i
           INNER JOIN sources s ON s.id = i.source_id AND s.active = 1
           GROUP BY unique_id
           ) filter
JOIN       items i
JOIN       sources s
ON         s.id = i.source_id 
           AND s.active = 1
WHERE      i.id =
           (
           SELECT   i.id
           FROM     items i
           JOIN     sources s 
           ON       s.id = i.source_id 
                    AND s.active = 1
           WHERE    i.unique_id = filter.unique_id
           AND      s.location_priority = filter.prio
           ORDER BY i.id DESC
           LIMIT 1
           )

Quassnoi also has an article on selecting records holding group-wise maximum (resolving ties) :)

Andomar
Thanks! The article (and knowing how to describe the problem) is very useful.
Dave
See also: http://dev.mysql.com/doc/refman/5.1/en/example-maximum-column-group-row.html
outis
Argh. So I've tried this solution, but it doesn't seem to be working. I've updated the main post with details.
Dave