I'm developing an RSS feed reader that uses a Bayesian filter to filter out boring blog posts.

The Stream table is meant to act as a FIFO buffer from which the webapp will consume 'entries'. I use it to store the temporary relationship between entries, users and Bayesian filter classifications.

After a user marks an entry as read, it is added to the metadata table (so that the user isn't presented with material they have already read) and deleted from the Stream table. Every three minutes, a background process repopulates the Stream table with new entries (i.e. whenever the daemon adds new entries after it checks the RSS feeds for updates).
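For reference, a rough sketch of the tables involved (simplified, with assumed column types; the real schema has more fields):

-- entries fetched from the feeds
create table entries (
  id              int primary key,
  subscription_id int,
  published       datetime
);

-- which users are subscribed to which feeds
create table subscriptions_users (
  subscription_id int,
  user_id         int
);

-- entries a user has already read
create table metadata (
  user_id  int,
  entry_id int
);

-- the FIFO buffer the webapp consumes from
create table stream (
  entry_id int,
  user_id  int
);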

Problem: The query I came up with is hella slow. More importantly, the Stream table only needs to hold one hundred unread entries per user at a time; capping it will reduce duplication, make processing faster and give me some flexibility in how I display the entries.

The query (takes about 9 seconds on 3600 items with no indexes):

insert into stream (entry_id, user_id)
select entries.id, subscriptions_users.user_id
  from entries
 inner join subscriptions_users
    on subscriptions_users.subscription_id = entries.subscription_id
 where subscriptions_users.user_id = 1
   and entries.id not in (select entry_id
                            from metadata
                           where metadata.user_id = 1)
   and entries.id not in (select entry_id
                            from stream
                           where user_id = 1);

The query explained: insert into stream all of the entries from a user's subscription list (subscriptions_users) that the user has not read (i.e. that do not exist in metadata) and that do not already exist in the stream.

Attempted solution: adding LIMIT 100 to the end speeds up the query considerably, but on repeated executions it keeps adding a different set of 100 entries that do not already exist in the table (with each successive query taking longer and longer).

This is close but not quite what I wanted to do.
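The closest workaround I can think of is to 'top up' the table from application code: compute the deficit first, then bind it as the limit in a prepared statement (a sketch, assuming MySQL, since LIMIT can't take a subquery):

-- step 1 (application side): how many free slots does this user have?
select 100 - count(*) as deficit from stream where user_id = 1;

-- step 2: re-run the insert with the deficit bound as the limit
insert into stream (entry_id, user_id)
select entries.id, subscriptions_users.user_id
  from entries
 inner join subscriptions_users
    on subscriptions_users.subscription_id = entries.subscription_id
 where subscriptions_users.user_id = 1
   and entries.id not in (select entry_id from metadata where user_id = 1)
   and entries.id not in (select entry_id from stream where user_id = 1)
 order by entries.published desc
 limit ?;  -- bind the value from step 1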

Does anyone have any advice (nosql?) or know a more efficient way of composing the query?

+1  A: 

The query (takes about 9 seconds on 3600 items with no indexes):

Then I would try to start off with some indexes...
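For example (index names and columns are guesses from the query; adjust to your schema):

-- cover the join and filter columns the query touches
create index idx_su_user_subscription on subscriptions_users (user_id, subscription_id);
create index idx_entries_subscription on entries (subscription_id);
create index idx_metadata_user_entry  on metadata (user_id, entry_id);
create index idx_stream_user_entry    on stream (user_id, entry_id);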

Or use LEFT JOIN ... IS NULL (and indexes):

SELECT *
  FROM TABLEA A
  LEFT JOIN TABLEB B ON A.ID = B.ID
 WHERE B.ID IS NULL
astander
Oh, of course, but I still want to restrict the table membership to a hundred per user (assuming that's a good idea at all), and while my SQL is rusty it *feels* like the query is very inefficient to begin with.
hiffy
Yes. *NOT IN* sub-selects can be replaced with **LEFT JOIN ... IS NULL**. Also, MySQL's LIMIT and MSSQL's TOP restrict the number of rows returned.
astander
@OMG Ponies: NOT IN is slower, but NOT EXISTS can be as fast if used with an outer reference.
astander
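The NOT EXISTS form with an outer reference would look something like this (a sketch against the question's tables):

select entries.id, subscriptions_users.user_id
  from entries
 inner join subscriptions_users
    on subscriptions_users.subscription_id = entries.subscription_id
 where subscriptions_users.user_id = 1
   -- correlated on the outer entries.id, so the optimizer can stop at the first match
   and not exists (select 1 from metadata md
                    where md.user_id = 1 and md.entry_id = entries.id)
   and not exists (select 1 from stream s
                    where s.user_id = 1 and s.entry_id = entries.id);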
A: 

One way to optimize the select is to replace the subqueries with joins.

Something like:

select entries.id, subscriptions_users.user_id
  from entries
 inner join subscriptions_users
    on subscriptions_users.subscription_id = entries.subscription_id
  left join metadata md
    on md.user_id = subscriptions_users.user_id
   and md.entry_id = entries.id
  left join stream str
    on str.user_id = subscriptions_users.user_id
   and str.entry_id = entries.id
 where subscriptions_users.user_id = 1
   and md.user_id is null
   and str.user_id is null;

You would have to make sure that the join conditions for the left joins are correct. I am not sure what your exact schema is, so I can't verify them.

Adding indexes would also help.

Anatoly Fayngelerin
+1  A: 

Use:

INSERT INTO STREAM 
  (entry_id, user_id) 
   SELECT e.id, 
          su.user_id 
     FROM ENTRIES e
     JOIN SUBSCRIPTIONS_USERS su ON su.subscription_id = e.subscription_id 
                                AND su.user_id = 1 
LEFT JOIN METADATA md ON md.entry_id = e.id
                     AND md.user_id = 1
LEFT JOIN STREAM s ON s.entry_id = e.id
                  AND s.user_id = 1
    WHERE md.entry_id IS NULL
      AND s.entry_id IS NULL

In MySQL, LEFT JOIN ... IS NULL is the most efficient means of getting data that exists in one table but not another.

Check the query performance before looking at indexes.
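For example, running EXPLAIN on the SELECT half shows which index (if any) each join uses:

EXPLAIN
   SELECT e.id,
          su.user_id
     FROM ENTRIES e
     JOIN SUBSCRIPTIONS_USERS su ON su.subscription_id = e.subscription_id
                                AND su.user_id = 1
LEFT JOIN METADATA md ON md.entry_id = e.id
                     AND md.user_id = 1
LEFT JOIN STREAM s ON s.entry_id = e.id
                  AND s.user_id = 1
    WHERE md.entry_id IS NULL
      AND s.entry_id IS NULL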

In Postgres:

  • NOT IN
  • NOT EXISTS
  • LEFT JOIN / IS NULL

...are equivalent.

OMG Ponies
Assuming the values Sequel Pro returns are accurate, on an empty stream table that insert takes less time (a total of 7 seconds vs 12 seconds), but on a filled table it takes longer (12 seconds vs 8 seconds). I haven't had time to experiment with adding indexes - I would imagine they might improve things; I'll report back later. Also: any ideas on how to limit it to filling it with only one hundred items :D?
hiffy
@hiffy: Try adding `LIMIT 100` to the end of the query. Could the `METADATA` and `STREAM` entry_id columns allow null values?
OMG Ponies
@OMG, adding LIMIT 100 will add a different hundred entries every time it's executed until the unread entries are exhausted; I wanted it to insert one hundred at most each time it's executed, a kind of 'replenishing' of the stream table. 30 entries have been read? This time around insert the top 30 that match. As to the schema, entry_ids can never be null, if that is what you are asking.
hiffy
@hiffy: Use an `ORDER BY` clause to ensure consistency.
OMG Ponies
@OMG No dice. I added an `ORDER BY e.published DESC LIMIT 100` to the very end, and it keeps inserting a fresh one hundred entries (instead of the desired behaviour of 0 rows being affected, since no new rows were read). Am I misunderstanding something? It seems to be the only place I can syntactically place that restriction in the query. (Thanks so much for your help so far! I'd upvote you if I had the reputation).
hiffy
@hiffy: To stop duplicates from being inserted, you need to specify a column as a primary key.
OMG Ponies
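A sketch of that last suggestion, assuming MySQL: a composite primary key on STREAM makes the (entry_id, user_id) pair unique, and INSERT IGNORE then skips rows that are already there, so a re-run only tops the table up instead of duplicating it:

-- duplicate (entry_id, user_id) pairs become impossible
ALTER TABLE STREAM ADD PRIMARY KEY (entry_id, user_id);

-- IGNORE skips rows that would violate the key, so at most the missing
-- portion of the top 100 is inserted on each run; the LEFT JOIN against
-- STREAM is no longer needed because the key now handles duplicates
INSERT IGNORE INTO STREAM
  (entry_id, user_id)
   SELECT e.id,
          su.user_id
     FROM ENTRIES e
     JOIN SUBSCRIPTIONS_USERS su ON su.subscription_id = e.subscription_id
                                AND su.user_id = 1
LEFT JOIN METADATA md ON md.entry_id = e.id
                     AND md.user_id = 1
    WHERE md.entry_id IS NULL
 ORDER BY e.published DESC
    LIMIT 100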