I'm developing an RSS feed reader that uses a Bayesian filter to filter out boring blog posts.
The Stream table is meant to act as a FIFO buffer from which the webapp will consume 'entries'. I use it to store the temporary relationships between entries, users and Bayesian filter classifications.
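For reference, a rough sketch of the tables involved (only the columns referenced in the query below are definite; the rest is approximate):

create table entries (
    id              integer primary key,
    subscription_id integer not null
    -- plus the usual title/content/date columns
);

create table subscriptions_users (
    subscription_id integer not null,
    user_id         integer not null
);

create table metadata (
    -- entries a user has already read
    entry_id integer not null,
    user_id  integer not null
);

create table stream (
    -- the FIFO buffer described above
    entry_id integer not null,
    user_id  integer not null
);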
After a user marks an entry as read, it will be added to the metadata table (so that the user isn't presented with material they have already read) and deleted from the Stream table. Every three minutes, a background process repopulates the Stream table with new entries (i.e. whenever the daemon adds new entries after checking the RSS feeds for updates).
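The mark-as-read step is roughly the following (just a sketch; the literal entry id is a placeholder):

begin;
-- remember that user 1 has read entry 42
insert into metadata (entry_id, user_id) values (42, 1);
-- and drop it from the buffer
delete from stream where entry_id = 42 and user_id = 1;
commit;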
Problem: The query I came up with is hella slow. More importantly, the Stream table only needs to hold one hundred unread entries at a time; capping it there would reduce duplication, make processing faster and give me some flexibility with how I display the entries.
The query (takes about 9 seconds on 3600 items with no indexes):
insert into stream (entry_id, user_id)
select entries.id, subscriptions_users.user_id
from entries
inner join subscriptions_users
    on subscriptions_users.subscription_id = entries.subscription_id
where subscriptions_users.user_id = 1
  and entries.id not in (select entry_id
                         from metadata
                         where metadata.user_id = 1)
  and entries.id not in (select entry_id
                         from stream
                         where user_id = 1);
The query explained: insert into the stream all entries from a user's subscription list (subscriptions_users) that the user has not read (i.e. that do not exist in metadata) and that are not already in the stream.
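As noted, there are no indexes yet; the obvious candidates would presumably be something like the following, though I haven't measured whether they help:

create index entries_subscription_idx on entries (subscription_id);
create index subs_users_idx on subscriptions_users (user_id, subscription_id);
create index metadata_user_entry_idx on metadata (user_id, entry_id);
create index stream_user_entry_idx on stream (user_id, entry_id);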
Attempted solution: adding limit 100 to the end speeds up the query considerably, but since there's no order by, repeated executions keep adding a different arbitrary set of 100 entries that do not already exist in the table, with each successive query taking longer and longer (the not-in subquery against stream has more rows to check every time).
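That is, the same query as above with the limit tacked onto the end:

insert into stream (entry_id, user_id)
select entries.id, subscriptions_users.user_id
from entries
inner join subscriptions_users
    on subscriptions_users.subscription_id = entries.subscription_id
where subscriptions_users.user_id = 1
  and entries.id not in (select entry_id from metadata where metadata.user_id = 1)
  and entries.id not in (select entry_id from stream where user_id = 1)
limit 100;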
This is close but not quite what I wanted to do.
Does anyone have any advice (NoSQL?) or know a more efficient way of composing the query?