ansaurus

Question

How to skip known entries when syncing with Google Reader?

Answer 1

+1 A:

The Google Reader API hasn't been officially released yet; once it is, this answer may change.

Currently, you would have to call the API and disregard items already downloaded, which, as you said, isn't terribly efficient: you re-download items every time, even if you already have them.

Sohnee 2009-06-15 12:10:47

Yes, it isn't efficient, and the inefficiency grows with the number of articles the client is configured to hold, which quickly becomes a real barrier. In NewsRob I limit the capacity to 500 articles. I asked this question to get rid of that limitation. As time goes by, I doubt there will ever be an official release.

Mariano Kamp 2009-06-18 13:59:56

Answer 2

+5 A:

To get the latest entries, use the standard from-newest-date-descending download, which will start from the latest entries. You will receive a "continuation" token in the XML result, looking something like this:

<gr:continuation>CArhxxjRmNsC</gr:continuation>
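As a minimal sketch, the token could be pulled out of the response with Python's standard XML parser. The function name extract_continuation is hypothetical, and the code matches on the element's local name only, since the exact namespace URI behind the gr: prefix is not given here.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def extract_continuation(atom_xml: str) -> Optional[str]:
    """Return the text of the first <gr:continuation> element, or None."""
    root = ET.fromstring(atom_xml)
    for element in root.iter():
        # ElementTree renders namespaced tags as "{namespace-uri}localname",
        # so compare only the part after the closing brace.
        if element.tag.rsplit("}", 1)[-1] == "continuation":
            return (element.text or "").strip() or None
    return None
```

A result of None would mean there are no further pages to fetch.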

Scan through the results, pulling out anything new to you. You should find that either all results are new, or everything up to a point is new, and all after that are already known to you.

In the latter case, you're done, but in the former you need to find the new stuff older than what you've already retrieved. Do this by passing the continuation token as the c parameter of the next GET request, which returns results starting just after the last item in the set you just retrieved, e.g.:

http://www.google.com/reader/atom/user/-/state/com.google/reading-list?c=CArhxxjRmNsC

Continue this way until you have everything.
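The loop described above can be sketched roughly as follows. This is only an illustration: fetch_page is a hypothetical callable standing in for the actual HTTP request and XML parsing, and it is assumed to return a list of item ids (newest first) together with the next continuation token, or None when there are no more pages.

```python
from typing import Callable, List, Optional, Set, Tuple

# Hypothetical page shape: (item ids newest-first, continuation token or None).
Page = Tuple[List[str], Optional[str]]


def fetch_new_ids(fetch_page: Callable[[Optional[str]], Page],
                  known: Set[str]) -> List[str]:
    """Walk newest-first pages until a known item or the end of the list."""
    new_ids: List[str] = []
    continuation: Optional[str] = None
    while True:
        ids, continuation = fetch_page(continuation)
        for item_id in ids:
            if item_id in known:
                # Everything after this point is assumed to be stored already.
                return new_ids
            new_ids.append(item_id)
        if continuation is None or not ids:
            return new_ids
```

The function stops at the first known item, so no further pages are requested once the local database has been reached.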

The n parameter, which is a count of the number of items to retrieve, works well with this, and you can change it as you go. If the frequency of checking is user-set, and thus could be very frequent or very rare, you can use an adaptive algorithm to reduce network traffic and your processing load. Initially request a small number of the latest entries, say five (add n=5 to the URL of your GET request). If all are new, in the next request, where you use the continuation, ask for a larger number, say, 20. If those are still all new, either the feed has a lot of updates or it's been a while, so continue on in groups of 100 or whatever.
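One way to sketch the growing batch sizes, using the 5 / 20 / 100 figures from the paragraph above. The helper names batch_sizes and build_url are made up for illustration; only the n and c parameters and the reading-list URL come from this answer.

```python
from typing import Iterator, Optional, Sequence
from urllib.parse import urlencode

BASE_URL = "http://www.google.com/reader/atom/user/-/state/com.google/reading-list"


def batch_sizes(schedule: Sequence[int] = (5, 20, 100)) -> Iterator[int]:
    """Yield each size in the schedule once, then repeat the last one forever."""
    for n in schedule[:-1]:
        yield n
    while True:
        yield schedule[-1]


def build_url(n: int, continuation: Optional[str] = None) -> str:
    """Build the GET URL with the n (count) and optional c (continuation) parameters."""
    params = {"n": n}
    if continuation is not None:
        params["c"] = continuation
    return BASE_URL + "?" + urlencode(params)
```

The caller would take the next size from batch_sizes() for each request and stop as soon as a page contains an already-known item.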

However, and correct me if I'm wrong here, you also want to know, after you've downloaded an item, whether its state changes from "unread" to "read" due to the person reading it using the Google Reader interface.

One approach to this would be:

1. Update the status on Google of any items that have been read locally.
2. Check and save the unread count for the feed. (You want to do this before the next step, so that you guarantee that new items have not arrived between your download of the newest items and the time you check the read count.)
3. Download the latest items.
4. Calculate your read count, and compare that to Google's. If the feed has a higher read count than you calculated, you know that something's been read on Google.
5. If something has been read on Google, start downloading read items and comparing them with your database of unread items. You'll find some items that Google says are read that your database claims are unread; update these. Continue doing so until you've found a number of these items equal to the difference between your read count and Google's, or until the downloads get unreasonable.
6. If you didn't find all of the read items, c'est la vie; record the number remaining as an "unfound unread" total which you also need to include in your next calculation of the local number you think are unread.
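The comparison and scan in the last three steps might look something like the sketch below. It is written in terms of unread counts, and assumes the local unread set and Google's unread count cover the same items; the function name, the max_scan cut-off and the previously_unfound parameter are all illustrative choices, not part of any API.

```python
from typing import Iterable, Set, Tuple


def reconcile_read_items(local_unread: Set[str],
                         google_unread_count: int,
                         google_read_ids: Iterable[str],
                         previously_unfound: int = 0,
                         max_scan: int = 500) -> Tuple[Set[str], int]:
    """Return (ids to mark read locally, number of read items still unfound)."""
    # Items we think are unread but Google does not count as unread.
    missing = len(local_unread) - google_unread_count
    if missing <= previously_unfound:
        # Nothing new appears to have been read on Google since the last sync.
        return set(), max(missing, 0)
    found: Set[str] = set()
    for scanned, item_id in enumerate(google_read_ids):
        # Stop when the gap is explained or the download gets unreasonable.
        if scanned >= max_scan or len(found) >= missing:
            break
        if item_id in local_unread:
            found.add(item_id)
    return found, missing - len(found)
```

Here google_read_ids can be a lazy generator over downloaded read items, so the scan stops fetching as soon as the difference is accounted for. The second return value is the "unfound" total to carry into the next sync.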

If the user subscribes to a lot of different blogs, it's also likely he labels them extensively, so you can do this whole thing on a per-label basis rather than for the entire feed. That should help keep the amount of data down, since you won't need to do any transfers for labels where the user didn't read anything new on Google Reader.

This whole scheme can be applied to other statuses, such as starred or unstarred, as well.

Now, as you say, this

...would mean that I need to keep my own read/unread state on the client and that the entries are already marked as read when the user logs on to the online version of Google Reader. That doesn't work for me.

True enough. Neither keeping a local read/unread state (since you're keeping a database of all of the items anyway) nor marking items read in Google (which the API supports) seems very difficult, so why doesn't this work for you?

There is one further hitch, however: the user may mark something read as unread on Google. This throws a bit of a wrench into the system. My suggestion there, if you really want to try to take care of this, is to assume that the user in general will be touching only more recent stuff, and download the latest couple hundred or so items every time, checking the status on all of them. (This isn't all that bad; downloading 100 items took me anywhere from 0.3s for 300KB, to 2.5s for 2.5MB, albeit on a very fast broadband connection.)
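Checking the status of the latest items amounts to a simple comparison once they are downloaded. A minimal sketch, assuming both sides can be reduced to a mapping from item id to a read flag (the function name and the dict shape are illustrative):

```python
from typing import Dict


def status_changes(local: Dict[str, bool], remote: Dict[str, bool]) -> Dict[str, bool]:
    """Return {item_id: new_is_read} for items whose read state differs remotely."""
    return {
        item_id: is_read
        for item_id, is_read in remote.items()
        # Items not yet stored locally are handled by the "new stuff" sync instead.
        if item_id in local and local[item_id] != is_read
    }
```

This catches changes in both directions, including items marked unread again, but only for items that fall inside the downloaded window.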

Again, if the user has a large number of subscriptions, he's also probably got a reasonably large number of labels, so doing this on a per-label basis will speed things up. I'd suggest, actually, that not only do you check on a per-label basis, but you also spread out the checks, checking a single label each minute rather than everything once every twenty minutes. You can also do this "big check" for status changes on older items less often than you do a "new stuff" check, perhaps once every few hours, if you want to keep bandwidth down.
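Spreading the checks out could be sketched as a rotation over labels, with the heavier status check enabled only on some passes. The name check_schedule and the full_check_every parameter are hypothetical; the caller is assumed to pull one entry per timer tick (for example, once a minute).

```python
from typing import Iterator, List, Tuple


def check_schedule(labels: List[str],
                   full_check_every: int = 6) -> Iterator[Tuple[str, bool]]:
    """Yield (label, do_full_status_check), one label per tick, in rotation."""
    if not labels:
        return
    rotation = 0
    while True:
        # Only every Nth pass over the labels does the expensive status check.
        full = rotation % full_check_every == 0
        for label in labels:
            yield label, full
        rotation += 1
```

With 20 labels and a one-minute tick, each label is still visited every twenty minutes, but the traffic is spread evenly instead of arriving in one burst.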

This is a bit of a bandwidth hog, mainly because you need to download the full article from Google merely to check the status. Unfortunately, I can't see any way around that in the API docs that we have available to us. My only real advice is to minimize the checking of status on non-new items.

Curt Sampson 2009-06-21 03:16:05

The problem is not that I can't specify a larger "n". I already use continuations, but the whole process of loading all the articles to find the changed few on the client side is highly inefficient. Please don't think of this as a one-time thing; think of it as happening every 20 minutes.

Mariano Kamp 2009-06-21 13:01:39

Getting rid of this inefficiency is the reason I asked this question in the first place. As this approach just doesn't scale, I had to artificially restrict NewsRob to 500 articles. I would really like to lift this limitation, but that would require knowledge of this unofficial API and a way to let the filtering happen on the server side. I explained this in more detail here: http://groups.google.com/group/newsrob/browse_thread/thread/8ac557e66833ca61

Mariano Kamp 2009-06-21 13:03:44

I absolutely am thinking about this happening every twenty minutes. My understanding was that you initially download all (or lots of) the items, and after that you want the new ones that have appeared since. This will do that, since the new ones will always be first, and you just go through until you hit one you've seen. If I've answered the wrong question, perhaps you can clarify yours? Feel free to e-mail me if you don't want to get into a big discussion here.

Curt Sampson 2009-06-21 13:12:59

Ah, the issue is this: "an item that you mark as read in the Google Reader web interface would even after a sync remain unread in NewsRob."

Curt Sampson 2009-06-21 13:16:00

Phew! Ok, I may have found a way to help you out here, if I now understand your real problem. It's not perfect, but it should be a lot more efficient than just downloading everything. Let me know if that is what you're looking for.

Curt Sampson 2009-06-21 13:37:30
