I need to determine programmatically whether an RSS feed exposes the full content of its articles or only excerpts of them. How would you do it?

+1  A: 

Look for a link at the end of each item that says "More", "Continued", "Full article", "..." or similar. The alternative is to follow every link in the item and check whether the target page contains the text from the feed plus extra text.

Garry Shutler
+2  A: 

I don't think there is a very clean way of doing this, but here are two "hacky" ones:

I'd parse each item's text and look for any links in it. Granted, there could be multiple links (some to other blog posts), but if you focus on the last one and come up with a few heuristic words for the link's text (e.g. "more", "read full"), you should be able to catch a lot of them. For more confidence, you can look only at links that point back to the original blog.
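The last-link heuristic above could be sketched roughly as follows, using only Python's standard library. The names `MORE_HINTS`, `LinkCollector` and `looks_truncated` are made up for illustration, the hint list is a guess rather than a tested vocabulary, and `site_url` is assumed to be the blog's own address (for example the feed's channel link).

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical hint phrases; extend them for the feeds you care about.
MORE_HINTS = ("more", "continue", "read full", "full article", "...", "\u2026")


class LinkCollector(HTMLParser):
    """Collects (href, link text) pairs from an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def looks_truncated(item_html, site_url):
    """Guess whether a feed item ends with a 'read more' style link."""
    parser = LinkCollector()
    parser.feed(item_html)
    parser.close()
    if not parser.links:
        return False
    href, text = parser.links[-1]  # focus on the last link only
    link_host = urlparse(href).netloc
    # Relative links (empty host) are treated as pointing back to the blog.
    same_site = link_host in ("", urlparse(site_url).netloc)
    has_hint = any(hint in text.lower() for hint in MORE_HINTS)
    return same_site and has_hint
```

Substring matching is deliberately loose here, so a link titled "more photos" would also count; tightening it is a trade-off between missed and false detections.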

A more rigorous method is to follow all the links and check whether the RSS fragment is a subset of the page that comes back, or at least overlaps with it substantially. This may not help when the site uses a true summary rather than a fragment of the full post, though.
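One way to approximate the "subset or substantial overlap" test is to compare word shingles (runs of k consecutive words) and measure what fraction of the feed item's shingles also occur on the linked page. This is only a sketch: it assumes the page HTML has already been downloaded, and `TextExtractor`, `visible_words`, `shingles` and `containment` are illustrative names, not part of any feed library.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def visible_words(html):
    """Return the lowercased words of the visible text in an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.findall(r"\w+", " ".join(parser.parts).lower())


def shingles(words, k=3):
    """Return the set of all runs of k consecutive words."""
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def containment(feed_html, page_html, k=3):
    """Fraction of the feed item's k-word shingles that also occur on the page."""
    feed = shingles(visible_words(feed_html), k)
    if not feed:
        return 0.0
    page = shingles(visible_words(page_html), k)
    return len(feed & page) / len(feed)
```

A value near 1.0 suggests the feed text was copied from the page (as an excerpt or as the full post), while a value near 0 is what a hand-written summary would produce, which is the case this method cannot resolve.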

Jean Barmash
A: 

Why not follow the URL from the RSS feed and check whether that page has more text than the feed item? You would need an HTML parser and a few general rules.
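A rough sketch of that comparison follows, with one example of a "general rule": on the page, only words inside `<p>` elements are counted, so that navigation and sidebars inflate the total less. The names `WordCounter`, `count_words` and `page_has_more_text`, and the `min_ratio` threshold of 1.5, are assumptions for illustration. The page HTML is assumed to be fetched already and to close its `<p>` tags.

```python
from html.parser import HTMLParser


class WordCounter(HTMLParser):
    """Counts words, optionally only those inside <p> elements."""

    def __init__(self, only_paragraphs):
        super().__init__()
        self.only_paragraphs = only_paragraphs
        self.words = 0
        self._p_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._p_depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self._p_depth:
            self._p_depth -= 1

    def handle_data(self, data):
        if self._p_depth or not self.only_paragraphs:
            self.words += len(data.split())


def count_words(html, only_paragraphs=False):
    counter = WordCounter(only_paragraphs)
    counter.feed(html)
    counter.close()
    return counter.words


def page_has_more_text(feed_html, page_html, min_ratio=1.5):
    """True if the page's paragraph text is clearly longer than the feed item."""
    feed_words = count_words(feed_html)  # the whole item counts as content
    page_words = count_words(page_html, only_paragraphs=True)
    if feed_words == 0:
        return page_words > 0
    return page_words >= min_ratio * feed_words
```

Comments rendered inside `<p>` tags would still be counted as article text, so the threshold likely needs tuning per site.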

theomega
I think this might slow down the app's presentation, since it would have to wait for additional network requests.
Kevin Elliott