ansaurus

Question

RSS screen scraper

Answer 1

+1 A:

There's a good list of them here, which mentions Feed Parser, which you use like this:

import feedparser

# URL of the Python wiki's "recent changes" RSS feed
python_wiki_rss_url = ("http://www.python.org/cgi-bin/moinmoin/"
                       "RecentChanges?action=rss_rc")

# Download and parse the feed
feed = feedparser.parse(python_wiki_rss_url)

You can then do things like:

for item in feed["items"]:
    print(item["title"])

Dominic Rodger 2010-03-02 09:34:43

+1 for feedparser

S.Mark 2010-03-02 09:39:10

He was asking for a partial to full feed converter in python, not a parser.

Recursion 2010-03-02 09:47:34

Answer 2

+1 A:

feedparser.org is great

S.Mark 2010-03-02 09:35:29

+1, and to you, sir: I think you got there a bit before me. (My revision history doesn't show that I posted the first link, read it, saw Feed Parser introduced there, and then incorporated that into my post.)

Dominic Rodger 2010-03-02 09:40:55

He was asking for a partial to full feed converter in python, not a parser.

Recursion 2010-03-02 09:46:41

You're probably right, but that would be HTML scraping rather than RSS parsing, and it's completely **site dependent**. It could even violate a site's policy, so let's use the available RSS feeds :-)

S.Mark 2010-03-02 09:52:04

Answer 3

A:

Sorry, but that doesn't exist in Python, though such converters do exist in PHP. You are more than welcome to use and improve the one I made, named scraped. It does not handle all sites: it is a recipe-based system that currently handles only the NYT, WSJ and the Economist. I am working on an all-inclusive algorithm, but it's a major undertaking that involves a ton of analysis of the different types of HTML and XML. Even the three sites mentioned above need vastly different scraping algorithms, with WSJ being the most complex by far. They clutter their HTML with so much useless crap, mainly just to stop you.

Here is the program I was talking about. It requires lxml, but the readme explains everything. It reads the config files, parses partial RSS feeds, takes the links and scrapes them, and in the end produces an RSS 2.0 XML file, which I mainly convert into an ebook for my Kindle. I use lxml, BeautifulSoup and feedparser.

http://tinyurl.com/yh3s9pa

You can also look at the calibre project, which uses a similar recipe-based method.
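
As a rough illustration of the partial-to-full pipeline described above (read a partial feed, follow each item's link, scrape the article, write an RSS 2.0 file), here is a minimal sketch. It is not the code of the linked program. To stay self-contained it uses only the standard library's xml.etree.ElementTree instead of lxml, BeautifulSoup and feedparser, and it assumes the input is a well-formed RSS 2.0 document. The names build_full_feed, fetch_page and extract_article are hypothetical: fetch_page stands for whatever downloads a page, and extract_article stands for the site-specific recipe.

```python
import xml.etree.ElementTree as ET


def build_full_feed(rss_text, fetch_page, extract_article):
    """Return an RSS 2.0 string whose item descriptions hold full articles.

    rss_text        -- the partial feed, as a string (assumed valid RSS 2.0)
    fetch_page      -- hypothetical callable: url -> HTML text of that page
    extract_article -- hypothetical per-site recipe: (url, html) -> article text
    """
    src_channel = ET.fromstring(rss_text).find("channel")

    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")

    # Carry the channel metadata over from the partial feed.
    for tag in ("title", "link", "description"):
        ET.SubElement(channel, tag).text = src_channel.findtext(tag, "")

    # For each teaser item, fetch the linked page and substitute the
    # scraped article text for the original short description.
    for src_item in src_channel.findall("item"):
        link = src_item.findtext("link", "")
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = src_item.findtext("title", "")
        ET.SubElement(item, "link").text = link
        ET.SubElement(item, "description").text = extract_article(
            link, fetch_page(link)
        )

    return ET.tostring(rss, encoding="unicode")
```

In practice fetch_page would wrap an HTTP download and extract_article would dispatch to one recipe per site, which is where the site-dependent work lives. Passing both in as arguments keeps the feed-rewriting step independent of any particular site.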

Recursion 2010-03-02 09:43:45
