Is there any way to read a collection of extension elements with Universal Feed Parser?

Here is a short snippet from the Kuler RSS feed:

<channel>
  <item>
    <!-- snip: regular RSS elements -->
    <kuler:themeItem>
      <kuler:themeID>123456</kuler:themeID>
      <!-- snip -->
      <kuler:themeSwatches>
        <kuler:swatch>
          <kuler:swatchHexColor>FFFFFF</kuler:swatchHexColor>
          <!-- snip -->
        </kuler:swatch>
        <kuler:swatch>
          <kuler:swatchHexColor>000000</kuler:swatchHexColor>
          <!-- snip -->
        </kuler:swatch>
      </kuler:themeSwatches>
    </kuler:themeItem>
  </item>
</channel>

I tried the following:

>>> feed = feedparser.parse(url)
>>> feed.channel.title
u'kuler highest rated themes'
>>> feed.entries[0].title
u'Foobar'
>>> feed.entries[0].kuler_themeid
u'123456'
>>> feed.entries[0].kuler_swatch
u''

feed.entries[0].kuler_swatchhexcolor returns only the last kuler:swatchHexColor element. Is there any way to retrieve all of the elements with feedparser?

I have already worked around the issue by using minidom, but I would like to use Universal Feed Parser if possible (due to its very simple API). Can it be extended? I haven't found anything about that in the documentation, so if someone has more experience with the library, please advise me.
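For reference, a minidom workaround along these lines does the job. This is only a sketch of the approach, not the exact code I used; the namespace declaration and variable names are illustrative:

```python
from xml.dom import minidom

# Illustrative feed fragment; the kuler namespace URI is a placeholder,
# not necessarily the one declared in the real feed.
xml_data = """<?xml version="1.0"?>
<rss version="2.0" xmlns:kuler="http://kuler.adobe.com/kuler/API/rss/">
<channel>
  <item>
    <kuler:themeItem>
      <kuler:themeID>123456</kuler:themeID>
      <kuler:themeSwatches>
        <kuler:swatch>
          <kuler:swatchHexColor>FFFFFF</kuler:swatchHexColor>
        </kuler:swatch>
        <kuler:swatch>
          <kuler:swatchHexColor>000000</kuler:swatchHexColor>
        </kuler:swatch>
      </kuler:themeSwatches>
    </kuler:themeItem>
  </item>
</channel>
</rss>"""

dom = minidom.parseString(xml_data)

# minidom's default parser is not namespace-aware, so the prefixed tag
# name matches literally -- this collects every swatch color in order.
colors = [node.firstChild.data
          for node in dom.getElementsByTagName('kuler:swatchHexColor')]
print(colors)  # ['FFFFFF', '000000']
```

It works, but compared to feedparser's attribute-style access it is quite verbose, which is why I would prefer to stay with Universal Feed Parser.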

+2  A: 

Universal Feed Parser is really nice for most feeds, but for extended feeds you might wanna try something called BeautifulSoup. It's an XML/HTML/XHTML parsing library originally designed for screen scraping; it turns out it's also brilliant for this sort of thing. The documentation is pretty good, and it's got a self-explanatory API, so if you're thinking of using anything else, that's what I'd recommend.

I'd probably use it like this:

>>> import BeautifulSoup
>>> import urllib2

# Fetch the feed XML from the URL
>>> connection = urllib2.urlopen('http://kuler.adobe.com/path/to/rss.xml')
>>> xml_data = connection.read()
>>> connection.close()

# Create and search the soup
>>> soup = BeautifulSoup.BeautifulSoup(xml_data)
>>> themes = soup.findAll('kuler:themeitem') # Note: all lower-case element names

# Get the ID of the first theme
>>> themes[0].find('kuler:themeid').contents[0]
u'123456'

# Get an ordered list of the hex colors for the first theme
>>> themeswatches = themes[0].find('kuler:themeswatches')
>>> colors = [color.contents[0] for color in
... themeswatches.findAll('kuler:swatchhexcolor')]
>>> colors
[u'FFFFFF', u'000000']

As you can see, it's a very handy library. This approach wouldn't be robust against arbitrary RSS feeds, but since the data comes from Adobe Kuler, you can be pretty sure the markup isn't going to vary enough to break your app (i.e. it's a trusted enough source).
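If you'd rather stick to the standard library, xml.etree.ElementTree handles the same extraction with namespace-aware lookups. This is just a sketch; the kuler namespace URI below is a placeholder and should be read from the xmlns declaration in the actual feed:

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI -- substitute the one the real feed declares.
KULER = 'http://kuler.adobe.com/kuler/API/rss/'

xml_data = """<rss version="2.0" xmlns:kuler="http://kuler.adobe.com/kuler/API/rss/">
<channel><item>
  <kuler:themeItem>
    <kuler:themeID>123456</kuler:themeID>
    <kuler:themeSwatches>
      <kuler:swatch><kuler:swatchHexColor>FFFFFF</kuler:swatchHexColor></kuler:swatch>
      <kuler:swatch><kuler:swatchHexColor>000000</kuler:swatchHexColor></kuler:swatch>
    </kuler:themeSwatches>
  </kuler:themeItem>
</item></channel></rss>"""

root = ET.fromstring(xml_data)

# ElementTree expands prefixes to {namespace-uri}localname, so iterate
# over the fully-qualified tag to collect every swatch color in order.
colors = [el.text for el in root.iter('{%s}swatchHexColor' % KULER)]
print(colors)  # ['FFFFFF', '000000']
```

It's not quite as forgiving as BeautifulSoup about sloppy markup, but for a well-formed feed it avoids the extra dependency.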

Even worse is trying to parse Adobe's goddamn .ASE format. I tried writing a parser for it and it got really horrible, really quickly. Ugh. So, yeah, the RSS feeds are probably the easiest way of interfacing with Kuler.

zvoase
Thanks, I'll check that too. The API seems a bit easier than it is with minidom: I'd choose find/findAll vs. getElementsByTagName any day :)
Damir Zekić