views: 622
answers: 4

I have an xml feed, say:

http://gdata.youtube.com/feeds/api/videos/-/bass/fishing/

I want to get the list of hrefs for the videos:

 ['http://www.youtube.com/watch?v=aJvVkBcbFFY', 'ht....', ... ]
+1  A: 
import urllib
from xml.dom import minidom
xmldoc = minidom.parse(urllib.urlopen('http://gdata.youtube.com/feeds/api/videos/-/bass/fishing/'))

links = xmldoc.getElementsByTagName('link')
hrefs = []
for link in links:
    if link.getAttribute('rel') == 'alternate':
        hrefs.append(link.getAttribute('href'))

hrefs
meder
Thanks for helping to get me going, but this returns strange links like: http://m.youtube.com/watch?v=UIi-fANCngQ (it should be for link in links above, too)
skyl
Updated. You can do regex matching inside the inner if statement if you want a certain pattern, or do DOM operations on whatever attributes the link element should have.
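For example, restricting the results to the standard watch URLs could look something like this (just a sketch, reusing the hrefs list from the answer above; the exact pattern is an assumption):

import re
watch_url = re.compile(r'^http://www\.youtube\.com/watch\?v=')
# keep only hrefs that point at the desktop watch pages
hrefs = [h for h in hrefs if watch_url.match(h)]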
meder
isn't urllib deprecated?
jldupont
I'm confused about its status as well. urllib2 is more often used by advanced Python 2.x users; however, in Python 3 the functionality lives under urllib (urllib.request)...
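For what it's worth, a version-agnostic import is easy to write if you want to hedge your bets (just a sketch):

try:
    from urllib.request import urlopen  # Python 3: urlopen moved into urllib.request
except ImportError:
    from urllib2 import urlopen  # Python 2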
meder
+3  A: 

Have a look at Universal Feed Parser, which is an open source RSS and Atom feed parser for Python.
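Extracting the hrefs with it takes only a couple of lines; a minimal sketch (assuming the feedparser package is installed) might look like:

import feedparser

feed = feedparser.parse('http://gdata.youtube.com/feeds/api/videos/-/bass/fishing/')
# each entry's .link attribute is its 'alternate' (watch page) href
hrefs = [entry.link for entry in feed.entries]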

Tim S. Van Haren
doh - nice find, hehe
meder
+3  A: 

In such a simple case, this should be enough:

import re, urllib2
response = urllib2.urlopen("http://gdata.youtube.com/feeds/api/videos/-/bass/fishing/")
text = response.read()
# every watch URL that appears anywhere in the raw feed text
videos = re.findall(r"http://www\.youtube\.com/watch\?v=[\w-]+", text)
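The same watch URL can appear more than once in the raw feed text, so you may want to deduplicate, e.g.:

videos = list(set(videos))  # note: this does not preserve the feed order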

If you want to do more complicated stuff, parsing the XML will be better suited than regular expressions.

piquadrat
Nice, that does indeed return the list I'm looking for.
skyl
+5  A: 
from xml.etree import cElementTree as ET
import urllib

def get_bass_fishing_URLs():
  results = []
  data = urllib.urlopen(
      'http://gdata.youtube.com/feeds/api/videos/-/bass/fishing/')
  tree = ET.parse(data)
  ns = '{http://www.w3.org/2005/Atom}'  # Atom namespace used throughout the gdata feed
  for entry in tree.findall(ns + 'entry'):
    for link in entry.findall(ns + 'link'):
      if link.get('rel') == 'alternate':
        results.append(link.get('href'))
  return results

as it appears that what you get are the so-called "alternate" links. The many small possible variations, if you want something slightly different, should, I hope, be clear from the above code (plus the standard Python library docs for ElementTree).
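For example, if you also needed each entry's <published> timestamp, a small variation along the same lines (reusing the imports above; the function name is just for illustration) could be:

def get_bass_fishing_URLs_and_dates():
  results = []
  data = urllib.urlopen(
      'http://gdata.youtube.com/feeds/api/videos/-/bass/fishing/')
  tree = ET.parse(data)
  ns = '{http://www.w3.org/2005/Atom}'
  for entry in tree.findall(ns + 'entry'):
    published = entry.findtext(ns + 'published')  # ISO 8601 timestamp string
    for link in entry.findall(ns + 'link'):
      if link.get('rel') == 'alternate':
        results.append((link.get('href'), published))
  return results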

Alex Martelli
nice - I forgot about ElementTree. yours is much cleaner than mine :)
meder
Thanks for getting me started with ET; your fully working example helped a lot. piquadrat gave me exactly what I asked for, but then it turned out that I needed the info in <published> as well, so this solution proved more adaptable in the wild.
skyl
@sos-sky, you're welcome -- it's always hard to strike the right balance between "giving a fish" and "teaching to fish", so I'm glad to hear that I struck it this time ;-).
Alex Martelli