views: 622
answers: 4

I have an xml feed, say:

http://gdata.youtube.com/feeds/api/videos/-/bass/fishing/

I want to get the list of hrefs for the videos:

 ['http://www.youtube.com/watch?v=aJvVkBcbFFY', 'ht....', ... ]
+1  A: 
import urllib
from xml.dom import minidom
xmldoc = minidom.parse(urllib.urlopen('http://gdata.youtube.com/feeds/api/videos/-/bass/fishing/'))

links = xmldoc.getElementsByTagName('link')
hrefs = []
for link in links:
    if link.getAttribute('rel') == 'alternate':
        hrefs.append(link.getAttribute('href'))

hrefs
meder
Thanks for helping to get me going, but this returns strange links like: http://m.youtube.com/watch?v=UIi-fANCngQ (it should be for link in links above, too)
skyl
Updated. You can do regex matching inside the inner if statement if you want a certain pattern, or do DOM operations on whatever attributes the link element should have.
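For example, restricting the results to the standard watch URLs could look something like this (just a sketch, reusing the hrefs list from the answer above; the exact pattern is an assumption):

import re
watch_url = re.compile(r'^http://www\.youtube\.com/watch\?v=')
# keep only hrefs that point at the desktop watch pages
hrefs = [h for h in hrefs if watch_url.match(h)]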
meder
isn't urllib deprecated?
jldupont
I'm confused about its status as well. urllib2 is more often used by advanced Python 2.x users; however, in Python 3 the functionality lives under urllib (urllib.request)...
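For what it's worth, a version-agnostic import is easy to write if you want to hedge your bets (just a sketch):

try:
    from urllib.request import urlopen  # Python 3: urlopen moved into urllib.request
except ImportError:
    from urllib2 import urlopen  # Python 2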
meder
+3  A: 

Have a look at Universal Feed Parser, which is an open source RSS and Atom feed parser for Python.
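Extracting the hrefs with it takes only a couple of lines; a minimal sketch (assuming the feedparser package is installed) might look like:

import feedparser

feed = feedparser.parse('http://gdata.youtube.com/feeds/api/videos/-/bass/fishing/')
# each entry's .link attribute is its 'alternate' (watch page) href
hrefs = [entry.link for entry in feed.entries]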

Tim S. Van Haren
doh - nice find, hehe
meder
+3  A: 

In such a simple case, this should be enough:

import re, urllib2
response = urllib2.urlopen("http://gdata.youtube.com/feeds/api/videos/-/bass/fishing/")
text = response.read()
# every watch URL that appears anywhere in the raw feed text
videos = re.findall(r"http://www\.youtube\.com/watch\?v=[\w-]+", text)
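The same watch URL can appear more than once in the raw feed text, so you may want to deduplicate, e.g.:

videos = list(set(videos))  # note: this does not preserve the feed order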

If you want to do more complicated stuff, parsing the XML will be better suited than regular expressions.

piquadrat
Nice, that does indeed return the list I'm looking for.
skyl
+5  A: 
from xml.etree import cElementTree as ET
import urllib

def get_bass_fishing_URLs():
  results = []
  data = urllib.urlopen(
      'http://gdata.youtube.com/feeds/api/videos/-/bass/fishing/')
  tree = ET.parse(data)
  ns = '{http://www.w3.org/2005/Atom}'  # Atom namespace used throughout the gdata feed
  for entry in tree.findall(ns + 'entry'):
    for link in entry.findall(ns + 'link'):
      if link.get('rel') == 'alternate':
        results.append(link.get('href'))
  return results

as it appears that what you get are the so-called "alternate" links. The many small possible variations, if you want something slightly different, should, I hope, be clear from the above code (plus the standard Python library docs for ElementTree).
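For example, if you also needed each entry's <published> timestamp, a small variation along the same lines (reusing the imports above; the function name is just for illustration) could be:

def get_bass_fishing_URLs_and_dates():
  results = []
  data = urllib.urlopen(
      'http://gdata.youtube.com/feeds/api/videos/-/bass/fishing/')
  tree = ET.parse(data)
  ns = '{http://www.w3.org/2005/Atom}'
  for entry in tree.findall(ns + 'entry'):
    published = entry.findtext(ns + 'published')  # ISO 8601 timestamp string
    for link in entry.findall(ns + 'link'):
      if link.get('rel') == 'alternate':
        results.append((link.get('href'), published))
  return results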

Alex Martelli
nice - I forgot about ElementTree. yours is much cleaner than mine :)
meder
Thanks for getting me started with ET; your fully working example helped a lot. piquadrat gave me exactly what I asked for, but then it turned out that I needed the info in <published> as well, so this solution proved more adaptable in the wild.
skyl
@sos-sky, you're welcome -- it's always hard to strike the right balance between "giving a fish" and "teaching to fish", so I'm glad to hear that I struck it this time ;-).
Alex Martelli