Hi,

Based on a previous Stack Overflow question and a contribution by cgoldberg, I came up with this regex using the Python re module:

import re
urls = re.finditer('http://(.*?).mp3', htmlcode)

The variable urls is an iterable object, and I can use a loop to access each mp3 file URL individually if there is more than one:

for url in urls:
    mp3fileurl = url.group(0)

This technique, however, only works sometimes. I realize regular expressions will not be as reliable as a fully fledged parser module, but sometimes the results are unreliable even for the same page.

For some URL entries, the match also includes everything that comes before the http.

I am relatively new to regular expressions. So, I am just wondering if there is a more reliable way to go about it.

Thanks in advance. I am new to Stack Overflow and looking forward to contributing some answers as well.

+2  A: 

First, yeah, you should probably be using an HTML parser. Here's some sample code using the HTMLParser module that comes with Python:

from HTMLParser import HTMLParser

class ImgSrcHTMLParser(HTMLParser):
  def __init__(self):
    HTMLParser.__init__(self)
    self.srcs = []

  def handle_starttag(self, tag, attrs):
    if tag == 'img':
      self.srcs.append(dict(attrs).get('src'))

parser = ImgSrcHTMLParser()
parser.feed(htmlcode)  # htmlcode is the page source from the question
for src in parser.srcs:
  print src

This collects the src from img tags. It should be pretty easy to adapt it to your purposes assuming you want the href of 'a' tags that end in '.mp3'.
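As a rough sketch of that adaptation, the parser below collects the href of 'a' tags ending in '.mp3'. It is written for Python 3, where the module lives at html.parser; the class name Mp3LinkParser, the mp3_urls attribute, and the sample HTML are made up for illustration and are not part of the original answer.

```python
from html.parser import HTMLParser  # Python 3 location of the HTMLParser class


class Mp3LinkParser(HTMLParser):
    """Collect the href of every <a> tag whose href ends in '.mp3'."""

    def __init__(self):
        super().__init__()
        self.mp3_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href')  # None when the tag has no href
        if href and href.lower().endswith('.mp3'):
            self.mp3_urls.append(href)


# Hypothetical sample input standing in for the question's htmlcode.
sample_html = '''
<a href="http://foo/bar.gif">image</a>
<a href="http://baz/quux.mp3">song</a>
<a name="anchor-without-href">nothing</a>
'''

parser = Mp3LinkParser()
parser.feed(sample_html)
for url in parser.mp3_urls:
    print(url)
```

The check on href being non-empty matters because anchors without an href would otherwise raise an AttributeError on endswith.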

Assuming you really want to use a regex, there are some issues with yours. You aren't delimiting the URL, and you're using dot (which matches any character) inside the URL. The worst side effect of this is that a non-mp3 URL followed by an mp3 URL will be treated as one long URL, e.g. "http://foo/bar.gif snarf snarf http://baz/quux.mp3". You probably want to require some kind of delimiter (spaces, quotes, depending on what you're doing) and disallow some characters inside URLs (probably the same delimiter characters and/or any characters that aren't allowed in URLs). Also, you forgot to escape the "." in ".mp3", so "http://foo/mp3icon.gif" will match as "http://foo/mp3".
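Here is one possible way to apply those fixes, shown next to the original pattern for comparison. The set of disallowed characters (whitespace, quotes, angle brackets) is an assumption about what delimits URLs in your page, and the names MP3_URL, sample, naive, and strict are only for illustration.

```python
import re

# Assumed delimiters: whitespace, quotes and angle brackets cannot appear
# inside a URL. The dot before "mp3" is escaped, and the lookahead requires
# a delimiter (or end of input) right after ".mp3".
MP3_URL = re.compile(r'''http://[^\s"'<>]+?\.mp3(?=[\s"'<>]|$)''')

# Hypothetical text containing the two problem cases described above.
sample = ('see http://foo/bar.gif snarf snarf "http://baz/quux.mp3" '
          'and http://foo/mp3icon.gif')

# Original pattern: runs across the .gif URL and also matches "mp3icon".
naive = [m.group(0) for m in re.finditer('http://(.*?).mp3', sample)]

# Stricter pattern: only the real mp3 URL survives.
strict = MP3_URL.findall(sample)

print(naive)
print(strict)
```

This only covers http:// links, like the original pattern, and it is still a heuristic over raw text rather than a substitute for parsing the HTML.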

Laurence Gonsalves
Thanks Laurence. This clears up a few things. I will give regular expressions a couple more tries (simply to learn how to properly use them) before going the parser route. But, you described the problem quite accurately: a non-mp3 URL followed by an mp3 URL will be treated as one long URL.
Ben Hast
+1  A: 

As always, I suggest using an HTML parser like lxml.html instead of regular expressions to extract information from HTML files:

import lxml.html

tree = lxml.html.fromstring(htmlcode)
for link in tree.findall(".//a"):
    url = link.get("href")
    if url and url.endswith(".mp3"):  # href is None when the attribute is missing
        print url
Peter Hoffmann
Thanks Peter. I personally am a fan of lxml and what Ian Bicking has been doing too.
Ben Hast
Just a minor caveat to Peter's answer: the endswith method (a string method) cannot be called on url when it is not a string, which can happen when link.get("href") returns None for an anchor without an href. Converting it first, i.e. str(url), avoids the error.
Ben Hast
+3  A: 

As pointed out by the other answers, using regular expressions to parse HTML = bad, bad idea.

With that in mind, I will add code for my favorite parser, BeautifulSoup:

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(htmlcode)
links = soup.findAll('a', href=True)
mp3s = [l for l in links if l['href'].endswith('.mp3')]
for song in mp3s:
    print song['href']
Paolo Bergantino
Thanks Paolo. I used to resort to Beautiful Soup all the time before migrating to lxml. I'm surprised to see that it requires only the same number of lines as lxml in this case.
Ben Hast
that should be `for l in mp3s: print l['href']`
Aaron Moodie