ansaurus

Question

I need a regex for the href attribute for an mp3 file url in python

Answer 1

+2 A:

First, yeah, you should probably be using an HTML parser. Here's some sample code using the HTMLParser module that comes with Python:

from HTMLParser import HTMLParser

class ImgSrcHTMLParser(HTMLParser):
  def __init__(self):
    HTMLParser.__init__(self)
    self.srcs = []

  def handle_starttag(self, tag, attrs):
    if tag == 'img':
      self.srcs.append(dict(attrs).get('src'))

parser = ImgSrcHTMLParser()
parser.feed(html)
for src in parser.srcs:
  print src

This collects the src from img tags. It should be pretty easy to adapt it to your purposes assuming you want the href of 'a' tags that end in '.mp3'.

Assuming you really want to use a regex, there are some issues with your regex. You aren't delimiting the URL and you're using dot inside the URL. The worst side-effect of this is that a non-mp3 URL followed by an mp3-URL will be treated as one long URL. eg: "http://foo/bar.gif snarf snarf http://baz/quux.mp3". You probably want to require some kind of delimiter (spaces, quotes, depends on what you're doing) and disallow some characters inside URLs (probably the same characters and/or any characters that aren't allowed in URLs). Also, you forgot to escape the "." in ".mp3". So "http://foo/mp3icon.gif" will match as "http://foo/mp3".

Laurence Gonsalves 2009-05-04 22:12:26

Thanks Laurence. This clears up a few things. I will give regular expressions a couple more tries (simply to learn how to properly use them) before going the parser route. But, you described the problem quite accurately: a non-mp3 URL followed by an mp3 URL will be treated as one long URL.

Ben Hast 2009-05-04 22:33:29

Answer 2

+1 A:

As always I suggest using a html parser like lxml.html instead of regular expressions to extract informations from html files:

import lxml.html

tree = lxml.html.fromstring(htmlcode)
for link in tree.findall(".//a"):
    url = link.get("href")
    if url.endswith(".mp3"):
        print url

Peter Hoffmann 2009-05-04 22:31:59

Thanks Peter. I personally am a fan of lxml and what Ian Bicking has been doing too.

Ben Hast 2009-05-04 22:48:18

Just a minor caveat to Peter's answer. The variable url is an object on which the endswith method (a string method) can not be declared. Simply convert the url into a string format, i.e. str(url), in order to use the endswith method

Ben Hast 2009-05-05 09:31:51

Answer 3

+3 A:

As pointed out by the other answers, using regular expressions to parse HTML = bad, bad idea.

With that in mind, I will add in code of my favorite parser: BeautifulSoup:

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(htmlcode)
links = soup.findAll('a', href=True)
mp3s = [l for l in links if l['href'].endswith('.mp3')]
for song in mp3s:
    print link['href']

Paolo Bergantino 2009-05-04 22:56:19

Thanks Paolo. I used to resort to Beautiful Soup all the time before migrating to lxml. Surprised to see it require only the same amount of lines as lxml in this case.

Ben Hast 2009-05-05 09:24:16

that should be `for l in mp3s: print l['href']`

Aaron Moodie 2010-05-18 14:06:27

ansaurus

tags:

views:

answers:

I need a regex for the href attribute for an mp3 file url in python

related questions