ansaurus

Question

Regular expression to extract URL from an HTML link

Answer 1

+13 A:

Don't use regexes, use BeautifulSoup. That, or be so crufty as to spawn it out to, say, w3m/lynx and pull back in what w3m/lynx renders. First is more elegant probably, second just worked a heck of a lot faster on some unoptimized code I wrote a while back.

JosefAssad 2009-01-31 19:13:16

Answer 2

A:

this should work, although there might be more elegant ways.

import re
url='<a href="http://www.ptop.se" target="_blank">http://www.ptop.se&lt;/a&gt;'
r = re.compile('(?<=href=").*?(?=")')
r.findall(url)

2009-01-31 19:16:03

Answer 3

+6 A:

If you're only looking for one:

import re
match = re.search(r'href=[\'"]?([^\'" >]+)', s)
if match:
    print match.group(0)

If you have a long string, and want every instance of the pattern in it:

import re
urls = re.findall(r'href=[\'"]?([^\'" >]+)', s)
print ', '.join(urls)

Where s is the string that you're looking for matches in.

Quick explanation of the regexp bits:

r'...' is a "raw" string. It stops you having to worry about escaping characters quite as much as you normally would. (\ especially -- in a raw string a \ is just a \. In a regular string you'd have to do \\ every time, and that gets old in regexps.)

"href=[\'"]?" says to match "href=", possibly followed by a ' or ". "Possibly" because it's hard to say how horrible the HTML you're looking at is, and the quotes aren't strictly required.

Enclosing the next bit in "()" says to make it a "group", which means to split it out and return it separately to us. It's just a way to say "this is the part of the pattern I'm interested in."

"[^\'" >]+" says to match any characters that aren't ', ", >, or a space. Essentially this is a list of characters that are an end to the URL. It lets us avoid trying to write a regexp that reliably matches a full URL, which can be a bit complicated.

The suggestion in another answer to use BeautifulSoup isn't bad, but it does introduce a higher level of external requirements. Plus it doesn't help you in your stated goal of learning regexps, which I'd assume this specific html-parsing project is just a part of.

It's pretty easy to do:

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(html_to_parse)
for tag in soup.findAll('a', href=True):
    print tag['href']

Once you've installed BeautifulSoup, anyway.

David 2009-01-31 19:17:06

Part of learning regexes is learning when not to use them, this is a case where you shouldn't use them.

Chas. Owens 2009-05-13 14:29:33

Answer 4

A:

There's tonnes of them on regexlib

Chris S 2009-01-31 19:34:19

Answer 5

+1 A:

Yes, there are tons of them on regexlib. That only proves that RE's should not be used to do that. Use SGMLParser or BeautifulSoup or write a parser - but don't use RE's. The ones that seems to work are extremely compliated and still don't cover all cases.

2009-05-13 14:22:29

Answer 6

+2 A:

Regexes are fundamentally bad at parsing HTML (see Can you provide some examples of why it is hard to parse XML and HTML with a regex? for why). What you need is an HTML parser. See Can you provide an example of parsing HTML with your favorite parser? for examples using a variety of parsers.

In particular you will want to look at the Python answers: BeautifulSoup, HTMLParser, and lxml.

Chas. Owens 2009-05-13 14:38:27

Answer 7

+1 A:

John Gruber (who wrote Markdown, which is made of regular expressions and is used right here on Stack Overflow) had a go at producing a regular expression that recognises URLs in text:

http://daringfireball.net/2009/11/liberal_regex_for_matching_urls

If you just want to grab the URL (i.e. you’re not really trying to parse the HTML), this might be more lightweight than an HTML parser.

Paul D. Waite 2009-11-27 23:37:54

Answer 8

A:

This improved version should work as reliably as a parser.

   // Applies to URI, not just URL or URN:
   //    http://en.wikipedia.org/wiki/Uniform_Resource_Identifier#Relationship_to_URL_and_URN
   //
   // http://labs.apache.org/webarch/uri/rfc/rfc3986.html#regexp
   //
   // (?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?
   //
   // http://en.wikipedia.org/wiki/URI_scheme#Generic_syntax
   //
   // $@ matches the entire uri
   // $1 matches scheme (ftp, http, mailto, mshelp, ymsgr, etc)
   // $2 matches authority (host, user:pwd@host, etc)
   // $3 matches path
   // $4 matches query (http GET REST api, etc)
   // $5 matches fragment (html anchor, etc)
   //
   // Match specific schemes, non-optional authority, disallow white-space so can delimit in text, and allow 'www.' w/o scheme
   // Note the schemes must match ^[^\s|:/?#]+(?:\|[^\s|:/?#]+)*$
   //
   // (?:()(www\.[^\s/?#]+\.[^\s/?#]+)|(schemes)://([^\s/?#]*))([^\s?#]*)(?:\?([^\s#]*))?(#(\S*))?
   //
   // Validate the authority with an orthogonal RegExp, so the RegExp above won’t fail to match any valid urls.
   function uriRegExp( flags, schemes/* = null*/, noSubMatches/* = false*/ )
   {
      if( !schemes )
         schemes = '[^\\s:\/?#]+'
      else if( !RegExp( /^[^\s|:\/?#]+(?:\|[^\s|:\/?#]+)*$/ ).test( schemes ) )
         throw TypeError( 'expected URI schemes' )
      return noSubMatches ? new RegExp( '(?:www\\.[^\\s/?#]+\\.[^\\s/?#]+|' + schemes + '://[^\\s/?#]*)[^\\s?#]*(?:\\?[^\\s#]*)?(?:#\\S*)?', flags ) :
         new RegExp( '(?:()(www\\.[^\\s/?#]+\\.[^\\s/?#]+)|(' + schemes + ')://([^\\s/?#]*))([^\\s?#]*)(?:\\?([^\\s#]*))?(?:#(\\S*))?', flags )
   }

   // http://en.wikipedia.org/wiki/URI_scheme#Official_IANA-registered_schemes
   function uriSchemesRegExp()
   {
      return 'about|callto|ftp|gtalk|http|https|irc|ircs|javascript|mailto|mshelp|sftp|ssh|steam|tel|view-source|ymsgr'
   }

Shelby Moore 2010-09-16 07:30:49

ansaurus

tags:

views:

answers:

Regular expression to extract URL from an HTML link

related questions