ansaurus

Question

Regular expression for extracting tag attributes

Answer 1

+1 A:

I suggest that you use HTML Tidy to convert the HTML to XHTML, and then use a suitable XPath expression to extract the attributes.
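
A minimal sketch of the XPath half of this idea, assuming the markup has already been converted to well-formed XHTML (the Tidy step itself is not shown). It uses Python's standard xml.etree.ElementTree, whose limited XPath subset is enough here; the sample document and the function name anchor_attributes are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, already-tidied XHTML; the HTML Tidy step is assumed to have run.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <p><a href="test.html" class="xyz">one</a></p>
    <p><a href="other.html">two</a></p>
  </body>
</html>"""

# XHTML elements live in this namespace, so the path has to name it.
NS = {"x": "http://www.w3.org/1999/xhtml"}


def anchor_attributes(xhtml_text):
    """Return one attribute dict per <a> element, in document order."""
    root = ET.fromstring(xhtml_text)
    return [dict(a.attrib) for a in root.findall(".//x:a", NS)]


print(anchor_attributes(XHTML))
```

The parser deals with quoting and entities, so the code only ever sees clean name/value pairs.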

divideandconquer.se 2008-11-25 11:27:38

Answer 2

+1 A:

If you want to be general, you have to look at the precise specification of the a tag, like here. But even if you write the perfect regexp against that specification, what happens when the HTML is malformed?

I would suggest using a library to parse the HTML, depending on the language you work with, e.g. Python's Beautiful Soup.
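
As a rough illustration of the library route, here is a sketch that uses Python's built-in html.parser as a stand-in for Beautiful Soup (which is a third-party install). The class name AnchorAttributes and the helper extract are invented for this example.

```python
from html.parser import HTMLParser


class AnchorAttributes(HTMLParser):
    """Collect the attributes of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs with quoting resolved;
        # tag and attribute names are lowercased by the parser.
        if tag == "a":
            self.found.append(dict(attrs))


def extract(html_text):
    """Return one attribute dict per anchor tag found in html_text."""
    parser = AnchorAttributes()
    parser.feed(html_text)
    parser.close()
    return parser.found


# Unquoted, single-quoted and double-quoted values, plus an unclosed <p>.
print(extract("<p><a href=test.html class='xyz'>x</a> <A HREF=\"b.html\">y</a>"))
```

The same input would need several special cases in a hand-written regexp; here the quoting styles and the upper-case tag are handled by the parser.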

Piotr Lesnicki 2008-11-25 11:30:20

Answer 3

+8 A:

Token Mantra response: you should not tweak/modify/harvest/or otherwise produce html/xml using regular expressions.

There are too many corner cases, such as \' and \", which must be accounted for. You are much better off using a proper DOM parser, XML parser, or one of the dozens of other tried and tested tools for this job instead of inventing your own.

I don't really care which one you use, as long as it's recognized, tested, and you use one.

# Someclass stands in for whichever parser you pick (e.g. XML::LibXML).
my $doc   = Someclass->parse( $xmlstring );
my @links = $doc->getElementsByTagName("a");
my @hrefs = map { $_->getAttribute("href") } @links;
# @hrefs now contains the href attributes extracted from the page.

Kent Fredric 2008-11-25 11:33:25

"corner case conditionals such as \' and \" which must be accounted for" ... you can't escape quotes in an HTML attribute. The only way to include them is to encode them as an entity, &quot;

nickf 2008-11-25 11:55:14

Yes, the HTML specification says you should entity-encode them. However, because people *use* backslashes, browsers adapt to make it work, so more people use them, and thus your parser must be able to handle it when they do :)

Kent Fredric 2009-01-17 02:28:48

Answer 4

+1 A:

If you're in .NET, I recommend the HTML Agility Pack; it is very robust even with malformed HTML.

Then you can use XPath.

Andrew Bullock 2008-11-25 11:36:27

Answer 5

+4 A:

If you have an element like

<name attribute=value attribute="value" attribute='value'>

this regex could be used to find successively each attribute name and value

(\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?

Applied on:

<a href=test.html class=xyz>
<a href="test.html" class="xyz">
<a href='test.html' class="xyz">

it would yield:

'href' => 'test.html'
'class' => 'xyz'
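
To try this out, here is a small sketch that runs the expression above, unchanged, through Python's re module. The helper name attributes is made up, and the last two inputs are there only to expose edge cases.

```python
import re

# The expression from this answer, verbatim, as a raw triple-quoted string.
ATTR = re.compile(r"""(\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?""")


def attributes(tag):
    """Return (name, value) pairs found in a single tag string."""
    return ATTR.findall(tag)


for tag in (
    "<a href=test.html class=xyz>",
    '<a href="test.html" class="xyz">',
    "<a href='test.html' class=\"xyz\">",
    "<a id=x>",                   # one-character value
    'foo="bar\' bar=\'bla"',      # quotes nested inside a quoted value
):
    print(tag, "->", attributes(tag))
```

The three sample tags each give the href and class pairs listed above. A one-character value is not matched at all, because the value part needs at least two characters, and a value containing the other kind of quote is cut at that quote.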

VonC 2008-11-25 11:37:43

What about “foo="bar' bar='bla"”?

Gumbo 2009-02-22 13:52:28

@Gumbo: this regexp should take into account single or double quotes, since it uses the character class ["']

VonC 2009-02-22 14:52:20

It could not, of course, manage quotes within an attribute value.

VonC 2009-02-22 14:54:13

I know. But the value of foo would be “bar' bar='bla” and not just “bar”.

Gumbo 2009-02-22 15:04:27

Answer 6

A:

I'd reconsider the strategy to use only a single regular expression. Sure it's a nice game to come up with one single regular expression that does it all. But in terms of maintainability you are about to shoot yourself in both feet.
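
One way to read that advice, sketched in Python: split the job into two small expressions, one that finds the tag and one that walks its attributes. Both patterns and the name anchor_attributes are assumptions made for this example, and the tag pattern still breaks if a ">" appears inside a quoted value, so a real parser remains the safer choice.

```python
import re

# Step 1: find each <a ...> start tag and capture its raw attribute text.
# Assumption: no ">" inside attribute values.
TAG = re.compile(r"<a\b([^>]*)>", re.IGNORECASE)

# Step 2: one attribute at a time: a name, then an optional value that is
# double-quoted, single-quoted, or unquoted.
ATTR = re.compile(
    r"""([A-Za-z][\w:-]*)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+)))?"""
)


def anchor_attributes(html_text):
    """Return one {name: value} dict per <a> tag; valueless attributes map to None."""
    result = []
    for tag in TAG.finditer(html_text):
        attrs = {}
        for m in ATTR.finditer(tag.group(1)):
            # Exactly one of the three value groups matched, or none of them.
            value = next((g for g in m.groups()[1:] if g is not None), None)
            attrs[m.group(1).lower()] = value
        result.append(attrs)
    return result


print(anchor_attributes('<p><a href=test.html class="xyz" disabled>x</a></p>'))
```

Each expression is short enough to read and test on its own, which is the maintainability point being made here.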

innaM 2008-11-25 11:40:02

Answer 7

+6 A:

Just to agree with everyone else: don't parse HTML using regexp.

It isn't possible to create an expression that will pick out attributes for even a correct piece of HTML, never mind all the possible malformed variants. Your regexp is already pretty much unreadable even without trying to cope with the invalid lack of quotes; chase further into the horror of real-world HTML and you will drive yourself crazy with an unmaintainable blob of unreliable expressions.

There are existing libraries to either read broken HTML, or correct it into valid XHTML which you can then easily devour with an XML parser. Use them.

bobince 2008-11-25 12:43:23

Answer 8

+2 A:

Although the advice not to parse HTML via regexp is valid, here's an expression that does pretty much what you asked:

/
   \G                     # start where the last match left off
   (?>                    # begin non-backtracking expression
       .*?                # *anything* until...
       <[Aa]\b            # an anchor tag
    )??                   # lazily optional: first try to match an attribute right here,
                          #    and only skip ahead to the next anchor tag if that fails.
    \s+                   # at least one space
    ( \p{Alpha}           # Our first capture, starting with one alpha
      \p{Alnum}*          # followed by any number of alphanumeric characters
    )                     # end capture #1
    (?: \s* = \s*         # a group starting with a '=', possibly surrounded by spaces.
        (?: (['"])        # capture the opening quote, single or double
            (.*?)         # anything else
             \2           # which ever quote character we captured before
        |   ( [^>\s'"]+ ) # any number of non-( '>', space, quote ) chars
        )                 # end group
     )?                   # attribute value was optional
/msx;

"But wait," you might say. "What about *comments*?!?!" Okay, then you can replace the . in the non-backtracking section with the following, which also handles CDATA sections:

(?:[^<]|<[^!]|<![^-\[]|<!\[(?!CDATA)|<!\[CDATA\[.*?\]\]>|<!--(?:[^-]|-[^-])*-->)

Also if you wanted to run a substitution under Perl 5.10 (and I think PCRE), you can put \K right before the attribute name and not have to worry about capturing all the stuff you want to skip over.

Axeman 2008-11-26 00:39:10

My eyes! :-) I've got to give you a point for effort though!

bobince 2008-11-26 03:07:05

Answer 9

A:

A great resource for regular expressions is http://regexlib.com

Jason 2009-02-22 14:02:24

Answer 10

+2 A:

You cannot use the same name for multiple captures. Thus you cannot use a quantifier on expressions with named captures.

So either don’t use named captures:

(?:(\b\w+\b)\s*=\s*("[^"]*"|'[^']*'|[^"'<>\s]+)\s+)+

Or don’t use the quantifier on this expression:

(?<name>\b\w+\b)\s*=\s*(?<value>"[^"]*"|'[^']*'|[^"'<>\s]+)

This also allows attribute values like bar=' baz='quux:

foo="bar=' baz='quux"

The drawback is that you have to strip the leading and trailing quotes afterwards.
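
For illustration, a sketch of the second variant in Python, including the quote stripping. Python's re module spells named groups (?P<name>...) rather than (?<name>...); the helper names strip_quotes and attributes are made up for this example.

```python
import re

# The unquantified expression from this answer, with Python's (?P<...>) spelling.
ATTR = re.compile(
    r"""(?P<name>\b\w+\b)\s*=\s*(?P<value>"[^"]*"|'[^']*'|[^"'<>\s]+)"""
)


def strip_quotes(value):
    """Drop one pair of matching surrounding quotes, if present."""
    if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
        return value[1:-1]
    return value


def attributes(text):
    """Apply the expression repeatedly and return (name, value) pairs."""
    return [
        (m.group("name"), strip_quotes(m.group("value")))
        for m in ATTR.finditer(text)
    ]


print(attributes('''<a href="test.html" class='xyz' id=main>'''))
print(attributes('''foo="bar=' baz='quux"'''))
```

Looping with finditer takes the place of the quantifier, so each match has its own name and value groups.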

Gumbo 2009-02-22 14:05:23

Much more precise than my regex. +1. One note: why [^ \s] when [^\s] would suffice?

VonC 2009-02-22 15:00:30

I just copied the regular expression from the question. :)

Gumbo 2009-02-22 15:07:10

Answer 11

A:

Something like this might be helpful:

(\S+)\s*=\s*(['"])(.*?)\2

2010-07-21 20:52:16
