ansaurus

Question

Is ">" (U+003E GREATER-THAN SIGN) allowed inside an html-element attribute value?

Answer 1

A:

Yes, except that /<tag[^>]*>.*?<\/tag>/

will not match a single tag; it matches from the first start-tag to the last end-tag for a given tag. Just like your first non-greedy tag match, your in-between part should be written non-greedy as well.

Per Hornshøj-Schierbeck 2008-09-18 17:10:51

I don't understand. Could you give an example?

J.F. Sebastian 2008-09-18 17:31:28

@j-f-sebastian: With <div class='foo'><span>flo</span><div>bar</div></div>, you match the first <div but also the first </div, which belongs to the inner div.

PhiLho 2009-09-16 12:02:18
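
The nesting problem described in the comment above can be reproduced with a few lines of Python. This is only a sketch: it assumes the pattern from the answer with "tag" swapped for "div", and it adds a capture group (not in the original pattern) so the in-between part can be inspected.

```python
import re

# Pattern from the answer, with "tag" replaced by "div" and the
# in-between part wrapped in a capture group for inspection.
pattern = re.compile(r"<div[^>]*>(.*?)</div>")

html = "<div class='foo'><span>flo</span><div>bar</div></div>"

match = pattern.search(html)
outer_match = match.group(0)   # everything the pattern consumed
inner_text = match.group(1)    # what it considers the element's content

print(outer_match)  # <div class='foo'><span>flo</span><div>bar</div>
print(inner_text)   # <span>flo</span><div>bar
```

The match starts at the outer start-tag but stops at the inner end-tag, so the captured content is cut short and the outer closing tag is left behind.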

Answer 2

A:

See if you get the same result using &gt; instead of >.

Steven A. Lowe 2008-09-18 17:11:00

Answer 3

+2 A:

After reading the following:

http://www.w3.org/International/questions/qa-escapes

it looks like entity escapes are suggested everywhere (including in attributes) for <, > and &.

bmdhacks 2008-09-18 17:12:10

That document is wrong. Bare greater-than signs in content are valid. It also says that single ampersands are wrong, but this is not always the case for HTML.

Jim 2008-09-18 17:23:42

It doesn't say greater-than signs are invalid, it just recommends using entities instead--a recommendation only a fool would ignore, IMO. Who cares if it's valid, if most programmers, including the authors of many software tools, believe it isn't?

Alan Moore 2009-04-29 11:40:24

Answer 4

+1 A:

I believe that's valid, and the W3C validator agrees, but the authoritative source for this information is the ISO 8879:1986 standard, which costs ~150EUR/210USD. Regardless, it is not wrong to encode them, so if in doubt, encode. Additionally, if you are using an XML-based document type, you need to encode greater-than signs in the sequence ]]>.

Jim 2008-09-18 17:14:08
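
For the "if in doubt, encode" advice, a minimal sketch using Python's standard `html` module is shown here. The sample value and the `title` attribute are made up for illustration; nothing in the thread prescribes this particular library.

```python
import html

# Hypothetical attribute value containing characters that are often escaped.
raw_value = 'a > b & "c"'

# html.escape replaces &, < and >; quote=True also escapes quote characters.
escaped = html.escape(raw_value, quote=True)
attribute = f'title="{escaped}"'

print(attribute)  # title="a &gt; b &amp; &quot;c&quot;"

# Decoding the entities gives back the original value.
round_trip = html.unescape(escaped)
```

Escaping is lossless here: `round_trip` equals `raw_value`, so encoding a value that did not strictly need it does no harm.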

Answer 5

+2 A:

A literal > is legal everywhere in HTML content, both inside attribute values and as text within an element.

kch 2008-09-18 17:33:43

Answer 6

+5 A:

This is the textbook example everyone uses to explain why you shouldn't use regular expressions to parse HTML; you should use an HTML parser instead.

AmbroseChapel 2008-09-25 01:47:23
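
As one possible illustration of the parser approach, the sketch below uses `html.parser.HTMLParser` from the Python standard library. The class name `AttributeCollector` and the sample markup are hypothetical; any real HTML parser could stand in.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Records each start tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        self.seen.append((tag, dict(attrs)))


collector = AttributeCollector()
collector.feed('<a title="a > b" href="#">link</a>')
collector.close()

print(collector.seen)  # [('a', {'title': 'a > b', 'href': '#'})]
```

The parser treats the > inside the quoted attribute value as part of the value rather than as the end of the tag.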

Answer 7

+1 A:

If you insist on using regular expressions (which are appropriate for basic string operations), try <tag((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)>.*?<\/tag>. It should match attributes properly and therefore allow you to access the inner content (although you need to put that part in a capture group).

You may also use the Html Agility Pack for parsing HTML, which I would recommend if you are going to do a lot of parsing. Maintaining large regular expressions can easily become a headache, but at the same time they can be much more effective if you are able to keep them under control.

troethom 2008-09-25 02:13:56
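
To see what the attribute-aware pattern buys, here is a rough comparison in Python against the simpler `[^>]*` pattern discussed earlier in the thread. The named group `inner` is an addition for illustration (the answer only says a capture group is needed), and the sample input is made up.

```python
import re

html = '<tag title="a > b">inner</tag>'

# Simple pattern: the start tag ends at the first > it sees.
naive = re.compile(r"<tag[^>]*>(?P<inner>.*?)</tag>")

# Pattern from the answer, with the in-between part captured as "inner".
attribute_aware = re.compile(
    r"""<tag((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)>"""
    r"""(?P<inner>.*?)</tag>"""
)

naive_inner = naive.search(html).group("inner")
aware_inner = attribute_aware.search(html).group("inner")

print(repr(naive_inner))  # ' b">inner'
print(repr(aware_inner))  # 'inner'
```

The simple pattern stops the start tag at the > inside the attribute value, while the attribute-aware one skips over the quoted value first. Neither pattern addresses nested elements or comments.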

Answer 8

+5 A:

Yes, it is allowed (the W3C Validator accepts it and only issues a warning).

Unescaped < and > are also allowed inside comments, so such a simple regexp can be fooled.

If BeautifulSoup doesn't handle this, it could be a bug or perhaps a conscious design decision to make it more resilient to missing closing quotes in attributes.

porneL 2008-10-19 23:10:50
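
The point about comments can be sketched as follows, again in Python. The input string, the `tag` element name and the `TagTextCollector` class are hypothetical, and `html.parser` is used only as a convenient stand-in for a real HTML parser.

```python
import re
from html.parser import HTMLParser

# Markup inside the comment looks like an element but is not one.
html = "<!-- <tag>fake</tag> --><tag>real</tag>"

# The simple regexp has no notion of comments and matches inside one.
regex_result = re.search(r"<tag[^>]*>(.*?)</tag>", html).group(1)


class TagTextCollector(HTMLParser):
    """Collects text found inside <tag> elements, ignoring comments."""

    def __init__(self):
        super().__init__()
        self.inside = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "tag":
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "tag":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.texts.append(data)


collector = TagTextCollector()
collector.feed(html)
collector.close()

print(regex_result)     # fake
print(collector.texts)  # ['real']
```

The regexp reports the commented-out content, whereas the parser hands the comment to a separate handler and only sees the real element.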
