In other words, may one use the regex /<tag[^>]*>.*?<\/tag>/ to match a tag HTML element that does not contain nested tag elements?

For example (lt.html):

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
  <head>
    <title>greater than sign in attribute value</title>
  </head>
  <body>
    <div>1</div>
    <div title=">">2</div>
  </body>
</html>

Regex:

$ perl -nE'say $1 if m~<div[^>]*>(.*?)</div>~' lt.html

And screen-scraper:

#!/usr/bin/env python
import sys
import BeautifulSoup

soup = BeautifulSoup.BeautifulSoup(sys.stdin)
for div in soup.findAll('div'):
    print div.string


$ python lt.py <lt.html

Both give the same output:

1
">2

Expected output:

1
2

The W3C says:

Attribute values are a mixture of text and character references, except with the additional restriction that the text cannot contain an ambiguous ampersand.

A: 
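Yes, except that /<tag[^>]*>.*?<\/tag>/ is not guaranteed to match a single element. It pairs the first start tag with the first end tag that follows it, so when tag elements are nested the match ends at the inner element's end tag rather than the outer one's. The .*? in between is already non-greedy; making it greedy would instead run on to the last end tag it can reach.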

Per Hornshøj-Schierbeck
I don't understand. Could you give an example?
J.F. Sebastian
@j-f-sebastian: With <div class='foo'><span>flo</span><div>bar</div></div>, you match the first <div but also the first </div>.
PhiLho
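
The nested case from this comment can be checked directly. A minimal Python sketch, assuming the question's pattern with tag replaced by div (the names NAIVE_DIV and nested are illustrative, not from the thread):

```python
import re

# The question's pattern with "tag" replaced by "div".
NAIVE_DIV = re.compile(r"<div[^>]*>.*?</div>")

nested = "<div class='foo'><span>flo</span><div>bar</div></div>"

match = NAIVE_DIV.search(nested)
# The match starts at the outer <div ...> but stops at the inner </div>,
# so the outer element's own end tag is left out of the match.
print(match.group(0))
```

The printed text ends after the first </div>, which is why the pattern is only safe when elements of the same name are not nested.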
A: 

See if you get the same result using &gt; instead of >.

Steven A. Lowe
+2  A: 

After reading the following:

http://www.w3.org/International/questions/qa-escapes

it looks like entity escapes are recommended everywhere (including in attribute values) for <, > and &.

bmdhacks
That document is wrong. Bare greater-than signs in content are valid. It also says that single ampersands are wrong, but this is not always the case for HTML.
Jim
It doesn't say greater-than signs are invalid, it just recommends using entities instead--a recommendation only a fool would ignore, IMO. Who cares if it's valid, if most programmers, including the authors of many software tools, believe it isn't?
Alan Moore
+1  A: 

I believe that's valid, and the W3C validator agrees, but the authoritative source for this information is the ISO 8879:1986 standard, which costs ~150EUR/210USD. Regardless, it is not wrong to encode them, so if in doubt, encode. Additionally, if you are using an XML-based document type, you need to encode greater-than signs in the sequence ]]>.
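
As a rough illustration of "if in doubt, encode", the sketch below uses Python's standard html.escape to encode the attribute value and then runs the question's simple pattern over both variants. The variable names are made up for the example, and this only shows why encoding sidesteps the problem, not that encoding is required:

```python
import html
import re

# The question's pattern, with a capture group around the element content.
NAIVE_DIV = re.compile(r"<div[^>]*>(.*?)</div>")

raw_title = ">"
escaped_title = html.escape(raw_title, quote=True)  # ">" becomes "&gt;"

unescaped_markup = '<div title="' + raw_title + '">2</div>'
escaped_markup = '<div title="' + escaped_title + '">2</div>'

# With the bare ">", [^>]* stops inside the attribute value.
print(NAIVE_DIV.findall(unescaped_markup))
# With "&gt;", the first ">" really is the end of the start tag.
print(NAIVE_DIV.findall(escaped_markup))
```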

Jim
+2  A: 

A literal > is legal everywhere in HTML content, both inside attribute values and as text within an element.

kch
+5  A: 

This is the textbook example everyone uses to explain why you shouldn't use regular expressions to parse HTML; use an HTML parser instead.
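
For illustration, here is a minimal sketch using the html.parser module from Python's standard library, a different parser from the BeautifulSoup version in the question. The class name DivTextCollector and the depth-counting approach are assumptions made for this example:

```python
from html.parser import HTMLParser


class DivTextCollector(HTMLParser):
    """Collect the text content of each outermost <div> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many <div> elements are currently open
        self.texts = []     # finished text of each outermost <div>
        self._buffer = []   # text pieces of the <div> being read

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.depth == 0:
                self._buffer = []
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.texts.append("".join(self._buffer))

    def handle_data(self, data):
        if self.depth > 0:
            self._buffer.append(data)


sample = '<body>\n<div>1</div>\n<div title=">">2</div>\n</body>'
collector = DivTextCollector()
collector.feed(sample)
collector.close()
print(collector.texts)
```

The parser treats the quoted ">" as part of the title attribute, so the second div yields 2 rather than ">2.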

AmbroseChapel
+1  A: 

If you insist on using regular expressions (which are appropriate for basic string operations), try <tag((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)>.*?<\/tag>. It should match attributes correctly and therefore allow you to access the inner content (although you need to put that part in a capture group).
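
A quick way to try this pattern against the question's sample is sketched below in Python. It is an adaptation rather than the exact expression above: tag is replaced by div, the attribute groups are made non-capturing, and a capture group is added around the content so that findall returns only the inner text. The names are illustrative:

```python
import re

# The question's simple pattern, for comparison.
NAIVE_DIV = re.compile(r"<div[^>]*>(.*?)</div>")

# Attribute-aware pattern adapted from the expression above:
# quoted attribute values are consumed whole, so a ">" inside
# quotes no longer ends the start tag.
ATTR_AWARE_DIV = re.compile(
    r"""<div(?:(?:\s+\w+(?:\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)>(.*?)</div>"""
)

sample = '<div>1</div>\n<div title=">">2</div>\n'

print(NAIVE_DIV.findall(sample))
print(ATTR_AWARE_DIV.findall(sample))
```

On this sample the simple pattern returns 1 and ">2, while the attribute-aware one returns 1 and 2. It still does not handle nested div elements or comments.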

You may also use the Html Agility Pack for parsing HTML, which I would recommend if you are going to do a lot of parsing. Maintaining large regular expressions can easily become a headache, although they can also be very effective if you are able to keep them under control.

troethom
+5  A: 

Yes, it is allowed (the W3C Validator accepts it and only issues a warning).

Unescaped < and > are also allowed inside comments, so such a simple regexp can be fooled.
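
A small sketch of the comment case, using Python's re module and the standard html.parser for comparison. The markup string and the class name CommentAwareParser are invented for the example:

```python
import re
from html.parser import HTMLParser

NAIVE_DIV = re.compile(r"<div[^>]*>(.*?)</div>")

markup = "<!-- <div>hidden</div> a > b --><div>shown</div>"

# The regexp has no notion of comments, so it reports both divs.
print(NAIVE_DIV.findall(markup))


class CommentAwareParser(HTMLParser):
    """Record comments and start tags separately."""

    def __init__(self):
        super().__init__()
        self.comments = []
        self.start_tags = []

    def handle_comment(self, data):
        self.comments.append(data)

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


parser = CommentAwareParser()
parser.feed(markup)
parser.close()

# The parser sees one comment and only the one real <div>.
print(parser.comments)
print(parser.start_tags)
```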

If BeautifulSoup doesn't handle this, it could be a bug or perhaps a conscious design decision to make it more resilient to missing closing quotes in attributes.

porneL