ansaurus

Question

Regular Expression to Extract HTML Body Content

Answer 1

+4 A:

XHTML is more easily parsed with an XML parser than with a regex. I know it's not what you're asking, but an XML parser can quickly navigate to the body node and give you back its content without the tag-matching problems that the regex is giving you.

EDIT: In response to a comment here that an XML parser is too slow:

There are two kinds of XML parser. DOM is big and heavy but easy and friendly: it builds a tree out of the whole document before you can do anything. SAX is fast and light but more work: it reads the file sequentially. You will want SAX to find the body tag.

The DOM method is good for repeated use: pulling out tags and finding which node is whose child. The SAX parser reads across the file in order and will quickly get the information you are after. The regex won't be any faster than a SAX parser, because both simply walk across the file and pattern match, except that the regex won't stop looking after it has found the body tag, because regex has no built-in knowledge of XML. In fact, your SAX parser probably uses small pieces of regex to find each tag.

Karl 2008-12-10 15:04:25
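The SAX suggestion above could look roughly like this. It is a minimal sketch in Python's standard xml.sax module, not the .NET API the asker is using. It assumes the input is well-formed XHTML with no DTD-defined entities such as &nbsp;. The names BodyFound, BodyTextHandler and body_text are made up for illustration. The handler collects only the text inside the body element; getting the markup back would mean re-serializing tags in startElement and endElement.

```python
import xml.sax


class BodyFound(Exception):
    """Raised by the handler to stop parsing once </body> is reached."""


class BodyTextHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.depth = 0   # greater than 0 while we are inside <body>
        self.parts = []  # text chunks seen inside <body>

    def startElement(self, name, attrs):
        # Start counting at <body>, then track nesting below it.
        if self.depth or name == "body":
            self.depth += 1

    def characters(self, content):
        if self.depth:
            self.parts.append(content)

    def endElement(self, name):
        if self.depth:
            self.depth -= 1
            if self.depth == 0:
                # </body> just closed: nothing after it is of interest.
                raise BodyFound


def body_text(xhtml: str) -> str:
    handler = BodyTextHandler()
    try:
        xml.sax.parseString(xhtml.encode("utf-8"), handler)
    except BodyFound:
        pass
    return "".join(handler.parts)
```

Raising an exception from the handler is a common way to make a SAX parse quit early, which is the advantage over the regex described in the answer.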

No reason to re-invent the wheel. If it's XHTML, it's XML, and an XML parser is the tool for the job. +1

Adam Jaskiewicz 2008-12-10 15:09:59

This was the first solution I tried, but it appeared to be running pretty slow. I figured RegEx would be faster.

Matthew Ruston 2008-12-10 15:13:18

There are two kinds of XML parser. DOM is big and heavy but easy and friendly: it builds a tree out of the whole document before you can do anything. SAX is fast and light but more work: it reads the file sequentially. You will want SAX to find the body tag.

Karl 2008-12-10 15:19:48

This is an extremely simple job for a parser; it really shouldn't be slow.

annakata 2008-12-10 15:21:22

I tried it originally with .NET's System.Xml.XmlDocument class, if that explains any of the slowness.

Matthew Ruston 2008-12-10 15:30:47

Even if it is slower, it will handle all of the exceptional cases like name="</body>", etc.

Max 2008-12-12 07:36:18

Answer 2

+5 A:

Would this work?

((?:.(?!<body[^>]*>))+.<body[^>]*>)|(</body\>.+)

Of course, you need to add the necessary \s in order to take into account < body ...> (element with spaces), as in:

((?:.(?!<\s*body[^>]*>))+.<\s*body[^>]*>)|(<\s*/\s*body\s*\>.+)

On second thought, I am not sure why I needed a negative look-ahead... This should also work (for a well-formed xhtml document):

(.*<\s*body[^>]*>)|(<\s*/\s*body\s*\>.+)

VonC 2008-12-10 15:05:25
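For anyone who wants to try the last pattern, here is a small sketch in Python (an assumption on my part; the thread itself is about .NET). It assumes the pattern is meant to match everything that should be thrown away, so the matches are replaced with an empty string, and that "." has to match newlines too (re.DOTALL here, RegexOptions.Singleline in .NET). The function name strip_outside_body is made up for the example.

```python
import re

# The simplified pattern from the answer, unchanged; DOTALL lets "." cross line breaks.
OUTSIDE_BODY = re.compile(r"(.*<\s*body[^>]*>)|(<\s*/\s*body\s*\>.+)", re.DOTALL)


def strip_outside_body(document: str) -> str:
    # Remove everything up to and including the opening body tag,
    # and everything from the closing body tag onward.
    return OUTSIDE_BODY.sub("", document)
```

Note that the second alternative ends in ".+", so a document that stops right at </body> with nothing after it would keep its closing tag.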

Second one did the trick for me. Thanks.

Matthew Ruston 2008-12-10 15:43:30

Mmm, looks like a good case for demonstrating REs shouldn't be used against (unknown) HTML: <body onload="DoSomething('>');"> is valid... :-)

PhiLho 2008-12-10 16:10:56
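PhiLho's counterexample is easy to reproduce. The sketch below (Python, used only for illustration) runs the opening-tag part of the pattern against that body tag and shows it stopping at the ">" inside the attribute value. The quote-aware variant after it is my own addition, not something from the thread. It copes with quoted attribute values but is still not an HTML parser (comments, CDATA and script content can fool it).

```python
import re

# Opening-tag part of the pattern from the answer above.
NAIVE_BODY_OPEN = re.compile(r"<\s*body[^>]*>")

# Hypothetical variant: skip over whole quoted attribute values.
QUOTE_AWARE_BODY_OPEN = re.compile(r"""<\s*body(?:[^>"']|"[^"]*"|'[^']*')*>""")

tag = """<body onload="DoSomething('>');">"""
doc = "<html>" + tag + "<p>Hi</p></body></html>"

# Stops at the first ">", which sits inside the onload attribute.
naive_match = NAIVE_BODY_OPEN.search(doc).group(0)

# Treats the quoted value as one unit and reaches the real end of the tag.
quote_aware_match = QUOTE_AWARE_BODY_OPEN.search(doc).group(0)
```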

Answer 3

A:

(.*)<body([^>]*)>(.*)</body>(.*)

replace with

\3

Kev 2008-12-10 15:07:09

This should match the entire document and put the body into \3. If it doesn't match the entire document, you know the document's formatting has something else to consider, and you can throw an error.

Kev 2008-12-10 15:09:55
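A rough sketch of that idea in Python (the language is my choice, not the thread's): match the whole document, take group 3, and raise if the document does not have the expected shape. It assumes "." should match newlines, hence re.DOTALL. The name extract_body is made up for the example.

```python
import re

# Kev's pattern, unchanged; DOTALL lets ".*" span line breaks.
WHOLE_DOC = re.compile(r"(.*)<body([^>]*)>(.*)</body>(.*)", re.DOTALL)


def extract_body(document: str) -> str:
    match = WHOLE_DOC.fullmatch(document)
    if match is None:
        # The document has no <body ...>...</body> pair in the expected form.
        raise ValueError("document does not contain a <body>...</body> pair")
    return match.group(3)
```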

Answer 4

+2 A:

Why can't you just split it by

</{0,1}body[^>]*>

and take the second string? I believe it will be much faster than looking for a huge regexp.

Max 2008-12-10 15:07:27
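The split idea might be sketched like this in Python (again only for illustration; the thread is about .NET). It uses the shorter spelling </?body[^>]*> of the same pattern and assumes the document contains exactly one opening and one closing body tag. The name body_by_split is made up.

```python
import re

# Matches either the opening or the closing body tag.
BODY_TAG = re.compile(r"</?body[^>]*>")


def body_by_split(document: str) -> str:
    # Expected pieces: [before <body>, body content, after </body>].
    parts = BODY_TAG.split(document)
    if len(parts) != 3:
        raise ValueError("expected exactly one <body> and one </body> tag")
    return parts[1]
```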

Because his initial body tag has an attribute...

Kev 2008-12-10 15:08:13

That said, if you fix that your approach may be simpler. :)

Kev 2008-12-10 15:08:55

Well, I've just noticed it before you posted the comment and edited this answer :P

Max 2008-12-10 15:09:39

That's </?body[^>]*>

Tomalak 2008-12-10 15:34:18

I don't actually have enough points to edit...must've been someone else.

Kev 2008-12-10 15:54:10
