I've been playing around with RegExBuddy for over an hour trying to figure out what I thought would be a trivial RegEx. I am looking for a RegEx that will let me extract just the HTML content between the body tags of an XHTML document.

The XHTML files that I need to parse will be very simple; I do not have to worry about JavaScript content or <![CDATA[ sections, for example.

Below is the expected structure of the HTML files that I have to parse. Since I know exactly what content the files will contain, this HTML snippet pretty much covers my entire use case. If I can get a RegEx to extract the body of this example, I'll be happy.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>
    </title>
  </head>
  <body contenteditable="true">
    <p>
      Example paragraph content
    </p>
    <p>
      &nbsp;
    </p>
    <p>
      <br />
      &nbsp;
    </p>
    <h1>Header 1</h1>
  </body>
</html>

Conceptually, I've been trying to build a RegEx string that matches everything BUT the inner body content. With this, I would use the C# Regex.Split() method to obtain the body content. I thought the statement ((.|\n)*<body (.)*>)|((</body>(*|\n)*) would do the trick, but it doesn't seem to work at all with my test content in RegExBuddy.

+4  A: 

XHTML is more easily parsed with an XML parser than with a regex. I know it's not what you're asking, but an XML parser can quickly navigate to the body node and give you back its content without any of the tag-matching problems that the regex is giving you.
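
As a rough illustration of navigating straight to the body node, here is a minimal sketch. It uses Python's standard xml.etree.ElementTree rather than the C# the question mentions, and the SAMPLE string and the body_children helper are made up for this example. The sample deliberately leaves out the DOCTYPE and the &nbsp; entities from the question: the standard-library parser does not load the DTD that defines &nbsp;, so that entity would need separate handling.

```python
import xml.etree.ElementTree as ET

# XHTML elements live in this namespace, so lookups must include it.
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical, simplified stand-in for the document in the question.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title></title></head>
  <body contenteditable="true">
    <p>Example paragraph content</p>
    <h1>Header 1</h1>
  </body>
</html>"""


def body_children(document):
    """Return (tag, text) pairs for the direct children of <body>."""
    root = ET.fromstring(document)
    body = root.find(f"{{{XHTML_NS}}}body")
    if body is None:
        raise ValueError("no <body> element found")
    # Tags come back as "{namespace}name"; keep only the local name.
    return [(child.tag.split("}", 1)[1], child.text) for child in body]


print(body_children(SAMPLE))
```

The parser handles attributes on the body tag, whitespace inside tags, and quoting without any special cases in the calling code.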

EDIT: In response to a comment here saying that an XML parser is too slow:

There are two kinds of XML parser. One, called DOM, is big and heavy but easy and friendly; it builds a tree out of the whole document before you can do anything. The other, called SAX, is fast and light but more work; it reads the file sequentially. You will want SAX to find the body tag.

The DOM method is good for repeated use: pulling out tags and finding which node is whose child. The SAX parser reads across the file in order and will quickly get the information you are after. The regex won't be any faster than a SAX parser, because they both simply walk across the file and pattern match, except that the regex won't stop looking after it has found a body tag, because a regex has no built-in knowledge of XML. In fact, your SAX parser probably does similar small-scale pattern matching to find each tag.
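
To make the sequential approach concrete, here is a minimal sketch of a SAX-style handler that rebuilds the markup inside the body as events arrive. It is written in Python with the standard xml.sax module rather than C# (in .NET the closest built-in analogue is the forward-only XmlReader). The BodyExtractor class, the extract_body helper and the SAMPLE bytes are names invented for this example, and the sample omits the DOCTYPE and &nbsp; entities for simplicity.

```python
import xml.sax
from xml.sax.saxutils import escape, quoteattr


class BodyExtractor(xml.sax.ContentHandler):
    """Collects the serialized content found between <body> and </body>."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # 0 = outside <body>, 1 = directly inside it
        self.parts = []  # pieces of markup, joined at the end

    def startElement(self, name, attrs):
        if self.depth > 0:
            rendered = "".join(f" {k}={quoteattr(v)}" for k, v in attrs.items())
            self.parts.append(f"<{name}{rendered}>")
            self.depth += 1
        elif name == "body":
            self.depth = 1

    def endElement(self, name):
        if self.depth > 1:  # a closing tag of something nested in <body>
            self.parts.append(f"</{name}>")
        if self.depth > 0:  # also covers </body> itself, which is not emitted
            self.depth -= 1

    def characters(self, content):
        if self.depth > 0:
            self.parts.append(escape(content))


# Hypothetical, simplified stand-in for the document in the question.
SAMPLE = b"""<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title></title></head>
  <body contenteditable="true"><p>Example paragraph content</p><h1>Header 1</h1></body>
</html>"""


def extract_body(document):
    handler = BodyExtractor()
    xml.sax.parseString(document, handler)
    return "".join(handler.parts)


print(extract_body(SAMPLE))
```

Note that a handler like this writes empty elements such as <br /> back out as <br></br>, and it keeps reading to the end of the document unless it is stopped explicitly.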

Karl
No reason to re-invent the wheel. If it's XHTML, it's XML, and an XML parser is the tool for the job. +1
Adam Jaskiewicz
This was the first solution I tried, but it appeared to be running pretty slowly. I figured RegEx would be faster.
Matthew Ruston
There are two kinds of XML parser. One, called DOM, is big and heavy but easy and friendly; it builds a tree out of the whole document before you can do anything. The other, called SAX, is fast and light but more work; it reads the file sequentially. You will want SAX to find the body tag.
Karl
This is an extremely simple job for a parser; it really shouldn't be slow.
annakata
I originally tried it with .NET's System.Xml.XmlDocument class, if that explains any of the slowness.
Matthew Ruston
Even if it is slower, it will handle all of the exceptional cases like name="</body>" etc...
Max
+5  A: 

Would this work?

((?:.(?!<body[^>]*>))+.<body[^>]*>)|(</body\>.+)

Of course, you need to add the necessary \s in order to take into account < body ...> (element with spaces), as in:

((?:.(?!<\s*body[^>]*>))+.<\s*body[^>]*>)|(<\s*/\s*body\s*\>.+)

On second thought, I am not sure why I needed a negative look-ahead... This should also work (for a well-formed XHTML document):

(.*<\s*body[^>]*>)|(<\s*/\s*body\s*\>.+)
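
For anyone who wants to try the split idea, here is a minimal sketch in Python rather than C#. It makes two changes to the last pattern above, both of which are my assumptions about what is needed in practice: the groups are non-capturing, because a split on capturing groups (in Python's re.split and in .NET's Regex.Split alike) also returns the captured text, and the dot is allowed to match newlines (re.DOTALL here, RegexOptions.Singleline in .NET). The SAMPLE string and the body_via_split helper are made up for the example.

```python
import re

# Same shape as the last pattern above, with non-capturing groups and
# DOTALL so that "." also matches the newlines in the document.
SPLITTER = re.compile(r"(?:.*<\s*body[^>]*>)|(?:<\s*/\s*body\s*>.+)", re.DOTALL)

# Hypothetical, simplified stand-in for the document in the question.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title></title></head>
  <body contenteditable="true">
    <h1>Header 1</h1>
  </body>
</html>"""


def body_via_split(document):
    parts = SPLITTER.split(document)
    # Expected shape: ["", <body content>, ""]
    if len(parts) != 3:
        raise ValueError("document does not split into head, body and tail")
    return parts[1]


print(body_via_split(SAMPLE).strip())
```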
VonC
Second one did the trick for me. Thanks.
Matthew Ruston
Mmm, looks like a good case for demonstrating REs shouldn't be used against (unknown) HTML: <body onload="DoSomething('>');"> is valid... :-)
PhiLho
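
The problem raised in the comment above can be reproduced with a short sketch (Python rather than C#; the TAG constant and the matched_open_tag helper are made up for the example). The [^>]* part of the pattern stops at the first > it sees, even when that character sits inside a quoted attribute value.

```python
import re

# The opening-tag part of the patterns discussed in this answer.
OPEN_BODY = re.compile(r"<\s*body[^>]*>")

# The tag from the comment: the attribute value contains a ">".
TAG = "<body onload=\"DoSomething('>');\">"


def matched_open_tag(markup):
    """Return what the regex believes the opening body tag is."""
    match = OPEN_BODY.search(markup)
    return match.group(0) if match else None


print(matched_open_tag(TAG))  # stops at the ">" inside the attribute value
```

For files whose content is fully known in advance, as in the question, this may never come up. For arbitrary input it is a reason to prefer a parser.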
A: 
(.*)<body([^>]*)>(.*)</body>(.*)

replace with

\3
Kev
This should match the entire document and put the body into \3. So you know if it doesn't match the entire document that the current document's formatting has something else to consider, and you can throw an error.
Kev
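
A minimal sketch of this whole-document match, including the error for documents that do not fit. It is in Python rather than C#, it assumes the dot has to match newlines (re.DOTALL), and the names WHOLE_DOC, SAMPLE and body_via_groups are invented for the example.

```python
import re

# The pattern from this answer; DOTALL lets "." span multiple lines.
WHOLE_DOC = re.compile(r"(.*)<body([^>]*)>(.*)</body>(.*)", re.DOTALL)

# Hypothetical, simplified stand-in for the document in the question.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title></title></head>
  <body contenteditable="true">
    <h1>Header 1</h1>
  </body>
</html>"""


def body_via_groups(document):
    match = WHOLE_DOC.fullmatch(document)
    if match is None:
        # The document is not shaped the way we assumed, so fail loudly.
        raise ValueError("document does not contain <body ...>...</body>")
    return match.group(3)


print(body_via_groups(SAMPLE).strip())
```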
+2  A: 

Why can't you just split it by

</{0,1}body[^>]*>

and take the second string? I believe it will be much faster than matching against a huge regexp.
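
A minimal sketch of that split, in Python rather than C#, using the shorter </?body[^>]*> spelling of the same pattern. The SAMPLE string and the body_between_tags helper are made up for the example, and the length check is an extra safeguard that is not part of the suggestion itself.

```python
import re

# Matches either the opening or the closing body tag.
BODY_TAG = re.compile(r"</?body[^>]*>")

# Hypothetical, simplified stand-in for the document in the question.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title></title></head>
  <body contenteditable="true">
    <h1>Header 1</h1>
  </body>
</html>"""


def body_between_tags(document):
    parts = BODY_TAG.split(document)
    # Expected shape: [before <body>, body content, after </body>]
    if len(parts) != 3:
        raise ValueError("expected exactly one <body> and one </body>")
    return parts[1]


print(body_between_tags(SAMPLE).strip())
```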

Max
Because his initial body tag has an attribute...
Kev
That said, if you fix that your approach may be simpler. :)
Kev
Well, I've just noticed it before you posted the comment and edited this answer :P
Max
That's </?body[^>]*>
Tomalak
I don't actually have enough points to edit...must've been someone else.
Kev