I've been playing around with RegExBuddy for over an hour trying to figure out what I thought would be a trivial RegEx. I am looking for a RegEx that will let me extract just the HTML content between the body tags of an XHTML document.
The XHTML files that I need to parse will be very simple; I do not have to worry about JavaScript content or <![CDATA[ tags, for example.
Below is the expected structure of the HTML files that I have to parse. Since I know exactly what content these files will contain, this HTML snippet pretty much covers my entire use case. If I can get a RegEx to extract the body of this example, I'll be happy.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>
</title>
</head>
<body contenteditable="true">
<p>
Example paragraph content
</p>
<p>
</p>
<p>
<br />
</p>
<h1>Header 1</h1>
</body>
</html>
Conceptually, I've been trying to build a RegEx that matches everything BUT the inner body content. With this, I would use the C# Regex.Split() function to obtain the body content. I thought the pattern ((.|\n)*<body (.)*>)|((</body>(*|\n)*) would do the trick, but it doesn't seem to work at all with my test content in RegExBuddy.
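For comparison, an alternative to splitting on everything outside the body is to capture the inner content directly with a single group. The following is only a minimal sketch under some assumptions: the document contains exactly one body element, no attribute value on the body tag contains a ">" character, and the names BodyExtractor and ExtractBody are hypothetical, chosen for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BodyExtractor
{
    // Singleline makes '.' match newlines as well; the lazy quantifier (.*?)
    // stops at the first closing </body> tag instead of the last one.
    // Assumption: no '>' appears inside the body tag's attribute values.
    private static readonly Regex BodyPattern = new Regex(
        @"<body\b[^>]*>(.*?)</body>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Returns the text between the body tags, or null if no body element is found.
    public static string ExtractBody(string xhtml)
    {
        Match match = BodyPattern.Match(xhtml);
        return match.Success ? match.Groups[1].Value : null;
    }
}

public static class Program
{
    public static void Main()
    {
        // Shortened stand-in for the sample document above.
        string xhtml =
            "<html><head><title></title></head>\n" +
            "<body contenteditable=\"true\"><p>Example paragraph content</p></body></html>";

        Console.WriteLine(BodyExtractor.ExtractBody(xhtml));
        // prints: <p>Example paragraph content</p>
    }
}
```

Capturing the body in a group means the pattern only has to describe the two body tags, not the whole rest of the document, which is where the Split-based pattern above gets complicated.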