Hello, I have HTML markup with some text outside of it (leading or trailing), and I want to extract only that outside text. What regex should I use? For example:

some text over here
<Html>
<Title>website</Title>
<Body>
text text text
<Div>xxxxx</Div>
</Body>
</Html>
ending text

So I should get only "some text over here" and "ending text". All the HTML, including the text inside every tag, should be discarded.

Another example:

abcdef<div>xyz</div>

It should return "abcdef"

Any approach or suggestion would be greatly appreciated. Thank you

+2  A: 

I personally wouldn't use regex for this. I don't know whether an alternative is available to you, but if you can load the HTML fragment into some kind of DOM, you should be able to find all the tags and their children easily and strip them out.
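As a rough sketch of the parser-based idea: this is not a full DOM. It uses the tokenizer from Python's standard library (`html.parser`) with a depth counter, and keeps only text seen while no element is open. The names `OutsideTextCollector` and `outside_text` are made up for illustration, and the sketch assumes every non-void element is explicitly closed.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class OutsideTextCollector(HTMLParser):
    """Collect text that is not nested inside any element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # number of currently open elements
        self.pieces = []    # text seen while depth == 0

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth = max(0, self.depth - 1)  # ignore stray closing tags

    def handle_data(self, data):
        # Comments never reach this method, so tags inside comments are harmless.
        if self.depth == 0 and data.strip():
            self.pieces.append(data.strip())


def outside_text(markup):
    parser = OutsideTextCollector()
    parser.feed(markup)
    parser.close()  # flush any buffered trailing text
    return parser.pieces
```

With this, `outside_text("abcdef<div>xyz</div>")` returns `["abcdef"]`. An unclosed element such as a bare `<p>` would leave the depth above zero and hide the trailing text, so real-world markup may need a more forgiving parser.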

I can't see your examples, but if you have the special case where your outside text is always at the beginning or end of the input, then something like this should work:

^(.*?)<.*>(.*?)$ with the first and second capture groups matching the text you want (the dot has to match newlines if the markup spans several lines). If, however, you can have

text<b>HTML</b>text<b>HTML</b>text

or, worse, multiply nested HTML, and you want the output to be "texttexttext", then the regular expressions are likely to get very complicated, I'd think.
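For the simple leading/trailing case, the pattern can be tried out in Python roughly like this. It is a sketch that assumes a single block of markup and no `<` or `>` in the outside text; `edge_text` is a hypothetical name.

```python
import re

# DOTALL lets "." cross line breaks, which multi-line markup needs.
EDGE_TEXT = re.compile(r"^(.*?)<.*>(.*?)$", re.DOTALL)


def edge_text(text):
    """Return (leading, trailing) text around one block of markup."""
    match = EDGE_TEXT.match(text)
    if match is None:        # no tag at all: everything is plain text
        return text.strip(), ""
    return match.group(1).strip(), match.group(2).strip()
```

Because `<.*>` is greedy, everything between the first `<` and the last `>` is swallowed, so for `text<b>HTML</b>text<b>HTML</b>text` the middle "text" is lost and only the two outer pieces come back.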

Chris
Don't forget `<html><body><!-- tell them about ending the document with </html> --> <p>Hi, we're going to talk about HTML!</p> ... </body></html>`, even if you do match up the tags. You *need* to parse the HTML properly, and that can't be done with regexes.
Andrzej Doyle
Excellent point. I'd forgotten all about the nightmares of tags in comments and suchlike. :)
Chris
A: 

Search for

(.*?)<.*>(.*?)

and replace with

$1 $2

That should do it, provided the dot is allowed to match newlines and the text before or after the HTML document never contains < or >. If it might, things get a bit more complex. Depending on what the file looks like, you can remove everything from the opening HTML tag or doctype through the closing HTML tag (ignoring case):

(.*?)<(!doctype|html).*</html>(.*?)

and replace with

$1 $3
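
In Python, the two search-and-replace steps could look roughly like the sketch below. Python writes backreferences as `\1` rather than `$1`, and the function names are made up for illustration. The second pattern is written with `!doctype` here, since a doctype declaration starts with `<!`.

```python
import re

# DOTALL: "." also matches newlines. IGNORECASE: <HTML>, <Html>, <html> all match.
FLAGS = re.DOTALL | re.IGNORECASE


def strip_markup(text, pattern=r"(.*?)<.*>(.*?)", repl=r"\1 \2"):
    """Replace the markup with a space, then collapse runs of whitespace."""
    replaced = re.sub(pattern, repl, text, flags=FLAGS)
    return " ".join(replaced.split())


def strip_html_document(text):
    """Variant that only removes a complete <!doctype ...> or <html> ... </html> block."""
    return strip_markup(text, r"(.*?)<(!doctype|html).*</html>(.*?)", r"\1 \3")
```

The second variant tolerates `<` or `>` in the surrounding text, but it leaves the input unchanged when there is no `<html>` element at all, as in the `abcdef<div>xyz</div>` example.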
Sylverdrag