I am trying to write a regex that matches the entire contents of a tag, minus any leading or trailing whitespace. Here is a boiled-down example of the input:

<tag> text </tag>

I want only the following to be matched (note how the whitespace before and after the match has been trimmed):

"text"

I am currently trying to use this regex in .NET (PowerShell):

(?<=<tag>(\s)*).*?(?=(\s)*</tag>)

However, this regex matches "text" plus the leading whitespace inside of the tag, which is undesired. How can I fix my regex to work as expected?

+3  A: 

You should not use regex to parse HTML.

Use a parser instead.

Also: http://stackoverflow.com/questions/3817821/regex-to-remove-body-tag-attributes-c

Also also: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

If all that doesn't convince you, then don't use the dot in the middle of your expression. The dot matches any character, whitespace included, so it is consuming the leading whitespace. Use the word-character escape, \w (I think), instead.
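
As a rough illustration of that last suggestion, here is a C# sketch that swaps the lazy dot in the asker's pattern for \w+. The class and method names are hypothetical, and it assumes the tag contents are a single word: \w does not match spaces, so contents such as "two words" would not match at all.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not part of the original answer.
static class WordOnlyMatch
{
    // The asker's lookarounds with the lazy dot replaced by \w+.
    // .NET permits the variable-length lookbehind used here.
    const string Pattern = @"(?<=<tag>\s*)\w+(?=\s*</tag>)";

    // Returns the matched word, or an empty string when nothing matches.
    public static string Find(string input)
    {
        Match m = Regex.Match(input, Pattern);
        return m.Success ? m.Value : string.Empty;
    }

    static void Main()
    {
        // Single-word contents: prints "[text]".
        Console.WriteLine("[" + Find("<tag> text </tag>") + "]");
        // Multi-word contents: \w cannot cross the space, so prints "[]".
        Console.WriteLine("[" + Find("<tag> two words </tag>") + "]");
    }
}
```

If the contents can contain spaces, this variant is too strict, which is one more reason to prefer a capturing group or a parser.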

JoshD
Thanks for the answer and the comment. I was only looking for some regex pointers on this particular question; however, because of your answer and the links you posted, I am going to look into using .NET's XmlReader to parse our KML files instead of the way we're currently doing it.
Dark Lord Kvl
A: 

Use these regular expressions to strip leading and trailing whitespace: /^\s+/ and /\s+$/
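
A minimal C# sketch of how those two patterns might be applied, assuming the tag contents have already been extracted by some other means. The helper name is hypothetical, and the two patterns are combined into one alternation for brevity.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
static class RegexTrim
{
    // ^\s+ removes leading whitespace; \s+$ removes trailing whitespace.
    public static string Strip(string s)
    {
        return Regex.Replace(s, @"^\s+|\s+$", "");
    }

    static void Main()
    {
        // Prints "[text]".
        Console.WriteLine("[" + Strip("   text  ") + "]");
    }
}
```

For a plain string like this, the built-in string.Trim() gives the same result without a regex.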

Ruel
A: 
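        // Requires: using System; using System.Text.RegularExpressions;
        string test = "<tag>     test    </tag>";
        // Capture everything between the tags, then trim the whitespace in code.
        string pattern3 = @"<tag>(.*?)</tag>";
        Console.WriteLine("{0}", Regex.Match(test, pattern3).Groups[1].Value.Trim());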
Les
+1  A: 

Drop the lookarounds; they just make the job more complicated than it needs to be. Instead, use a capturing group to pick out the part you want:

<tag>\s*(.*?)\s*</tag>

In PowerShell, after a successful -match, the part you want is available as $matches[1].
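
For anyone calling this from C# rather than PowerShell, the same idea might look like the sketch below. The class and method names are hypothetical; the pattern is the one given above, and the captured text is read from Groups[1].

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
static class TagText
{
    // \s* on each side of the group absorbs the surrounding whitespace,
    // so group 1 holds only the trimmed contents.
    const string Pattern = @"<tag>\s*(.*?)\s*</tag>";

    // Returns the trimmed contents, or an empty string when there is no match.
    public static string Extract(string input)
    {
        Match m = Regex.Match(input, Pattern);
        return m.Success ? m.Groups[1].Value : string.Empty;
    }

    static void Main()
    {
        // Prints "[text]".
        Console.WriteLine("[" + Extract("<tag> text </tag>") + "]");
        // Inner spaces are kept: prints "[some text]".
        Console.WriteLine("[" + Extract("<tag>   some text </tag>") + "]");
    }
}
```

Note that the dot does not match newlines by default, so contents spanning several lines would need RegexOptions.Singleline.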

Alan Moore
Thanks! This was the type of tip I was looking for, and it works great.
Dark Lord Kvl