Hi,

I'm trying to use regex to replace any whitespace with "&nbsp;" inside example HTML such as:

<span someattr="a">and some words with spaces</span>

It's a desktop app, and this HTML goes to and from a third-party control. I don't have the luxury of using any kind of HTML parser, so I'm stuck with regex.

I can't seem to come up with a regex that matches only the whitespace inside any number of span tags.

Thanks

+1  A: 

Regex on its own is a poor fit for nested data. Your best bet if you can't use a third-party parser is to bite the bullet and write some code - perhaps using a parser generator - to parse the nesting.

(That said, check the documentation for your regex library; you may find it has extensions to aid parsing of nested data, e.g. .NET's balancing groups construct.)
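To make the balancing groups idea concrete, here is a minimal sketch of how .NET can match one span together with any spans nested inside it. The class name, the helper `FirstBalancedSpan`, and the group name `Depth` are made up for illustration, and the pattern assumes well-formed lowercase span tags.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BalancedSpanSketch
{
    // Hypothetical helper: returns the first <span>...</span> whose nested
    // spans are balanced, or null if there is none.
    public static string FirstBalancedSpan(string html)
    {
        string pattern = @"
            <span\b[^>]*>                  # opening tag of the outer span
            (?:
                <span\b[^>]*> (?<Depth>)   # nested open tag: push onto the Depth stack
              | </span>       (?<-Depth>)  # nested close tag: pop, fails when the stack is empty
              | (?!</?span\b) .            # any character that does not start a span tag
            )*
            (?(Depth)(?!))                 # fail unless every nested open tag was closed
            </span>                        # closing tag of the outer span";

        Match m = Regex.Match(html, pattern,
            RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline);
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        Console.WriteLine(FirstBalancedSpan(
            "before <span a=\"1\">one <span>two</span> three</span> after"));
    }
}
```

This only finds the outer element; replacing the whitespace inside it would still need a second step on the matched text.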

moonshadow
+1  A: 

This could potentially be very slow with very large strings.

But this works:

(?<=\<span[^>]*>[^<]+)\s(?=[^<]+\</span>)

With a replacement string of:

&nbsp;

The reason I say it might be slow is that it's having to find the whitespace (\s) and then search towards the left and to the right to see if it's surrounded by a span tag. And it'll have to do the same thing for every character of whitespace individually. But I believe this should work reliably as long as your HTML is well-formed and you aren't dealing with nested span tags.

And by the way, since this is for .NET you can use Regex Hero to build the code for you:

// Verbatim string (@"...") so the regex backslashes are not read as C# escape sequences
string strRegex = @"(?<=\<span[^>]*>[^<]+)\s(?=[^<]+\</span>)";
RegexOptions myRegexOptions = RegexOptions.None;
Regex myRegex = new Regex(strRegex, myRegexOptions);
string strTargetString = "<span someattr=\"a\">and some words with spaces</span>";
string strReplace = "&nbsp;";

return myRegex.Replace(strTargetString, strReplace);
Steve Wortham
This won't actually work. If there are two spans in the string, it will match any spaces between the two spans as well as any spaces within either span.
Paul Williams
Ah, good point. I just fixed that. This still isn't a perfect solution. Regex in general is a little too limited for a task this complex.
Steve Wortham
this is perfect, thank you!
andryuha
This will only work for span elements which contain no other xml. If I understand the question correctly, you want to be able to find the spaces even if there are span (or other) elements within an outer span element.
Paul Williams
I finally figured out how to find all spaces in <span> elements regardless of any other elements in the area. See my answer below if you are interested.
Paul Williams
A: 

How about this? It matches each element, then uses a match evaluator to rebuild the tag and replace the spaces in the captured text with &nbsp;:

string html = @"<span someattr=""a"">and some words with spaces</span>";
// Lazy .*? so that each element is matched separately when the input holds more than one
string pattern = @"<(?<tag>\w*)(?<attributes>[^>]+)?>(?<text>.*?)</\k<tag>>";
string result = Regex.Replace(html, pattern,
       m => String.Format("<{0}{1}>{2}</{0}>",
        m.Groups["tag"].Value,
        m.Groups["attributes"].Value,
        m.Groups["text"].Value.Replace(" ", "&nbsp;")
        )
       );

Result = <span someattr="a">and&nbsp;some&nbsp;words&nbsp;with&nbsp;spaces</span>

Things will get complicated quickly if you have nested span tags, however.

EDIT: reconstructed tag and attributes, added string format to tidy things up

Ahmad Mageed
Note that this pattern would apply to any tag, not just a span tag. If you want just span tags you would have to hardcode that in instead of using the named "tag" capture group and its backreference.
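A minimal sketch of the span-only variant, assuming non-nested spans whose content contains no other tags with spaces in them. The class and method names are hypothetical; the lazy `.*?` and `RegexOptions.Singleline` are additions so that several spans, or a span broken across lines, are each handled separately.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SpanOnlyExample
{
    // Hypothetical helper: replaces spaces in the text of each (non-nested) span.
    public static string ReplaceSpacesInSpans(string html)
    {
        // "span" is hardcoded, so the tag group and its backreference are gone.
        string pattern = @"<span\b(?<attributes>[^>]*)>(?<text>.*?)</span>";

        return Regex.Replace(html, pattern,
            m => String.Format("<span{0}>{1}</span>",
                m.Groups["attributes"].Value,
                m.Groups["text"].Value.Replace(" ", "&nbsp;")),
            RegexOptions.Singleline);
    }

    public static void Main()
    {
        Console.WriteLine(ReplaceSpacesInSpans(
            "<p>a b</p><span someattr=\"a\">and some words</span>"));
    }
}
```

Spaces in other elements, such as the `<p>` above, are left alone.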
Ahmad Mageed
A: 

Semi-related: while looking for a solution to this, I found a PHP-based article on Perl-style regular expressions that may or may not be helpful for .NET:

http://www.thatsquality.com/articles/how-to-match-and-replace-content-between-two-html-tags-using-regular-expressions

JYelton
+1  A: 

Replace all occurrences of the following with "&nbsp;":

(?<=<span\b[^>]*>(?:(?!</?span\b).)*(?(ReverseDepth)(?!))(?:(?:<span\b[^>]*>(?<-ReverseDepth>)|</span>(?<ReverseDepth>))(?:(?!</?span\b).)*)*)\u0020(?![^<]*>)

This should work for any depth of span elements, no matter what other elements are present. Note that this will only work with .NET regular expressions.

This regex is very finicky. Be careful if you try to change anything.
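For reference, a minimal sketch of how the pattern could be wired up in C#. The class and method names are hypothetical; the pattern itself is copied unchanged and kept in a verbatim string so that `\b` and `\u0020` reach the regex engine as written.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NestedSpanSpaces
{
    // Pattern copied verbatim from above; only the C# wrapping is new.
    private static readonly Regex SpaceInsideSpan = new Regex(
        @"(?<=<span\b[^>]*>(?:(?!</?span\b).)*(?(ReverseDepth)(?!))(?:(?:<span\b[^>]*>(?<-ReverseDepth>)|</span>(?<ReverseDepth>))(?:(?!</?span\b).)*)*)\u0020(?![^<]*>)");

    // Hypothetical helper: replaces every space that sits inside a span.
    public static string Replace(string html)
    {
        return SpaceInsideSpan.Replace(html, "&nbsp;");
    }

    public static void Main()
    {
        Console.WriteLine(Replace("<span>a <span>b c</span> d</span> e"));
    }
}
```

Since `.` does not match newlines by default, a span whose content runs over several lines would probably also need `RegexOptions.Singleline`; that is an untested assumption.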

Thanks to moonshadow for pointing out the fancy open-close matching syntax in .NET regexes.

Paul Williams
A: 

This appears to work, but I'd definitely do some serious unit testing (and code cleanup) first. This is based on section 3.17 of the Regular Expressions Cookbook combined with a library snippet from RegexBuddy. (NOTE: This will not work with nested span tags.)

using System;
using System.Text.RegularExpressions;

public class MyClass
{
    private static Regex outerRegex = new Regex("(?<=<span[^>]*>).*?(?=</span>)",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    private static Regex innerRegex = new Regex(@"\s");

    public static void Main()
    {
        string subjectString = "my dog has <span someattr=\"a\">" +
            "and some words with spaces</span> fleas" +
            "<frog>space z</frog> <span> </span>";   

        string resultString = outerRegex.Replace(subjectString,
            new MatchEvaluator(ComputeReplacement));

        Console.WriteLine(resultString);
    }

    public static string ComputeReplacement(Match matchResult)
    {
        // Run the inner search-and-replace on each match of the outer regex
        // (the string was not getting escaped so I broke it up)
        return innerRegex.Replace(matchResult.Value, "&" + "nbsp;");
    }
}
TrueWill
Note - this won't work with nested span tags. Good catch, Ahmad!
TrueWill