Been trying to solve this for a while now.

I need a regex to strip the newlines, tabs, and spaces between the HTML tags, as in the example below:

Source:

<html>
   <head>
     <title>
           Some title
       </title>
    </head>
</html>

Wanted result:

<html><head><title>Some title</title></head></html>

Trimming the whitespace before "Some title" is optional. I'd be grateful for any help.

A: 

s/[^\w\/\d<>]+//gs

A: 

s/>\s+</></gs
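For comparison, here is a minimal Python sketch of the same substitution. Python is used only for illustration, and `source` is simply the example from the question. Because the pattern needs a `>` on the left and a `<` on the right, it only removes whitespace that sits directly between two tags, so the padding around "Some title" is left in place.

```python
import re

# The sample document from the question.
source = """<html>
   <head>
     <title>
           Some title
       </title>
    </head>
</html>"""

# \s already matches newlines, so no extra flag is needed here.
collapsed = re.sub(r">\s+<", "><", source)
print(collapsed)
```

The text node keeps its surrounding newlines and indentation, which the question says is acceptable.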

JSBangs
+18  A: 

If the HTML is strict enough to be well-formed XML, load it with an XML reader and write it back out without formatting. That will preserve the whitespace within the text of an element, but not between tags.
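A rough sketch of this idea in Python, assuming the input is well-formed XML (ordinary, non-strict HTML will make the parser raise an error). It uses the standard library's `xml.etree.ElementTree`; the function name `strip_between_tags` is made up for this example. Whitespace-only text between tags is dropped, while text that has real content is left untouched.

```python
import xml.etree.ElementTree as ET

def strip_between_tags(xml_text):
    """Drop whitespace-only text nodes; keep everything else as-is."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        # element.text is the text right after the opening tag.
        if element.text is not None and not element.text.strip():
            element.text = None
        # element.tail is the text right after the closing tag.
        if element.tail is not None and not element.tail.strip():
            element.tail = None
    return ET.tostring(root, encoding="unicode")

source = """<html>
   <head>
     <title>
           Some title
       </title>
    </head>
</html>"""

result = strip_between_tags(source)
print(result)
```

Note that the padding inside `<title>` survives, which matches the "within, but not between" behaviour described above.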

Welbog
Not to mention it doesn't reinvent the wheel.
Pesto
Not a bad idea...
Tim
That might depend on the schema. Preservation of whitespace inside tags is a specific attribute in schema definitions.
Jherico
This. Trying to parse XML, HTML, or other context-free languages with a regular expression is impossible to do 100% correctly.
Stuart Branham
+1  A: 

s/\s*(<[^>]+>)\s*/\1/gs

or, in C#:

Regex.Replace(html, @"\s*(<[^>]+>)\s*", "$1", RegexOptions.Singleline);
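The same pattern can be tried out quickly in Python; this is only an illustrative sketch, and the helper name `strip_around_tags` is invented for it. The second call shows what happens when the text contains a bare `<` and `>` that are not tags: the pattern treats everything from `<` to the next `>` as if it were a tag.

```python
import re

# Optional whitespace, something that looks like a tag, optional whitespace.
TAG_WITH_PADDING = re.compile(r"\s*(<[^>]+>)\s*")

def strip_around_tags(html):
    """Keep each tag-like match but drop the whitespace on both sides."""
    return TAG_WITH_PADDING.sub(r"\1", html)

source = ("<html>\n   <head>\n     <title>\n"
          "           Some title\n       </title>\n    </head>\n</html>")

print(strip_around_tags(source))
# <html><head><title>Some title</title></head></html>

print(strip_around_tags("if a < 3 and b > 4"))
# if a< 3 and b >4
```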

ʞɔıu
The first character cannot be a space, or a valid HTML string like "if a < 3 and b > 4" would be deleted with your expression.
Yann Schwartz
And you don't match ending tags </xxx> either.
Yann Schwartz
Ok my bad. I did not read this right.
Yann Schwartz
Your first point isn't wrong, though. That'll change "if a < 3 and b > 4" to "if a< 3 and b >4", which is probably OK if that's script, but probably not desirable if it's, say, the text of an article about using whitespace for readability.
Robert Rossney
Yeah, using <[^>]+> to match all HTML tag innards has a number of edge cases. There are more complete patterns that could be used instead of that subpattern, but this demonstrates the basic idea.
ʞɔıu
A: 

This removes the whitespace between tags and the space between the tags and the text.

s/(\s*(<))|((>)\s*)/\2\4/g
Bran Handley
+1  A: 

\d does not match only [0-9] in Perl 5.8 and 5.10; it matches any Unicode character that has the digit attribute (including "\x{1815}" and "\x{FF15}"). If you mean [0-9], you must either write [0-9] or use the bytes pragma (but that turns all strings into 1-byte characters and is normally not what you want).
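The answer above is about Perl, but the same distinction can be observed in Python 3, where `\d` in a `str` pattern also matches Unicode decimal digits unless `re.ASCII` is set. The sketch below is only an analogy in a different language, not a statement about Perl's internals; `digit_matches` is a hypothetical helper.

```python
import re

def digit_matches(ch):
    """Return (matches \\d, matches [0-9], matches \\d under re.ASCII)."""
    return (
        bool(re.fullmatch(r"\d", ch)),
        bool(re.fullmatch(r"[0-9]", ch)),
        bool(re.fullmatch(r"\d", ch, re.ASCII)),
    )

print(digit_matches("5"))       # (True, True, True)
print(digit_matches("\uFF15"))  # FULLWIDTH DIGIT FIVE: (True, False, False)
print(digit_matches("\u1815"))  # MONGOLIAN DIGIT FIVE: (True, False, False)
```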

Regexes are fundamentally bad at parsing HTML (see "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for why). What you need is an HTML parser. See "Can you provide an example of parsing HTML with your favorite parser?" for examples using a variety of parsers.

You may find the HTMLAgilityPack answer helpful.

Chas. Owens
A: 

I wanted to preserve the newlines, since removing them was messing up my HTML, so I went with the following. Check out the results on perfode.com.

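// Requires: using System.Text.RegularExpressions;
private static string ProcessHTMLFile(string input)
{
    // Remove runs of two spaces (indentation); lone spaces and newlines are kept.
    string opt = Regex.Replace(input, @"(  )+", "");
    // Remove tab characters.
    opt = Regex.Replace(opt, @"\t+", "");
    return opt;
}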
Shash
A: 
Regex.Replace(input, "<[^>]*>", String.Empty);
dankyy1