ansaurus

Question

Using regular expression to trim html

Answer 1

A:

s/[^\w\/\d<>]+//gs

2009-06-02 17:56:13

Answer 2

A:

s/>\s+</></gs
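
For illustration, here is the same substitution as a minimal Python sketch. Python and the helper name `collapse_between_tags` are assumptions made for the example, not part of the original answer. It only removes whitespace that sits directly between a `>` and the next `<`, so text content is left alone.

```python
import re

# Whitespace (including newlines) that sits directly between two tags.
BETWEEN_TAGS = re.compile(r">\s+<")


def collapse_between_tags(html: str) -> str:
    """Remove whitespace between a closing '>' and the following '<'."""
    return BETWEEN_TAGS.sub("><", html)


print(collapse_between_tags("<ul>\n  <li>a b</li>\n  <li>c</li>\n</ul>"))
# prints: <ul><li>a b</li><li>c</li></ul>
```

Note that whitespace between a tag and adjacent text (for example `<p> text </p>`) is not touched by this pattern.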

JSBangs 2009-06-02 17:58:02

Answer 3

+18 A:

If the HTML is strict, load it with an XML reader and write it back without formatting. That will preserve the whitespace within tags, but not between them.
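
One way to sketch this idea, assuming the markup is well-formed XML (XHTML) and using Python's standard `xml.etree.ElementTree` rather than whatever XML reader the answer had in mind. The function name `strip_insignificant_whitespace` is hypothetical. ElementTree keeps whitespace-only text nodes when parsing, so this sketch drops them explicitly before writing the tree back out.

```python
import xml.etree.ElementTree as ET


def strip_insignificant_whitespace(xml_text: str) -> str:
    """Parse well-formed markup and re-serialize it without
    whitespace-only text between elements."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        # elem.text is the text right after the opening tag,
        # elem.tail is the text right after the closing tag.
        if elem.text is not None and not elem.text.strip():
            elem.text = None
        if elem.tail is not None and not elem.tail.strip():
            elem.tail = None
    return ET.tostring(root, encoding="unicode")


source = "<div>\n  <p>Hello   world</p>\n  <p> a <b>b</b> c </p>\n</div>"
print(strip_insignificant_whitespace(source))
# prints: <div><p>Hello   world</p><p> a <b>b</b> c </p></div>
```

This is a crude approximation: it ignores `xml:space` and would also empty an element that contains only whitespace, so it is not safe for content such as `<pre>` blocks.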

Welbog 2009-06-02 17:58:20

Not to mention it doesn't reinvent the wheel.

Pesto 2009-06-02 18:00:46

Not a bad idea...

Tim 2009-06-02 18:01:02

That might depend on the schema. Preservation of whitespace inside tags is a specific attribute in schema definitions.

Jherico 2009-06-02 18:19:46

This. Parsing XML/HTML/other CFLs with a regular expression cannot be done 100% correctly.

Stuart Branham 2010-06-17 06:25:38

Answer 4

+1 A:

s/\s*(<[^>]+>)\s*/\1/gs

or, in C#:

Regex.Replace(html, @"\s*(<[^>]+>)\s*", "$1", RegexOptions.Singleline);

ʞɔıu 2009-06-02 17:58:30

The first character cannot be a space; otherwise a valid HTML string like "if a < 3 and b > 4" would be deleted by your expression.

Yann Schwartz 2009-06-02 18:45:03

And you don't match ending tags </xxx> either.

Yann Schwartz 2009-06-02 18:46:41

OK, my bad. I did not read this right.

Yann Schwartz 2009-06-02 18:48:01

Your first point isn't wrong, though. That'll change "if a < 3 and b > 4" to "if a< 3 and b >4" (the span "< 3 and b >" is matched as if it were a tag, and the whitespace around it is dropped), which is probably OK if that's script, but probably not desirable if it's, say, the text of an article about using whitespace for readability.

Robert Rossney 2009-06-02 20:07:56

Yeah, using <[^>]+> to match all HTML tag innards has a number of edge cases. There are more complete patterns that could be used instead of that subpattern, but this demonstrates the basic idea.
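
To make the edge case concrete, here is a small Python sketch of the pattern from this answer (Python and the name `trim_around_tags` are assumptions for the example). It shows both the intended behavior and what happens to plain text containing `<` and `>`.

```python
import re

# Same pattern as in the answer: optional whitespace, a "tag", optional whitespace.
TAG_WITH_PADDING = re.compile(r"\s*(<[^>]+>)\s*")


def trim_around_tags(html: str) -> str:
    """Drop whitespace on both sides of anything that looks like a tag."""
    return TAG_WITH_PADDING.sub(r"\1", html)


print(trim_around_tags("<p> Hello </p>\n<p>bye</p>"))
# prints: <p>Hello</p><p>bye</p>

# "< 3 and b >" looks like a tag to this pattern, so the text is altered.
print(trim_around_tags("if a < 3 and b > 4"))
# prints: if a< 3 and b >4
```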

ʞɔıu 2009-06-02 21:20:13

Answer 5

A:

This removes the whitespace between tags and the space between the tags and the text.

s/(\s*(<))|((>)\s*)/\2\4/g
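
A rough Python equivalent, for illustration only (the name `trim_tag_edges` is hypothetical, and the pattern is simplified to avoid the unused capture groups). It strips whitespace before each `<` and after each `>`.

```python
import re

# Whitespace before '<' or after '>'.
TAG_EDGE = re.compile(r"\s*<|>\s*")


def trim_tag_edges(html: str) -> str:
    """Remove whitespace that precedes '<' or follows '>'."""
    return TAG_EDGE.sub(lambda m: m.group().strip(), html)


print(trim_tag_edges("<p> Hello <b> world </b> </p>"))
# prints: <p>Hello<b>world</b></p>
```

As the output shows, words on either side of an inline tag end up joined ("Hello" and "world" are no longer separated by a space when rendered), which may or may not be acceptable.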

Bran Handley 2009-06-02 19:18:46

Answer 6

+1 A:

\d does not match only [0-9] in Perl 5.8 and 5.10; it matches any Unicode character that has the digit attribute (including "\x{1815}" and "\x{FF15}"). If you mean [0-9], you must either write [0-9] or use the bytes pragma (but that turns all strings into 1-byte characters and is normally not what you want).
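
The answer above is about Perl, but the same behavior can be observed in Python 3, which is used here only as a convenient way to demonstrate it. The helper names are hypothetical.

```python
import re

FULLWIDTH_FIVE = "\uFF15"   # FULLWIDTH DIGIT FIVE
MONGOLIAN_FIVE = "\u1815"   # MONGOLIAN DIGIT FIVE


def matches_backslash_d(ch: str) -> bool:
    """True if \\d matches; for str patterns this includes Unicode digits."""
    return re.fullmatch(r"\d", ch) is not None


def matches_ascii_digit(ch: str) -> bool:
    """True only for the ten ASCII digits."""
    return re.fullmatch(r"[0-9]", ch) is not None


for ch in ("5", FULLWIDTH_FIVE, MONGOLIAN_FIVE):
    print(repr(ch), matches_backslash_d(ch), matches_ascii_digit(ch))
```

In Python, passing `re.ASCII` restricts `\d` to `[0-9]`; that flag is Python-specific and is not what the Perl answer describes.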

Regexes are fundamentally bad at parsing HTML (see "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for why). What you need is an HTML parser. See "Can you provide an example of parsing HTML with your favorite parser?" for examples using a variety of parsers.

You may find the HTMLAgilityPack answer helpful.
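
As a minimal sketch of the parser-based approach, here is one possible version using Python's standard `html.parser`. This is not HTMLAgilityPack and not taken from the linked answers. The class and function names are hypothetical, and the set of elements whose whitespace is preserved is an assumption. The parser re-emits the markup it sees and skips text nodes that contain only whitespace.

```python
from html.parser import HTMLParser


class WhitespaceTrimmer(HTMLParser):
    """Re-emit HTML, dropping whitespace-only text outside preserved elements."""

    # Assumed list of elements whose whitespace should be kept as-is.
    PRESERVE = {"pre", "textarea", "script", "style"}

    def __init__(self):
        # Keep entities such as &amp; untouched instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.preserve_depth = 0

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())
        if tag in self.PRESERVE:
            self.preserve_depth += 1

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")
        if tag in self.PRESERVE and self.preserve_depth:
            self.preserve_depth -= 1

    def handle_data(self, data):
        # Keep text if it has visible content or sits inside a preserved element.
        if self.preserve_depth or data.strip():
            self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def trim_html(html: str) -> str:
    parser = WhitespaceTrimmer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


page = "<ul>\n  <li>one &amp; two</li>\n  <li><pre> keep\n this </pre></li>\n</ul>\n"
print(trim_html(page))
```

Unlike the regex versions, this leaves the contents of `<pre>` alone. It still drops a lone space between inline elements, so it is a starting point rather than a drop-in minifier.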

Chas. Owens 2009-06-02 21:53:03

Answer 7

A:

I wanted to preserve the newlines, since removing them was messing up my HTML. So I went with the following. Check out the results on perfode.com.

private static string ProcessHTMLFile(string input)
{
    // Remove runs of two-space pairs (an odd leftover space is kept); newlines are untouched.
    string opt = Regex.Replace(input, @"(  )+", "");
    // Remove tab characters.
    opt = Regex.Replace(opt, @"\t+", "");
    return opt;
}

Shash 2010-06-14 05:00:27

Answer 8

A:

Regex.Replace(input, "<[^>]*>", String.Empty);

dankyy1 2010-06-17 06:18:47
