ansaurus

Question

What manner of regular expression might I use to add line breaks near HTML tags?

Answer 1

+11 A:

Do not use regular expressions to parse HTML. HTML is not regular, and therefore regex is not at all suited to parsing it. Use an HTML or XML parser instead. There are many (HT|X)ML parsers available online. What language are you using?

You're not going to be able to create a regular expression that matches HTML in general. Regular expressions recognize regular languages, and HTML, with its arbitrarily nested tags, belongs to a larger class of languages. Any regex you try to write will be hard to understand and incorrect for some inputs.

Use something like XPath instead.

EDIT: You're using C#. Luckily you have the entire System.Xml namespace available to you. There are also libraries for parsing HTML specifically, in case your HTML is not well-formed XML.
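As a rough illustration of the System.Xml plus XPath route, here is a minimal sketch. It assumes the input is well-formed XHTML; the class name `XPathSketch` and the helper `SelectText` are made up for this example and are not part of any library.

```csharp
using System;
using System.Xml;

public static class XPathSketch
{
    // Hypothetical helper: returns the inner text of every node the XPath query selects.
    public static string[] SelectText(string xhtml, string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml); // throws XmlException if the markup is not well-formed XML
        XmlNodeList nodes = doc.SelectNodes(xpath);
        var result = new string[nodes.Count];
        for (int i = 0; i < nodes.Count; i++)
        {
            result[i] = nodes[i].InnerText;
        }
        return result;
    }
}
```

Calling `XPathSketch.SelectText("<p><b>one</b> and <b>two</b></p>", "//b")` yields the two strings `one` and `two`. The parser does the tag matching here, so nesting is handled without any pattern tricks.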

Welbog 2009-07-02 16:26:12

I am using C# language!

azamsharp 2009-07-02 16:26:43

@azamsharp: If you're using C# then check out the HTML Agility Pack: http://www.codeplex.com/htmlagilitypack

LukeH 2009-07-02 16:28:51

Answer 2

A:

If your regular expression engine supports backreferences, you can use <(.*?)>.*?</\1>. This works in Perl.

Beano 2009-07-02 16:26:22

I really don't understand these downvotes. The question was "how do I change the regex I have to do something else". I understand that, from an implementation point of view, regular expressions are not the way to parse HTML, but on many occasions learning is about demonstrating ideas outside of a realistic context. I answered the question. If you disagree, just don't vote it up.

Beano 2009-07-02 20:35:12

Answer 3

A:

HTML tags are some of the biggest pains for regex. You have to be careful, because simply matching the first and last tag won't be enough if you have more than one tag on the same line or, depending on how you evaluate it, anywhere in the string you're evaluating.

Here is a decent expression you can use...

@"<(?<tag>\w*)>(?<text>.*)</\k<tag>>"

You will have named groups tag and text that you can use to access the matched values. With those values you can format your output. Depending on your language, you may need to specify single-line mode so that the dot also matches newlines and the search treats the entire string as one line.
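Since the expression above uses .NET syntax, reading the named groups might look like the following sketch. The class `NamedGroupSketch` and the helper `BreakAroundTag` are hypothetical names chosen for this example; `RegexOptions.Singleline` is the .NET option that makes the dot match newlines.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedGroupSketch
{
    // The pattern from the answer; Singleline makes '.' match newlines too.
    private static readonly Regex TagPattern = new Regex(
        @"<(?<tag>\w*)>(?<text>.*)</\k<tag>>", RegexOptions.Singleline);

    // Hypothetical helper: puts the opening tag, the text, and the closing tag
    // on separate lines for each match.
    public static string BreakAroundTag(string input)
    {
        return TagPattern.Replace(input, m =>
            "<" + m.Groups["tag"].Value + ">\n" +
            m.Groups["text"].Value + "\n" +
            "</" + m.Groups["tag"].Value + ">");
    }
}
```

For the input `<b>hello</b>` this produces `<b>`, `hello` and `</b>` on three lines. Note that `.*` is greedy, so two sibling tags with the same name, such as `<b>one</b><b>two</b>`, come back as a single match spanning both.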

Hugoware 2009-07-02 16:26:33

The above expression only gave one big search result which was the complete string passed to it.

azamsharp 2009-07-02 16:35:53

Ah, well your example showed a single tag. That changes what the answer would have been.

Hugoware 2009-07-02 16:46:27

Answer 4

+2 A:

If the input is XHTML, then it's also legal XML, so you can do all this with some simple XSLT.
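A minimal sketch of that idea from C#, assuming the input really is well-formed XHTML: an identity stylesheet with `indent="yes"` is run through `XslCompiledTransform`, and the serializer is left to place the line breaks. The class name `XsltSketch` and the method `Reindent` are invented for this example, and the exact whitespace produced is up to the serializer.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class XsltSketch
{
    // Identity transform; indent='yes' asks the serializer to add line breaks and indentation.
    private const string Stylesheet = @"<xsl:stylesheet version='1.0'
    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <xsl:output method='xml' indent='yes' omit-xml-declaration='yes'/>
  <xsl:template match='@*|node()'>
    <xsl:copy>
      <xsl:apply-templates select='@*|node()'/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>";

    // Hypothetical helper: copies the document unchanged, letting the output
    // settings of the stylesheet decide where line breaks go.
    public static string Reindent(string xhtml)
    {
        var transform = new XslCompiledTransform();
        using (XmlReader styleReader = XmlReader.Create(new StringReader(Stylesheet)))
        {
            transform.Load(styleReader);
        }

        var output = new StringWriter();
        using (XmlReader input = XmlReader.Create(new StringReader(xhtml)))
        {
            transform.Transform(input, null, output);
        }
        return output.ToString();
    }
}
```

A call such as `XsltSketch.Reindent("<div><p>one</p><p>two</p></div>")` should return the same elements spread over several lines. Elements with mixed text and child content may be left unindented by the serializer, so treat this as a starting point.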

Steven Sudit 2009-07-02 16:32:04

Answer 5

A:

The best solution is to use a real HTML parser. But this regex will work for the problem you mentioned.

<([^>]+?)>(.+)</\1>

You will find the content of the matched tag in backreference 2 (the tag name itself is in backreference 1). It works fairly well with nested tags too.

StrongStrong 2 will give you StrongStrong 2 in backreference 2.

I don't know the .NET regex API, so you have to figure out how to use it yourself.
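For readers on .NET, one way to read the two groups might look like this sketch. The class `BackreferenceSketch` and the helper `InnerContent` are hypothetical, and the sample input with nested strong tags is an assumption made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BackreferenceSketch
{
    // Hypothetical helper: returns the text captured by group 2,
    // or null when the pattern does not match at all.
    public static string InnerContent(string input)
    {
        Match m = Regex.Match(input, @"<([^>]+?)>(.+)</\1>");
        return m.Success ? m.Groups[2].Value : null;
    }
}
```

With the assumed input `<strong>Strong<strong>Strong 2</strong></strong>`, group 1 is `strong` and `InnerContent` returns `Strong<strong>Strong 2</strong>`, because the greedy `.+` runs to the last closing tag with the same name.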

Imran 2009-07-02 16:40:20

Answer 6

+1 A:

I second the advice not to use regular expressions; HTML can't be properly expressed as a regular language.

Better to investigate System.Xml.XmlReader and System.Web.UI.HtmlTextWriter. You should be able to write a function that reads each node from a reader and then writes it to a writer; something along the lines of

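    using System;
    using System.IO;
    using System.Web.UI;   // HtmlTextWriter lives here (System.Web.dll)
    using System.Xml;

    public static string HtmlReformat(string html)
    {
        var sw = new StringWriter();
        var htmlWriter = new HtmlTextWriter(sw);

        // XmlReader only accepts well-formed XML, so the input must be XHTML.
        XmlReader rdr = XmlReader.Create(new StringReader(html));

        while (rdr.Read())
        {
            switch (rdr.NodeType)
            {
                case XmlNodeType.EndElement:
                    htmlWriter.WriteEndTag(rdr.Name);
                    htmlWriter.Write(Environment.NewLine);
                    break;
                case XmlNodeType.Element:
                    // Read this before moving the reader onto the attributes.
                    bool isEmpty = rdr.IsEmptyElement;
                    htmlWriter.WriteBeginTag(rdr.Name);
                    for (int attributeIdx = 0; attributeIdx < rdr.AttributeCount; attributeIdx++)
                    {
                        // After MoveToAttribute, Name and Value describe the attribute.
                        rdr.MoveToAttribute(attributeIdx);
                        htmlWriter.WriteAttribute(rdr.Name, rdr.Value, true);
                    }
                    rdr.MoveToElement();
                    // An empty element such as <br/> produces no EndElement node.
                    htmlWriter.Write(isEmpty ? HtmlTextWriter.SelfClosingTagEnd : ">");
                    htmlWriter.Write(Environment.NewLine);
                    break;
                case XmlNodeType.Text:
                    // The reader has decoded entities, so encode them again on output.
                    htmlWriter.WriteEncodedText(rdr.Value);
                    break;
                default:
                    throw new NotImplementedException("Handle " + rdr.NodeType);
            }
        }
        return sw.ToString();
    }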

This should give you a base to work from, anyway.

Steve Cooper 2009-07-02 17:05:39
