I have the following regular expression which is used to give me the tags in the HTML string:

<[^>]*>

So, if I pass in the following:

<b> Bold </b>

Then it will give me:

<b>
</b>

How can I make it give me:

<b>
Bold
</b>

UPDATE:

Here is another example to get the big picture:

If this is the text:

<b>Bold</b> This is the stuff <i>Italic</i>

then the final result would be the following:

matches[0] = <b>
matches[1] = Bold
matches[2] = </b>
matches[3] = This is the stuff
matches[4] = <i>
matches[5] = Italic 
matches[6] = </i>
+11  A: 

Do not use regular expressions to parse HTML. HTML is not regular, and therefore regex is not at all suited to parsing it. Use an HTML or XML parser instead. There are many (HT|X)ML parsers available online. What language are you using?

You're not going to be able to create a regular expression that matches HTML correctly. Regular expressions recognize regular languages, a strictly smaller class than the one HTML belongs to, because HTML allows arbitrarily nested elements. Any regex you try to write will be hard to understand and incorrect.

Use something like XPath instead.

EDIT: You're using C#. Luckily you have an entire System.Xml namespace available to you. Also, there are other libraries for parsing HTML specifically if your HTML is not strict.
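
As a rough illustration of the XPath route with System.Xml, here is a minimal sketch. It assumes the input is well-formed XHTML; the class name, the method name, and the wrapping `<root>` element are made up for the example, and real-world HTML that is not well-formed would make `LoadXml` throw.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class XPathDemo
{
    // Returns the text of every <b> element in a well-formed XHTML fragment.
    public static List<string> BoldTexts(string xhtml)
    {
        var doc = new XmlDocument();
        // LoadXml needs exactly one root element, so the fragment is wrapped
        // in a hypothetical <root> element first.
        doc.LoadXml("<root>" + xhtml + "</root>");

        var results = new List<string>();
        // "//b" selects every <b> element at any depth.
        foreach (XmlNode node in doc.SelectNodes("//b"))
        {
            results.Add(node.InnerText);
        }
        return results;
    }
}
```

Calling `XPathDemo.BoldTexts("<b>Bold</b> This is the stuff <i>Italic</i>")` yields a list containing only `Bold`.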

Welbog
I am using C#!
azamsharp
@azamsharp: If you're using C# then check out the HTML Agility Pack: http://www.codeplex.com/htmlagilitypack
LukeH
A: 

If your regular expression engine supports backreferences, you can use <(.*?)>.*?</\1>. This works in Perl.
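
.NET supports backreferences as well, so the same pattern can be tried from C#. This is only a sketch (the class and method names are invented for illustration), and note that it matches each whole element rather than splitting tags from text.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BackreferenceDemo
{
    // Returns every substring matched by the backreference pattern.
    public static string[] MatchElements(string html)
    {
        // \1 must repeat whatever group 1 captured,
        // so "<b>" can only be closed by "</b>".
        var pattern = new Regex(@"<(.*?)>.*?</\1>");
        MatchCollection found = pattern.Matches(html);

        var results = new string[found.Count];
        for (int i = 0; i < found.Count; i++)
        {
            results[i] = found[i].Value;
        }
        return results;
    }
}
```

For the text `<b>Bold</b> This is the stuff <i>Italic</i>` this returns two matches, `<b>Bold</b>` and `<i>Italic</i>`; the text between the elements is not captured.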

Beano
I really don't understand these downvotes. The question was "how do I change the regex I have to do something else". I understand that, from an implementation viewpoint, regular expressions are not the way to parse HTML, but on many occasions learning is about demonstrating ideas outside of a realistic context. I answered the question; if you disagree, just don't vote it up.
Beano
A: 

HTML tags are some of the biggest pains for regex. You have to be careful, because simply matching the first and last tag won't be enough if you have more than one tag on the same line or, depending on how you evaluate it, anywhere in the string you're evaluating.

Here is a decent expression you can use...

@"<(?<tag>\w*)>(?<text>.*)</\k<tag>>"

You will have named groups tag and text that you can use to access the values you have. With those values you can format your output. Depending on your language, you may need to specify that you want to search the entire string as a single line.
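
A small sketch of reading those named groups in C#, assuming the pattern above. The class and method names are hypothetical; `RegexOptions.Singleline` is the .NET option that makes `.` match newlines, which is the "single line" setting mentioned.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedGroupDemo
{
    // Returns { tag, text } for the first element found, or null if nothing matches.
    public static string[] FirstElement(string html)
    {
        // Singleline lets '.' match newlines, so an element may span several lines.
        Match m = Regex.Match(
            html,
            @"<(?<tag>\w*)>(?<text>.*)</\k<tag>>",
            RegexOptions.Singleline);

        if (!m.Success)
        {
            return null;
        }
        return new[] { m.Groups["tag"].Value, m.Groups["text"].Value };
    }
}
```

Because `.*` is greedy, the text group runs to the last matching closing tag: for `<b>one</b> and <b>two</b>` it captures `one</b> and <b>two`, so input with repeated tags comes back as one large match.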

Hugoware
The above expression only gave one big search result which was the complete string passed to it.
azamsharp
Ah, well your example showed a single tag. That changes what the answer would have been.
Hugoware
+2  A: 

If the input is XHTML, then it's also legal XML, so you can do all this with some simple XSLT.

Steven Sudit
A: 

The best solution is to use a real HTML parser. But this regex will work for the problem you mentioned.

<([^>]+?)>(.+)</\1>

You will find the content of <b> (or other tag) in backreference 2. It works fairly well with nested tags too.

<b><i>Strong<b>Strong 2</b></i></b> will give you <i>Strong<b>Strong 2</b></i> in backreference 2.

I don't know the .NET regex API, so you have to figure out how to use it yourself.
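
For what it's worth, one way this might look with the .NET `Regex` class is sketched below. The class and method names are made up for the example, and the parser caveat above still applies.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ContentDemo
{
    // Returns the content of the first matched element (group 2), or null if none.
    public static string InnerContent(string html)
    {
        // Group 1 is the tag name, \1 requires the same name in the closing tag,
        // and group 2 is everything in between.
        Match m = Regex.Match(html, @"<([^>]+?)>(.+)</\1>");
        return m.Success ? m.Groups[2].Value : null;
    }
}
```

`ContentDemo.InnerContent("<b><i>Strong<b>Strong 2</b></i></b>")` returns `<i>Strong<b>Strong 2</b></i>`, the same result as described for backreference 2.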

Imran
+1  A: 

I second the advice not to use regular expressions; HTML can't be properly expressed using a regular language.

Better to investigate System.Xml.XmlReader and System.Web.UI.HtmlTextWriter. You should be able to write a function that reads an element from a reader and then writes it to a writer; something along the lines of

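    using System;
    using System.IO;
    using System.Web.UI;   // HtmlTextWriter
    using System.Xml;

    public static string HtmlReformat(string html)
    {
        var sw = new StringWriter();
        var htmlWriter = new HtmlTextWriter(sw);

        // Fragment conformance allows several top-level nodes,
        // e.g. "<b>Bold</b> This is the stuff <i>Italic</i>".
        var settings = new XmlReaderSettings { ConformanceLevel = ConformanceLevel.Fragment };
        XmlReader rdr = XmlReader.Create(new StringReader(html), settings);

        while (rdr.Read())
        {
            switch (rdr.NodeType)
            {
                case XmlNodeType.EndElement:
                    htmlWriter.WriteEndTag(rdr.Name);
                    htmlWriter.Write(Environment.NewLine);
                    break;
                case XmlNodeType.Element:
                    // Read this before moving the reader onto the attributes.
                    bool isEmpty = rdr.IsEmptyElement;
                    htmlWriter.WriteBeginTag(rdr.Name);
                    for (int i = 0; i < rdr.AttributeCount; i++)
                    {
                        // Position the reader on the attribute so that
                        // Name and Value refer to the attribute itself.
                        rdr.MoveToAttribute(i);
                        htmlWriter.WriteAttribute(rdr.Name, rdr.Value);
                    }
                    rdr.MoveToElement();
                    htmlWriter.Write(isEmpty ? " />" : ">");
                    htmlWriter.Write(Environment.NewLine);
                    break;
                case XmlNodeType.Text:
                    htmlWriter.Write(rdr.Value);
                    break;
                default:
                    throw new NotImplementedException("Handle " + rdr.NodeType);
            }
        }
        return sw.ToString();
    }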

This should give you a base to work from, anyway.
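
If the goal is the flat list of tags and text from the question rather than reformatted markup, the same reader loop could collect tokens instead. This is a sketch under a few assumptions: the input is well-formed XHTML, attributes are deliberately dropped, text is trimmed, and the class and method names are invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class TagTokenizer
{
    // Splits a well-formed XHTML fragment into opening tags, text and closing tags.
    public static List<string> Tokenize(string xhtml)
    {
        // Fragment conformance allows text and several elements at the top level.
        var settings = new XmlReaderSettings { ConformanceLevel = ConformanceLevel.Fragment };
        var tokens = new List<string>();

        using (XmlReader rdr = XmlReader.Create(new StringReader(xhtml), settings))
        {
            while (rdr.Read())
            {
                switch (rdr.NodeType)
                {
                    case XmlNodeType.Element:
                        tokens.Add("<" + rdr.Name + ">");    // attributes are ignored here
                        break;
                    case XmlNodeType.EndElement:
                        tokens.Add("</" + rdr.Name + ">");
                        break;
                    case XmlNodeType.Text:
                        tokens.Add(rdr.Value.Trim());
                        break;
                    // Other node types (whitespace, comments, ...) are skipped.
                }
            }
        }
        return tokens;
    }
}
```

For `<b>Bold</b> This is the stuff <i>Italic</i>` this produces the seven entries listed in the question: `<b>`, `Bold`, `</b>`, `This is the stuff`, `<i>`, `Italic`, `</i>`.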

Steve Cooper