Hello,

The sad truth about this post is that I have poor regex skills. I recently came across some code in an old project that I seriously want to do something about. Here it is:

strDocument = strDocument.Replace("font size=""1""", "font size=0.2")
strDocument = strDocument.Replace("font size='1'", "font size=0.2")
strDocument = strDocument.Replace("font size=1", "font size=0.2")
strDocument = strDocument.Replace("font size=""2""", "font size=1.5")
strDocument = strDocument.Replace("font size='2'", "font size=1.5")
strDocument = strDocument.Replace("font size=2", "font size=1.5")
strDocument = strDocument.Replace("font size=3", "font size=2")
strDocument = strDocument.Replace("font size=""3""", "font size=2")
strDocument = strDocument.Replace("font size='3'", "font size=2")

I'm guessing there is some easy regex pattern out there that I could use to find the different ways of quoting attribute values and replace them with valid syntax. For example, if somebody wrote some HTML that looks like:

<tag attribute1=value attribute2='value' />

I'd like to be able to easily clean that tag so that it ends up looking like

<tag attribute1="value" attribute2="value" />

The web application I'm working with is 10 years old and there are several thousand validation errors because of missing quotes and tons of other garbage, so if anybody could help me out that would be great!

EDIT:

I gave it a whirl (found some examples), and have something that will work, but would like it to be a little smarter:

Dim input As String = "<tag attribute=value attribute='value' attribute=""value"" />"
Dim test As String = "attribute=(?:(['""])(?<attribute>(?:(?!\1).)*)\1|(?<attribute>\S+))"
Dim result As String = Regex.Replace(input, test, "attribute=""$2""")

This outputs result correctly as:

<tag attribute="value" attribute="value" attribute="value" />

Is there a way I could change (and simplify!) this so that it looks for any attribute name?

UPDATE:

Here's what I have so far based on the comments. Perhaps it could be improved even more:

Dim input As String = "<tag border=2 style='display: none' width=""100%"" />"
Dim test As String = "\s*=\s*(?:(['""])(?<g1>(?:(?!\1).)*)\1|(?<g1>\S+))"
Dim result As String = Regex.Replace(input, test, "=""$2""")

which produces:

<tag border="2" style="display: none" width="100%" />

Any further suggestions? Otherwise I think I answered my own question, with your help of course.
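A note on why $2 works in the replacement above: in .NET, unnamed groups are numbered first and named groups after them, and groups that share a name share one number. So the quote capture is group 1 and both (?<g1>...) branches are group 2. The following is a minimal sketch, assuming the same input and pattern as above (the module name is made up for illustration), that checks whether ${g1} gives the same result as $2:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module GroupNumberingSketch
    Sub Main()
        Dim input As String = "<tag border=2 style='display: none' width=""100%"" />"
        Dim pattern As String = "\s*=\s*(?:(['""])(?<g1>(?:(?!\1).)*)\1|(?<g1>\S+))"

        ' Refer to the captured value once by number and once by name.
        Dim byNumber As String = Regex.Replace(input, pattern, "=""$2""")
        Dim byName As String = Regex.Replace(input, pattern, "=""${g1}""")

        Console.WriteLine(byNumber)
        Console.WriteLine(byName)
        ' Compare the two results; they are expected to be identical.
        Console.WriteLine(byNumber = byName)
    End Sub
End Module
```

Both replacements should print the same line, so the named form is probably the safer choice if more unnamed groups get added to the pattern later.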

FINAL UPDATE

Here is the final product. I hope this helps somebody!

Imports System.Text.RegularExpressions

Module Module1

    Sub Main()
        Dim input As String = "<tag border=2 style='display: none' width=""100%"">Some stuff""""""in between tags==="""" that could be there</tag>" & _
            "<sometag border=2 width=""100%"" /><another that=""is"" completely=""normal"">with some content, of course</another>"

        Console.WriteLine(ConvertMarkupAttributeQuoteType(input, "'"))
        Console.ReadKey()
    End Sub

    Public Function ConvertMarkupAttributeQuoteType(ByVal html As String, ByVal quoteChar As String) As String
        Dim findTags As String = "</?\w+((\s+\w+(\s*=\s*(?:"".*?""|'.*?'|[^'"">\s]+))?)+\s*|\s*)/?>"
        Return Regex.Replace(html, findTags, New MatchEvaluator(Function(m) EvaluateTag(m, quoteChar)))
    End Function

    Private Function EvaluateTag(ByVal match As Match, ByVal quoteChar As String) As String
        Dim attributes As String = "\s*=\s*(?:(['""])(?<g1>(?:(?!\1).)*)\1|(?<g1>[^>\s]+))"
        Return Regex.Replace(match.Value, attributes, String.Format("={0}$2{0}", quoteChar))
    End Function

End Module

I kept the tag finder and the attribute-fixing regex separate from each other in case I want to change how either of them works in the future. Thanks for all your input.

A: 

Drop the word 'attribute', i.e.

Dim test As String = "=(?:(['""])(?<attribute>(?:(?!\1).)*)\1|(?<attribute>\S+))"

which would find every "='something'" string. That's fine as long as you have no other code in the pages, e.g. JavaScript.
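To illustrate the caveat, here is a small sketch with a made-up input (one tag followed by an inline script; the input and the module name are assumptions for illustration only). Because the pattern is applied to the whole string, the assignment inside the script gets quoted as well:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module BareEqualsSketch
    Sub Main()
        ' Hypothetical input: an attribute to fix, then script code containing "=".
        Dim input As String = "<td width=50 /> <script> var x=1 ; </script>"
        Dim test As String = "=(?:(['""])(?<attribute>(?:(?!\1).)*)\1|(?<attribute>\S+))"

        ' Prints: <td width="50" /> <script> var x="1" ; </script>
        Console.WriteLine(Regex.Replace(input, test, "=""$2"""))
    End Sub
End Module
```

The attribute is fixed as intended, but the script now assigns a string instead of a number, which is why limiting the replacement to the inside of tags is worth the extra step.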

Lazarus
A: 

I think it's better not to cram everything into a single mega-regex. I'd prefer several steps:

  1. Identify tag: <([^>]+)/?>
  2. Walk through the tag string and replace malformed attributes with correct ones: replace the pattern \s+([\w]+)\s*=\s*(['"]?)(\S+)(\2) with $1="$3" (with a space after the last quote). I think .NET lets you track the boundaries of each match (Match.Index and Match.Length), which can help you avoid searching through the already corrected part of the tag.
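The two steps could be sketched roughly like this. The tag pattern is the one from step 1. The attribute pattern is NOT the one from step 2: it is an adjusted variant (an assumption of this sketch) that keeps the leading whitespace, lets quoted values contain spaces, and stops unquoted values at ">". The module and function names are made up. A MatchEvaluator is used instead of tracking match boundaries by hand:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module TwoStepSketch
    ' Step 1: anything that looks like a tag.
    Private Const TagPattern As String = "<([^>]+)/?>"

    ' Step 2 (adjusted, see the note above): name=value pairs inside one tag.
    ' Group 1 = whitespace plus attribute name; groups 2, 3, 4 = the value,
    ' depending on whether it was double-quoted, single-quoted, or unquoted.
    Private Const AttributePattern As String = "(\s+\w+)\s*=\s*(?:""([^""]*)""|'([^']*)'|([^\s>]+))"

    Function FixTag(ByVal m As Match) As String
        ' Groups that did not take part in the match expand to an empty string,
        ' so only one of $2, $3, $4 contributes the value.
        Return Regex.Replace(m.Value, AttributePattern, "$1=""$2$3$4""")
    End Function

    Sub Main()
        Dim input As String = "<tag border=2 style='display: none' width=""100%"" />text a=b"

        ' Prints: <tag border="2" style="display: none" width="100%" />text a=b
        Console.WriteLine(Regex.Replace(input, TagPattern, New MatchEvaluator(AddressOf FixTag)))
    End Sub
End Module
```

Text outside of tags ("text a=b") is left alone because only tag matches are passed to the second step. The sketch still has known gaps: a ">" inside an attribute value ends the tag match early, a single-quoted value that contains a double quote comes out broken, and an unquoted value directly followed by "/>" picks up the slash.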
Rorick
+3  A: 

What about using a tool like Tidy (http://tidy.sourceforge.net/), which can clean up your HTML code, instead of hunting down the validation errors on your own with regex? Just my two cents.

da8
A: 

I answered my own question. Please see the FINAL UPDATE in my question for the answer I came up with.

Cory Larson
A: 

I ran into a problem with the final update (8/21/09): it would replace

<font color=red size=4>

with

<font color="red" size="4>"

(placing the closing quote of the second attribute outside the tag's closing bracket)

I changed the attributes string in EvaluateTag to:

Dim attributes As String = "\s*=\s*(?:('|"")(?<g1>(?:(?!\1).)*)\1|(?<g1>[^>|\s]+))"

The change is the [^>|\s] near the end.

This returns my desired results of: <font color="red" size="4">

It works on my exhaustive testcase of one.
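For anyone who wants to reproduce this, here is a small comparison sketch. The first pattern ends in \S+, which is what produces the misplaced quote described above. The second uses [^>\s]+, as in EvaluateTag in the question. The third is the version from this answer, run against a made-up second tag to show the side effect of the pipe inside the character class. Module and variable names are invented for illustration:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module UnquotedValueSketch
    Sub Main()
        Dim tag As String = "<font color=red size=4>"
        Dim greedy As String = "\s*=\s*(?:(['""])(?<g1>(?:(?!\1).)*)\1|(?<g1>\S+))"
        Dim bounded As String = "\s*=\s*(?:(['""])(?<g1>(?:(?!\1).)*)\1|(?<g1>[^>\s]+))"
        Dim piped As String = "\s*=\s*(?:('|"")(?<g1>(?:(?!\1).)*)\1|(?<g1>[^>|\s]+))"

        ' \S+ swallows the closing bracket: <font color="red" size="4>"
        Console.WriteLine(Regex.Replace(tag, greedy, "=""$2"""))

        ' [^>\s]+ stops before it: <font color="red" size="4">
        Console.WriteLine(Regex.Replace(tag, bounded, "=""$2"""))

        ' Inside [...] the pipe is a literal character, so an unquoted value
        ' is cut off at "|": <a rel="x"|y>
        Console.WriteLine(Regex.Replace("<a rel=x|y>", piped, "=""$2"""))
    End Sub
End Module
```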

Al
Good catch. I looked back at my code and I had come up with `\s*=\s*(?:(['""])(?<g1>(?:(?!\1).)*)\1|(?<g1>[^>\s]+))` as my fix (you have a pipe in there and I don't). I guess I forgot to update the post. Thanks!
Cory Larson