Hello,
The sad truth about this post is that I have poor regex skills. I recently came across some code in an old project that I seriously want to do something about. Here it is:
strDocument = strDocument.Replace("font size=""1""", "font size=0.2")
strDocument = strDocument.Replace("font size='1'", "font size=0.2")
strDocument = strDocument.Replace("font size=1", "font size=0.2")
strDocument = strDocument.Replace("font size=""2""", "font size=1.5")
strDocument = strDocument.Replace("font size='2'", "font size=1.5")
strDocument = strDocument.Replace("font size=2", "font size=1.5")
strDocument = strDocument.Replace("font size=3", "font size=2")
strDocument = strDocument.Replace("font size=""3""", "font size=2")
strDocument = strDocument.Replace("font size='3'", "font size=2")
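For reference, the nine calls above could probably be collapsed into a single pattern that accepts all three quoting styles at once. This is only a minimal sketch: `RemapFontSizes` and `SizeMap` are hypothetical names, not part of the project, and the `(?!\d)` lookahead is an added assumption so that a value such as `size=10` is left alone (the original `Replace` calls would have rewritten it).

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module FontSizeSketch
    ' Hypothetical lookup table: old size value -> new size value, as in the calls above.
    Private ReadOnly SizeMap As New Dictionary(Of String, String) From {{"1", "0.2"}, {"2", "1.5"}, {"3", "2"}}

    Public Function RemapFontSizes(ByVal html As String) As String
        ' q captures an optional opening quote; \k<q> requires the same character
        ' (or nothing) to close the value. Only the values 1, 2 and 3 are matched.
        Dim pattern As String = "font size=(?<q>['""]?)(?<n>[123])\k<q>(?!\d)"
        Return Regex.Replace(html, pattern, New MatchEvaluator(Function(m) "font size=" & SizeMap(m.Groups("n").Value)))
    End Function

    Sub Main()
        ' Prints: <font size=0.2><font size=1.5><font size=2>
        Console.WriteLine(RemapFontSizes("<font size=""1""><font size='2'><font size=3>"))
    End Sub
End Module
```

Like the original calls, this writes the new value without quotes; it only removes the repetition.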
I'm guessing there is some easy regex pattern out there that I could use to find the different ways of quoting attribute values and replace them with valid syntax. For example, if somebody wrote some HTML that looks like:
<tag attribute1=value attribute2='value' />
I'd like to be able to easily clean that tag so that it ends up looking like
<tag attribute1="value" attribute2="value" />
The web application I'm working with is 10 years old and there are several thousand validation errors because of missing quotes and tons of other garbage, so if anybody could help me out that would be great!
EDIT:
I gave it a whirl (found some examples), and have something that will work, but would like it to be a little smarter:
Dim input As String = "<tag attribute=value attribute='value' attribute=""value"" />"
Dim test As String = "attribute=(?:(['""])(?<attribute>(?:(?!\1).)*)\1|(?<attribute>\S+))"
Dim result As String = Regex.Replace(input, test, "attribute=""$2""")
This outputs result correctly as:
<tag attribute="value" attribute="value" attribute="value" />
Is there a way I could change (and simplify!) this a bit so that it looks for any attribute name?
UPDATE:
Here's what I have so far based on the comments. Perhaps it could be improved even more:
Dim input As String = "<tag border=2 style='display: none' width=""100%"" />"
Dim test As String = "\s*=\s*(?:(['""])(?<g1>(?:(?!\1).)*)\1|(?<g1>\S+))"
Dim result As String = Regex.Replace(input, test, "=""$2""")
which produces:
<tag border="2" style="display: none" width="100%" />
Any further suggestions? Otherwise I think I answered my own question, with your help of course.
FINAL UPDATE:
Here is the final product. I hope this helps somebody!
Imports System.Text.RegularExpressions

Module Module1

    Sub Main()
        Dim input As String = "<tag border=2 style='display: none' width=""100%"">Some stuff""""""in between tags==="""" that could be there</tag>" & _
            "<sometag border=2 width=""100%"" /><another that=""is"" completely=""normal"">with some content, of course</another>"
        ' Requote every attribute value with single quotes.
        Console.WriteLine(ConvertMarkupAttributeQuoteType(input, "'"))
        Console.ReadKey()
    End Sub

    ' Finds each tag and rewrites its attribute values using quoteChar.
    ' Only the matched tags are passed to EvaluateTag; text between tags is not.
    Public Function ConvertMarkupAttributeQuoteType(ByVal html As String, ByVal quoteChar As String) As String
        Dim findTags As String = "</?\w+((\s+\w+(\s*=\s*(?:"".*?""|'.*?'|[^'"">\s]+))?)+\s*|\s*)/?>"
        Return Regex.Replace(html, findTags, New MatchEvaluator(Function(m) EvaluateTag(m, quoteChar)))
    End Function

    ' Requotes the attribute values inside a single matched tag.
    ' $2 is the named group g1 (.NET numbers named groups after unnamed ones).
    Private Function EvaluateTag(ByVal match As Match, ByVal quoteChar As String) As String
        Dim attributes As String = "\s*=\s*(?:(['""])(?<g1>(?:(?!\1).)*)\1|(?<g1>[^>\s]+))"
        Return Regex.Replace(match.Value, attributes, String.Format("={0}$2{0}", quoteChar))
    End Function

End Module
I felt it was best to keep the tag finder and the attribute-fixing regex separate from each other in case I want to change how they each work in the future. Thanks for all your input.
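One optional tweak to the attribute replacement: the `$2` in the replacement string works only because .NET numbers named groups after the unnamed ones, so the quote capture is group 1 and `g1` becomes group 2. Referring to the group by name avoids depending on that ordering. The sketch below assumes the same attribute pattern as above; `QuoteAttributes` is a hypothetical name, and it uses plain concatenation rather than `String.Format` so the braces in `${g1}` need no escaping.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module NamedGroupSketch
    ' Hypothetical variant of EvaluateTag that takes the tag text directly.
    Public Function QuoteAttributes(ByVal tag As String, ByVal quoteChar As String) As String
        Dim attributes As String = "\s*=\s*(?:(['""])(?<g1>(?:(?!\1).)*)\1|(?<g1>[^>\s]+))"
        ' ${g1} substitutes the named capture, whichever alternative matched.
        Return Regex.Replace(tag, attributes, "=" & quoteChar & "${g1}" & quoteChar)
    End Function

    Sub Main()
        ' Prints: <tag border="2" style="display: none" width="100%">
        Console.WriteLine(QuoteAttributes("<tag border=2 style='display: none' width=""100%"">", """"))
    End Sub
End Module
```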