This is a follow-up to another question of mine. The solution I found worked for every test case I threw at it, until a case showed up that I hadn't anticipated the first time around.

My goal is to reformat improperly formatted tag attributes using regex (I know, it's probably not a foolproof method, as I'm finding out, but bear with me).

My functions:

Public Function ConvertMarkupAttributeQuoteType(ByVal html As String) As String
    Dim findTags As String = "</?\w+((\s+\w+(\s*=\s*(?:"".*?""|'.*?'|[^'"">\s]+))?)+\s*|\s*)/?>"
    Return Regex.Replace(html, findTags, AddressOf EvaluateTag)
End Function

Private Function EvaluateTag(ByVal match As Match) As String
    Dim attributes As String = "\s*=\s*(?:(['""])(?<g1>(?:(?!\1).)*)\1|(?<g1>\S+))"
    Return Regex.Replace(match.Value, attributes, "='$2'")
End Function

The regex in the EvaluateTag function will correctly transform HTML like

<table border=2 cellpadding='2' cellspacing="1">

into

<table border='2' cellpadding='2' cellspacing='1'>

You'll notice I'm forcing attribute values to be surrounded by single quotes; don't worry about that. The case it breaks on is when the last attribute value has no quotes around it.

<table width=100 border=0>

comes out of the regex replace as

<table width='100' border='0>'

with the last single quote incorrectly outside of the tag. I've confessed before that I'm not good at regex at all; I just haven't taken the time to understand everything it can do. So, I'm asking for some help adjusting the EvaluateTag regex so that it can handle this final case.

Thank you!

A: 

The first RegEx function will pass EvaluateTag the entire match, which is the entire HTML tag.

But EvaluateTag doesn't ignore the final greater-than character...
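
To make that concrete, here is a minimal sketch that prints what the g1 group captures for each attribute in the failing tag. The pattern is copied from the question; the module name and the console harness are hypothetical and only there for illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module UnquotedValueDemo
    Sub Main()
        ' Attribute pattern from the question's EvaluateTag function.
        Dim attributes As String = "\s*=\s*(?:(['""])(?<g1>(?:(?!\1).)*)\1|(?<g1>\S+))"
        Dim tag As String = "<table width=100 border=0>"

        ' Show the value captured by g1 for every attribute match.
        For Each m As Match In Regex.Matches(tag, attributes)
            Console.WriteLine("[" & m.Groups("g1").Value & "]")
        Next
        ' Prints [100] and then [0>]: the unquoted branch \S+ stops only at
        ' whitespace, so the closing ">" is captured as part of the last value.
    End Sub
End Module
```

Because the first value is followed by a space, only the last unquoted value in a tag runs into the closing bracket.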

I'm afraid I haven't had enough caffeine yet to work through the entire expression, but this adjustment may work (added a greater-than in the character list):

 Private Function EvaluateTag(ByVal match As Match) As String
   Dim attributes As String = "\s*=\s*(?:(['"">])(?<g1>(?:(?!\1).)*)\1|(?<g1>\S+))"
   Return Regex.Replace(match.Value, attributes, "='$2'")
 End Function
richardtallent
That didn't quite work. Actually, it didn't have any effect at all on the original regex.
Cory Larson
A: 

richardtallent's explanation of why the regex wasn't working pointed me in the right direction. After playing around a bit, the following replacement for the EvaluateTag function seems to be working.

Can anybody see anything problematic with it? The change I made is in the last group after the pipe. Maybe it could be simplified even further?

 Private Function EvaluateTag(ByVal match As Match) As String
   Dim attributes As String = "\s*=\s*(?:(['""])(?<g1>(?:(?!\1).)*)\1|(?<g1>[^>\s]+))"
   Return Regex.Replace(match.Value, attributes, "='$2'")
 End Function
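
A quick way to check the change is to run the pattern over the sample tags from the question. This is only a sketch: the module, the Requote helper, and the third test tag are hypothetical additions, not part of the original code.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module EvaluateTagCheck
    ' Same pattern as above: the unquoted branch now stops at ">" as well as whitespace.
    Private Const Attributes As String = "\s*=\s*(?:(['""])(?<g1>(?:(?!\1).)*)\1|(?<g1>[^>\s]+))"

    ' Hypothetical helper that applies the replacement to a single tag.
    Function Requote(ByVal tag As String) As String
        ' $2 refers to g1: .NET numbers named groups after the unnamed ones.
        Return Regex.Replace(tag, Attributes, "='$2'")
    End Function

    Sub Main()
        ' Prints <table width='100' border='0'>
        Console.WriteLine(Requote("<table width=100 border=0>"))
        ' Prints <table border='2' cellpadding='2' cellspacing='1'>
        Console.WriteLine(Requote("<table border=2 cellpadding='2' cellspacing=""1"">"))
        ' Prints <br clear='all/'> (the self-closing slash is pulled into the value)
        Console.WriteLine(Requote("<br clear=all/>"))
    End Sub
End Module
```

The third case suggests one remaining gap: an unquoted value immediately followed by "/>" still captures the slash. Double-quoted values that contain an apostrophe would also need extra handling, since they are rewrapped in single quotes as-is.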

If no one responds I'll probably accept this as the answer. Thanks again!

Cory Larson