Hi,

This is related to a previous question I asked here; see the link below for a brief description of why I am trying to do this.

Regular expression from font to span (size and colour) and back (VB.NET)

Basically, I need a regex replace function (or if this can be done in pure VB, that's fine) to convert all ul tags in a string to textformat tags, with a different attribute value for the first textformat tag.

For example:

<ul>
   <li>This is some text</li>
   <li>This is some more text</li>
   <li>
      <ul>
         <li>This is some indented text</li>
         <li>This is some more text</li>
      </ul>
   </li>
   <li>More text!</li>
   <li>
      <ul>
         <li>This is some indented text</li>
         <li>This is some more text</li>
      </ul>
   </li>
   <li>More text!</li>
</ul>

<ul>
   <li>Another list item</li>
   <li>
      <ul>
         <li>Another nested list item</li>
      </ul>
   </li>
</ul>

Will become:

<textformat indent="0">
   <li>This is some text</li>
   <li>This is some more text</li>
   <li>
      <textformat indent="20">
         <li>This is some indented text</li>
         <li>This is some more text</li>
      </textformat>
   </li>
   <li>More text!</li>
   <li>
      <textformat indent="20">
         <li>This is some indented text</li>
         <li>This is some more text</li>
      </textformat>
   </li>
   <li>More text!</li>
</textformat>

<textformat indent="0">
   <li>Another list item</li>
   <li>
      <textformat indent="20">
         <li>Another nested list item</li>
      </textformat>
   </li>
</textformat>

Basically, I want the first (outermost) ul tag of each list to have no indent, but all nested ul tags to have an indent of 20.

I appreciate this is a strange request, but hopefully it makes sense. Please let me know if you have any questions.

Thanks in advance.

+1  A: 

It's possible with regex, but LINQ to XML is simpler. I've included both a LINQ to XML and a regex solution, although I would favor the former.

Here's the LINQ to XML approach. Since ul is the top element, its Name can be changed directly, and Descendants will grab all the nested ul elements. The only caveat is that this works only if the input is well-formed XML; if it isn't, XElement.Parse will throw an exception. Also, if the input is well-formed but the ul isn't the top element and is instead part of a larger block of HTML, you'll need to loop over Elements("ul") and do the same thing for each of them.

If the HTML is malformed you may want to look at the HTML Agility Pack.

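' Requires: Imports System.Xml.Linq
' Parse throws an XmlException if the input is not well-formed.
Dim xml = XElement.Parse(input)
' The root ul becomes the unindented textformat element.
xml.Name = "textformat"
xml.SetAttributeValue("indent", "0")
' Every ul below the root is a nested list and gets an indent of 20.
For Each item In xml.Descendants("ul")
    item.Name = "textformat"
    item.SetAttributeValue("indent", "20")
Next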
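
For input that contains several top-level lists, as in the question's example, the loop over root-level elements described above could be sketched roughly as follows. This is only a sketch: the `<root>` wrapper element and the `ConvertLists` name are assumptions made for illustration, and the input still has to be well-formed XML.

```vbnet
Option Infer On
Imports System
Imports System.Linq
Imports System.Xml.Linq

Module UlToTextFormat
    ' Hypothetical helper: converts every <ul> in a well-formed fragment.
    ' The <root> wrapper is an assumption that lets several top-level
    ' lists be parsed as a single XML document.
    Function ConvertLists(ByVal input As String) As String
        Dim wrapper = XElement.Parse("<root>" & input & "</root>")

        ' Record, before renaming anything, whether each <ul> sits inside
        ' another <ul>. ToList() forces this to be evaluated up front;
        ' once an outer list is renamed, Ancestors("ul") no longer finds it.
        Dim lists = wrapper.Descendants("ul").
            Select(Function(u) New With {.Element = u, .IsNested = u.Ancestors("ul").Any()}).
            ToList()

        For Each entry In lists
            entry.Element.Name = "textformat"
            entry.Element.SetAttributeValue("indent", If(entry.IsNested, "20", "0"))
        Next

        ' Return the converted lists without the temporary wrapper.
        Return String.Join(Environment.NewLine,
                           wrapper.Nodes().Select(Function(n) n.ToString()))
    End Function
End Module
```

Note that XElement.ToString re-indents the markup, so the whitespace in the result may differ from the input.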

And here's the regex approach. It's not easy to detect the first ul with a single pattern, so this approach changes all of them to an indent of 20 and then takes an extra step to find the first textformat and change its indent to zero.

' Requires: Imports System.Text.RegularExpressions
' Step 1: convert every ul tag, giving each opening tag an indent of 20.
Dim pattern As String = "<ul>|</ul>"
Dim result As String = Regex.Replace(input, pattern, Function(m) If(m.Value.StartsWith("</"), "</textformat>", "<textformat indent=""20"">"))
' Step 2: reset the first textformat to 0. The ^ anchor means this only matches when the tag starts the string.
Dim firstTextFormatPattern As String = "^(?<Start><textformat\s+indent="")\d+?(?<End>"">)"
result = Regex.Replace(result, firstTextFormatPattern, "${Start}0${End}")
Ahmad Mageed
Thanks again for your help. I am looking into the HTML Agility Pack for the LINQ solution, as this is throwing a lot of exceptions (my Flash-based HTML can often appear malformed). With your regex solution, the final replace statement doesn't appear to be working: the first textformat tag isn't changing to have an indent attribute of zero. Also, will both solutions cope with multiple ULs in the same string (specifically, converting the first tag's indent attribute)? By this I mean if they received a string containing my example above twice. I hope that makes sense.
chapmanio
@chapmanio the last pattern assumes the textformat tag is at the very beginning of the string. If it occurs anywhere else within a larger string, it won't match. To answer your other question, the first regex pattern will affect all occurrences in the string. LINQ to XML would need to be tweaked to use `Elements("ul")` to find all `ul` elements at the root level, then `For Each` over them, get their `Descendants`, and carry on as shown in my example. What the entire string looks like would make a difference. Glad you found a solution to your problem.
Ahmad Mageed
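
To illustrate the point about the anchor in the comment above: one possible way to reset only the first textformat tag, wherever it sits in the string, is to drop the `^` and limit the replacement count to one. This is a sketch under that assumption, and `ResetFirstIndent` is a hypothetical name. It still changes only the first tag, so it does not handle several top-level lists in one string.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module FirstTextFormatDemo
    ' Hypothetical helper: sets the indent of the first textformat tag
    ' to 0, regardless of where that tag occurs in the input.
    Function ResetFirstIndent(ByVal input As String) As String
        ' No ^ anchor, so the tag does not have to start the string.
        Dim rx As New Regex("(?<Start><textformat\s+indent="")\d+(?<End>"">)")
        ' The third argument caps the number of replacements at one.
        Return rx.Replace(input, "${Start}0${End}", 1)
    End Function
End Module
```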
A: 

Thanks for your help with this. I have managed to work out a solution myself using your reply.

Basically, I am using a counter to keep track of the nesting level of each ul tag the regex finds, and then replacing the tag with one carrying the relevant attribute:

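' Declare ulCounter at class level so that Convert_UL can update it.
' Requires: Imports System.Text.RegularExpressions
Dim ulCounter As Integer = 0
Dim rxUL As New Regex("<ul>|</ul>")

xmlValue = rxUL.Replace(xmlValue, AddressOf Convert_UL)


' MatchEvaluator callback: called once for each <ul> or </ul> found.
Protected Function Convert_UL(ByVal m As Match) As String

    Dim HTML As String = ""

    If m.Value = "</ul>" Then
        ' Closing tag: step back out one nesting level.
        ulCounter -= 1

        HTML = "</textformat>"
    Else
        ' Opening tag: step in one nesting level.
        ulCounter += 1

        If ulCounter > 1 Then
            ' Nested list.
            HTML = "<textformat indent=""20"">"
        Else
            ' Outermost list.
            HTML = "<textformat indent=""0"">"
        End If
    End If

    Return HTML

End Function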
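
For anyone who wants to try the counter idea in isolation, here is a minimal self-contained sketch. It assumes a console module and keeps the counter in a local variable captured by a lambda rather than in a class field; the names `UlDepthDemo`, `ConvertUl`, and `depth` are made up for this example.

```vbnet
Option Infer On
Imports System
Imports System.Text.RegularExpressions

Module UlDepthDemo
    ' Hypothetical helper: same depth-counter technique, with the counter
    ' held in a local variable that the lambda captures and updates.
    Function ConvertUl(ByVal input As String) As String
        Dim depth As Integer = 0
        Return Regex.Replace(input, "<ul>|</ul>",
            Function(m As Match) As String
                If m.Value = "</ul>" Then
                    ' Leaving a list: one level shallower.
                    depth -= 1
                    Return "</textformat>"
                End If
                ' Entering a list: depth 1 is outermost, deeper is nested.
                depth += 1
                Return If(depth > 1, "<textformat indent=""20"">", "<textformat indent=""0"">")
            End Function)
    End Function

    Sub Main()
        ' Two top-level lists, the first containing one nested list.
        Dim sample = "<ul><li>a</li><li><ul><li>b</li></ul></li></ul><ul><li>c</li></ul>"
        Console.WriteLine(ConvertUl(sample))
    End Sub
End Module
```

Because the counter returns to zero after each closing outermost tag, every top-level list in the string gets an indent of 0, not just the first one. The pattern only matches bare, lower-case `<ul>` tags, so tags with attributes or different casing would be left untouched.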

This was a pretty random request, so I'm not sure how much help this will be to anyone else, but just in case, that was how I got round it!

chapmanio