Using the following text as a sample, I need to extract the text between LI tags. Notice that the first LI is intentionally malformed (it has no closing tag), as this may occur in the real input. Said another way, I want everything from an LI tag to either its closing LI tag or the next opening LI tag.

    <UL>
    <LI class="test">This is the first ListItem Text. 
    <LI>This is the second ListItem Test. </LI></UL>

So far I have come up with:

<[Ll][Ii].*>(.*?)((?:<[Ll][Ii]>)|(?:</[Ll][Ii]>))

But this appears to match from the first LI tag through to the closing tag as a single match, with the group containing only the text of the second LI tag. I've managed to get it to return the first item, but never both. I'm using the "Dot matches newline" option as well, and I need this to work in .NET. Thanks!

UPDATE

I had done some research prior to posting this question and did in fact see and understand that using regexes to parse HTML is a bad idea. That being said, I only need to get text from a couple of LI tags here and there to determine what text to bulletize on a PowerPoint slide. I thought there might be a simpler way to do it than dealing with a separate library, especially since third-party libraries are tricky to use where I work. Unfortunately, it appears that the HTML can end up malformed in certain situations when using an HTML rich text entry box on a page that allows you to bulletize text. Thanks for all of the recommendations against using regex for parsing HTML. I should have specified up front that I had already read a lot of similar advice but was looking for a quick workaround for a simple set of circumstances.

+5  A: 

If this is a recurring scenario, I would rather use an HTML parser. Parsing HTML with regex will take a tremendous amount of time and might still turn out buggy because of the malformed input you mentioned.

Here's one I found with a basic Google search:
http://www.netomatix.com/products/Documentmanagement/HtmlParserNet.aspx

UPDATE:

Here are some related posts on StackOverflow:
http://stackoverflow.com/questions/710677/how-do-you-parse-a-poorly-formatted-html-file
http://stackoverflow.com/questions/56107/what-is-the-best-way-to-parse-html-in-c

Slavo
While not exactly the solution/route I wanted to have to take for this, I recognize that it really is the RIGHT answer. Thanks.
Tom
+1  A: 

As Slavo mentioned, this is difficult. The example you give is particularly tricky because the second "<LI>" needs to be treated as both the closing tag of the first match, and the opening tag of the second. This is hard.
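To make the "both closing and opening tag" problem concrete, here is a minimal sketch in Python rather than .NET (the two regex flavors behave the same way for these constructs). The patterns and variable names are illustrative assumptions, not taken from the question. The first pattern consumes the delimiter, so the second `<LI>` is no longer available to start a match; the second pattern only peeks at the delimiter with a lookahead.

```python
import re

# The sample from the question: the first <LI> has no closing tag.
SAMPLE = (
    '<UL>\n'
    '<LI class="test">This is the first ListItem Text. \n'
    '<LI>This is the second ListItem Test. </LI></UL>'
)
FLAGS = re.IGNORECASE | re.DOTALL  # case-insensitive, dot matches newline

# Delimiter is consumed: the second <LI> is swallowed by the first match.
consuming = re.compile(r'<li[^>]*>(.*?)(?:<li[^>]*>|</li>)', FLAGS)

# Delimiter is only peeked at: the second <LI> can still start a new match.
lookahead = re.compile(r'<li[^>]*>(.*?)(?=<li[^>]*>|</li>)', FLAGS)

consumed_items = [text.strip() for text in consuming.findall(SAMPLE)]
peeked_items = [text.strip() for text in lookahead.findall(SAMPLE)]

print(consumed_items)  # only the first item is found
print(peeked_items)    # both items are found
```

In .NET the same flags would presumably be `RegexOptions.IgnoreCase | RegexOptions.Singleline`.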

On a totally unrelated note, you can set regex flags to be case insensitive, so that you don't have to do [Ll][Ii], etc.
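As a small illustration of that flag, sketched in Python (in .NET the corresponding option is `RegexOptions.IgnoreCase`), the three patterns below accept the same tags. The tag list is made up for the example.

```python
import re

explicit = re.compile(r'<[Ll][Ii][^>]*>')          # spelled-out character classes
flagged = re.compile(r'<li[^>]*>', re.IGNORECASE)  # flag passed to the compiler
inline = re.compile(r'(?i)<li[^>]*>')              # flag embedded in the pattern

tags = ['<LI>', '<li>', '<Li class="test">', '<lI>']

# Each tuple records whether (explicit, flagged, inline) matched the tag.
results = [
    (bool(explicit.match(t)), bool(flagged.match(t)), bool(inline.match(t)))
    for t in tags
]
print(results)
```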

Chad Birch
A: 

If your input is reasonably valid (and the list items contain text only), you might get away with:

<li[^>]*>([^<]*)

Apply as global/case insensitive and look for the contents of match group 1.

The result will need some normalization (trimming, replacing newlines).
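A minimal sketch of how this could be applied, written in Python for brevity (the pattern itself needs no changes for .NET). The helper name `list_item_texts` is hypothetical, and the normalization shown, collapsing whitespace runs and trimming, is just one reasonable choice.

```python
import re

# The pattern from above, applied case-insensitively.
ITEM = re.compile(r'<li[^>]*>([^<]*)', re.IGNORECASE)

def list_item_texts(html):
    """Return the text following each <li> tag, with whitespace normalized."""
    # str.split() with no arguments splits on any whitespace run, so joining
    # with a single space both trims the ends and replaces newlines.
    return [' '.join(raw.split()) for raw in ITEM.findall(html)]

SAMPLE = (
    '<UL>\n'
    '<LI class="test">This is the first ListItem Text. \n'
    '<LI>This is the second ListItem Test. </LI></UL>'
)
items = list_item_texts(SAMPLE)
print(items)
```

Because `[^<]*` stops at the first `<`, an item such as `<li>one <b>two</b></li>` yields only `one`, which is why the "text only" condition matters.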

Tomalak
Nevertheless - Regex is bad for HTML parsing, like some of the others said. This is why I said "might get away with".
Tomalak
+1  A: 

Try this.

<li.*?>(.*?)(?=</li>|<li.*?>|</ul>|\Z)

Note that you need to use the RegexOptions.IgnoreCase option for this to work, but it makes your expression much more readable.
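Here is a rough sketch of that expression in use, in Python instead of .NET. Two assumptions are worth flagging: the "dot matches newline" option from the question is enabled (`re.DOTALL` here, `RegexOptions.Singleline` in .NET), and `\Z` behaves slightly differently between the two flavors: in Python it matches only at the very end of the input, while in .NET it also matches just before a final newline. The function name `extract_items` is hypothetical.

```python
import re

ITEM = re.compile(
    r'<li.*?>(.*?)(?=</li>|<li.*?>|</ul>|\Z)',
    re.IGNORECASE | re.DOTALL,
)

def extract_items(html):
    """Return the trimmed text of each list item found in html."""
    return [match.group(1).strip() for match in ITEM.finditer(html)]

SAMPLE = (
    '<UL>\n'
    '<LI class="test">This is the first ListItem Text. \n'
    '<LI>This is the second ListItem Test. </LI></UL>'
)
items = extract_items(SAMPLE)
print(items)

# No closing tag of any kind: the \Z alternative ends the item at end of input.
unterminated = extract_items('<ul><li>only item')
print(unterminated)
```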

harpo
This will break if both </li> and </ul> are missing.
Tomalak
@Tomalak: It should also pick up text to the next <li> tag, as requested, and even the rest of the string if there are no more </li>, <li> or </ul> tags. Looks like exactly what the question asked for.
Whatsit
@Whatsit: I don't recognize the requirement to match up to the end of the input in the question. Where does the OP say that?
Tomalak
@Tomalak: They didn't, so I suppose technically it's not *exactly* what they asked for, but I'd expect this is what they *want*.
Whatsit
+1  A: 

I feel like a broken vinyl record, but: don't use regular expressions to parse non-regular languages.

There are tons of .NET HTML parsers available, some of which can also correct malformed HTML. I googled ".net html parser malformed" and there seem to be some promising results.

Svante
+1  A: 

Regexes are bad at parsing HTML (for why, see "Can you provide some examples of why it is hard to parse XML and HTML with a regex?"). What you need is an HTML parser like Html Agility Pack.
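To give a feel for the parser-based approach, here is a sketch using Python's standard-library `html.parser` as a stand-in. It is not Html Agility Pack and does not show that library's API; the class and function names are hypothetical. The idea is that a tolerant parser reports tags and text as events, so an unclosed LI can be ended when the next LI starts.

```python
from html.parser import HTMLParser

class ListItemCollector(HTMLParser):
    """Collects the text of each <li>, tolerating missing </li> tags."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._current = None  # text chunks of the open <li>, or None

    def _close_item(self):
        if self._current is not None:
            # Join the chunks and collapse whitespace runs (incl. newlines).
            self.items.append(' '.join(''.join(self._current).split()))
            self._current = None

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag names, so <LI> arrives as 'li'.
        if tag == 'li':
            self._close_item()  # a new <li> implicitly ends the previous one
            self._current = []

    def handle_endtag(self, tag):
        if tag in ('li', 'ul', 'ol'):
            self._close_item()

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data)

def list_item_texts(html):
    collector = ListItemCollector()
    collector.feed(html)
    collector.close()
    collector._close_item()  # flush an item still open at end of input
    return collector.items

SAMPLE = (
    '<UL>\n'
    '<LI class="test">This is the first ListItem Text. \n'
    '<LI>This is the second ListItem Test. </LI></UL>'
)
items = list_item_texts(SAMPLE)
print(items)
```

Unlike the `[^<]*` pattern, this keeps text inside inline markup such as `<b>`. It makes no attempt to handle nested lists, which would need extra bookkeeping.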

Chas. Owens