ansaurus

Question

Use REGEX to find Contents of HTML ListItem (.NET)

Answer 1

+5 A:

If this is a recurring scenario, I would rather use an HTML parser. Getting a regex to parse HTML correctly will take a tremendous amount of time, and the result might still turn out buggy because of the malformed input you mentioned.

Here's one I found with a basic Google search:
http://www.netomatix.com/products/Documentmanagement/HtmlParserNet.aspx

UPDATE:

Here are some related posts on StackOverflow:
http://stackoverflow.com/questions/710677/how-do-you-parse-a-poorly-formatted-html-file
http://stackoverflow.com/questions/56107/what-is-the-best-way-to-parse-html-in-c

Slavo 2009-04-21 15:00:16

While not exactly the solution/route I wanted to have to take for this, I recognize that it really is the RIGHT answer. Thanks.

Tom 2009-05-27 12:36:24

Answer 2

+1 A:

As Slavo mentioned, this is difficult. The example you give is particularly tricky because the second "<LI>" needs to be treated as both the end of the first match and the opening tag of the second.

On a totally unrelated note, you can set regex flags to be case insensitive, so that you don't have to do [Ll][Ii], etc.
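
As a minimal sketch of that flag, assuming the standard System.Text.RegularExpressions API (the sample string and class name are made up for illustration), the hand-written character classes, the RegexOptions.IgnoreCase option and the inline (?i) modifier should all find the same tags:

```csharp
using System;
using System.Text.RegularExpressions;

class CaseInsensitiveDemo
{
    static void Main()
    {
        // Hypothetical sample input mixing upper- and lower-case tags.
        string html = "<LI>one<li>two<Li>three";

        // Verbose form: spell out both cases by hand.
        int verbose = Regex.Matches(html, "<[Ll][Ii]>").Count;

        // Same pattern, with case handled by the IgnoreCase option.
        int withFlag = Regex.Matches(html, "<li>", RegexOptions.IgnoreCase).Count;

        // The inline (?i) modifier turns on the same option inside the pattern.
        int withInline = Regex.Matches(html, "(?i)<li>").Count;

        // All three counts should agree for this input.
        Console.WriteLine($"{verbose} {withFlag} {withInline}");
    }
}
```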

Chad Birch 2009-04-21 15:01:46

Answer 3

A:

If your input is reasonably valid (and the list items contain text only), you might get away with:

<li[^>]*>([^<]*)

Apply as global/case insensitive and look for the contents of match group 1.

The result will need some normalization (trimming, replacing newlines).
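
A rough C# sketch of how this could be applied, assuming text-only list items as stated above. The sample input, the class name and the whitespace-collapsing step are illustrative choices, not part of the original answer:

```csharp
using System;
using System.Text.RegularExpressions;

class ListItemText
{
    static void Main()
    {
        // Hypothetical input: unclosed, mixed-case items containing text only.
        string html = "<UL>\n<LI>First item\n<li class=\"x\">Second\n  item\n</UL>";

        // Pattern from the answer; group 1 holds everything up to the next '<'.
        var pattern = new Regex(@"<li[^>]*>([^<]*)", RegexOptions.IgnoreCase);

        // Regex.Matches returns every match, which covers the "global" part.
        foreach (Match m in pattern.Matches(html))
        {
            // Normalize: collapse whitespace runs (including newlines), then trim.
            string text = Regex.Replace(m.Groups[1].Value, @"\s+", " ").Trim();
            Console.WriteLine(text);
        }
    }
}
```

Note that the pattern would also match any other tag whose name starts with "li" (for example <link>), and it stops at the first nested tag inside an item, which is why it only suits text-only items.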

Tomalak 2009-04-21 15:03:15

Nevertheless - Regex is bad for HTML parsing, like some of the others said. This is why I said "might get away with".

Tomalak 2009-04-21 15:10:30

Answer 4

+1 A:

Try this.

<li.*?>(.*?)(?=</li>|<li.*?>|</ul>|\Z)

Note that you need to use the RegexOptions.IgnoreCase option for this to work, but it makes your expression much more readable.
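
A small C# sketch of the pattern in use, with a made-up input string. RegexOptions.Singleline is an addition that is not in the answer: without it, '.' does not match newlines in .NET, so an item that spans several lines would not be captured in full.

```csharp
using System;
using System.Text.RegularExpressions;

class LookaheadListItems
{
    static void Main()
    {
        // Hypothetical malformed input: one closed item, two unclosed ones.
        string html = "<ul><li>one</li><LI>two\nlines<li>three</UL>";

        var pattern = new Regex(
            @"<li.*?>(.*?)(?=</li>|<li.*?>|</ul>|\Z)",
            // Singleline is an assumption added here so '.' also matches '\n'.
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        foreach (Match m in pattern.Matches(html))
        {
            // Group 1 is the item content up to the next delimiter or end of input.
            Console.WriteLine("[" + m.Groups[1].Value.Trim() + "]");
        }
    }
}
```

The lookahead does not consume the delimiter, so an unclosed item's following <li> is still available as the start of the next match.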

harpo 2009-04-21 15:03:50

This will break if both </li> and </ul> are missing.

Tomalak 2009-04-21 15:14:55

@Tomalak: It should also pick up text to the next <li> tag, as requested, and even the rest of the string if there are no more </li>, <li> or </ul> tags. Looks like exactly what the question asked for.

Whatsit 2009-04-21 15:17:42

@Whatsit: I don't recognize the requirement to match up to the end of the input in the question. Where does the OP say that?

Tomalak 2009-04-21 15:21:32

@Tomalak: They didn't, so I suppose technically it's not *exactly* what they asked for, but I'd expect this is what they *want*.

Whatsit 2009-04-21 15:24:30

Answer 5

+1 A:

I feel like a broken vinyl record, but: don't use regular expressions to parse non-regular languages.

There are tons of .NET HTML parsers available, and some of them can also correct malformed HTML. I googled ".net html parser malformed" and there seem to be some promising results.

Svante 2009-04-21 15:04:51

Answer 6

+1 A:

Regexes are bad at parsing HTML (see "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for why). What you need is an HTML parser like Html Agility Pack.
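
For comparison, a minimal sketch of the parser route, assuming the HtmlAgilityPack NuGet package is installed. The input string and class name are hypothetical, and exactly how unclosed <li> tags get repaired may depend on the library version:

```csharp
using System;
using HtmlAgilityPack; // third-party package, assumed to be referenced

class AgilityPackListItems
{
    static void Main()
    {
        // Hypothetical malformed input with unclosed, mixed-case items.
        string html = "<ul><li>one</li><LI>two<li>three</UL>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null rather than an empty collection when nothing matches.
        HtmlNodeCollection items = doc.DocumentNode.SelectNodes("//li");
        if (items == null)
        {
            return;
        }

        foreach (HtmlNode li in items)
        {
            // InnerText is the item's text with any nested markup removed.
            Console.WriteLine(li.InnerText.Trim());
        }
    }
}
```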

Chas. Owens 2009-04-21 15:05:06
