ansaurus

Question

Regular expression to find the start and end of a list in HTML

Answer 1

+6 A:

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. -- Jamie Zawinski

Regular expressions and HTML are a particularly bad fit.
This is 2009; use closing tags in your HTML. (That alone will help you if you really want to regex your HTML.)
If you've already got this page inside a browser, use the DOM! Let the browser parse the HTML for you (shove it into a hidden div if you must) and navigate the resulting DOM tree.

James Emerton 2009-05-06 22:22:25

Answer 2

+5 A:

Don't parse HTML with regexes. Instead, use a real HTML parser.

Sorry if my answer feels insubstantial, but this question is asked almost every day, and your requirements are (in my opinion) far too complicated for regular expressions.

Also, none of your tags are closed. You should probably write that like this:

<p>paragraph goes here</p>

<li>goes here</li>
<li>list item 2</li>
<li>list item 3</li>

<p>another paragraph</p>

My HTML may be off, but you should really close all your tags.

Chris Lutz 2009-05-06 22:22:54

Answer 3

+1 A:

I agree with James and Chris. In general it is much better to use a proper parser; I've seen people fail badly doing it any other way. (I'm assuming you don't have full control over the HTML input here. If you did, a shortcut like a regex might work fine.)

Let's assume you're using Java for the moment. If you know that your input is valid XHTML instead of HTML, you can use the Java API for XML Processing (JAXP), which comes with the Sun Java JDK. Then in a few lines you can parse your XHTML into a DOM tree and reach down to pick out the list's node and do whatever you like with it. There's a learning curve to JAXP, but it's well worth it.
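The parse-then-navigate idea can be sketched in a few lines. This version uses Python's standard xml.etree.ElementTree rather than JAXP, and it assumes the input is well-formed XHTML with no namespaces. The sample markup and the name find_list_items are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed XHTML fragment (every tag closed, single root).
XHTML = """<body>
<p>paragraph goes here</p>
<ul>
<li>goes here</li>
<li>list item 2</li>
<li>list item 3</li>
</ul>
<p>another paragraph</p>
</body>"""


def find_list_items(markup):
    """Return the text of every <li> directly inside the first <ul>."""
    root = ET.fromstring(markup)   # raises ParseError on malformed input
    ul = root.find(".//ul")        # first <ul> anywhere below the root
    if ul is None:
        return []
    return [li.text for li in ul.findall("li")]


if __name__ == "__main__":
    print(find_list_items(XHTML))
```

Because the parser raises an error on unclosed tags, this only works once the markup really is XML.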

If you are using Groovy, there's XmlSlurper. Ruby has several good XML libraries. PHP has the XML Parser extension. Python has Beautiful Soup. Pretty much any modern language has good alternatives to choose from.

Now based on your example, you don't have properly XML-ized XHTML, but wild-and-wooly HTML with unclosed tags and other nasties. If that's the case, you'll need to grab an HTML parser library, something on the order of HTMLParser. Good luck!
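For tag soup like that, here is a rough sketch using Python's standard html.parser module, which tolerates unclosed tags. It records where each run of li elements starts and where a block-level tag ends it. The class name ListRunFinder, the helper find_list_runs, and the short BLOCK_TAGS set are assumptions for illustration, not a complete list of block elements.

```python
from html.parser import HTMLParser

# Illustrative, incomplete set of block-level tags that end a run of <li>.
BLOCK_TAGS = {"p", "div", "form", "table", "h1", "h2", "h3"}


class ListRunFinder(HTMLParser):
    """Record (start, end) positions of each run of consecutive <li> items.

    Positions are the (line, column) pairs reported by HTMLParser.getpos().
    """

    def __init__(self):
        super().__init__()
        self.runs = []       # completed (start, end) pairs
        self._start = None   # position of the first <li> in the open run

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            if self._start is None:
                self._start = self.getpos()
        elif tag in BLOCK_TAGS and self._start is not None:
            # A block-level tag implicitly ends the list.
            self.runs.append((self._start, self.getpos()))
            self._start = None

    def close(self):
        super().close()
        if self._start is not None:
            # Input ended while a list was still open.
            self.runs.append((self._start, self.getpos()))
            self._start = None


def find_list_runs(html):
    """Parse html and return the list of (start, end) position pairs."""
    finder = ListRunFinder()
    finder.feed(html)
    finder.close()
    return finder.runs


SAMPLE = (
    "<p>paragraph goes here\n"
    "<li>goes here\n"
    "<li>list item 2\n"
    "<li>list item 3\n"
    "<p>another paragraph\n"
)

if __name__ == "__main__":
    print(find_list_runs(SAMPLE))
```

The start position is where an opening list tag would go and the end position is where the closing one would go. Handling a container that closes around the list would need handle_endtag as well.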

Jim Ferrans 2009-05-06 22:44:45

Answer 4

A:

Assuming all elements have end tags, that nobody got clever by adding spaces inside start or end tags, and that some elements precede the list items, all you have to do is something like this (in Perl syntax, probably compatible with a PCRE library, minus the m// operator):

m/(?<!li)>[^<]*<li/i

to identify the first list item in a group. Exploded (with the x flag, for readability):

m/
    (?<!li)> # the end of a start or end tag that isn't part of an li element
    [^<]*    # some non-angle-bracket characters -- in-between tag content
    <li      # the beginning of an li element
/xi          # space insensitive, case insensitive (respectively)
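For anyone not working in Perl, here is the same pattern as a small Python sketch, with re.VERBOSE standing in for the x flag. The sample input and the name first_item_offsets are invented for illustration, and the same assumptions apply (closed tags, no stray spaces inside tags, something precedes the list).

```python
import re

# Python translation of the Perl pattern; re.VERBOSE plays the role of /x.
FIRST_ITEM = re.compile(
    r"""
    (?<!li)>   # end of a start or end tag that isn't part of an li element
    [^<]*      # some non-angle-bracket characters between tags
    <li        # the beginning of an li element
    """,
    re.VERBOSE | re.IGNORECASE,
)

SAMPLE = (
    "<p>paragraph goes here</p>\n"
    "<li>goes here</li>\n"
    "<li>list item 2</li>\n"
    "<p>another paragraph</p>\n"
    "<li>second list</li>\n"
)


def first_item_offsets(html):
    """Return the offset of the '<' opening the first <li> of each group."""
    return [m.end() - len("<li") for m in FIRST_ITEM.finditer(html)]


if __name__ == "__main__":
    print(first_item_offsets(SAMPLE))
```

Each offset is where an opening list tag could be inserted. Note that the bare `<li` would also match the start of a tag such as `<link`.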

You can then scan the following block, reasonably confident that nothing sits between the list items, until you reach its end; save that position and apply the pattern again to find the next list.

Figuring out where it ends is trickier without a parser. You could use something like (this is abridged)

m/(?<=<li).*?<(div|form|p)/i

where you list all the non-inline elements, any of which will cause the li and ul to be closed and so end the overall list. But the other way for the list to close implicitly is for the container to close.
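A hedged Python sketch of that end-of-list search follows. The terminator list stays abridged, re.DOTALL is my addition so that `.` can span line breaks, and the name list_end_offset is hypothetical.

```python
import re

# Abridged terminator list, as in the pattern above; a real one would name
# every block-level element. re.DOTALL is an addition so '.' spans newlines.
LIST_END = re.compile(r"(?<=<li).*?<(div|form|p)", re.IGNORECASE | re.DOTALL)


def list_end_offset(html, start=0):
    """Return the offset of the '<' of the tag that ends the list, or None."""
    m = LIST_END.search(html, start)
    # Group 1 is the tag name, so the '<' sits one character before it.
    return m.start(1) - 1 if m else None


if __name__ == "__main__":
    sample = "<li>a\n<li>b\n<p>after"
    print(list_end_offset(sample))
```

It returns None when no listed element follows, which is the case where the list is closed by its container instead.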

If the list-item elements themselves are well-formed (have closing tags), then this might be sufficient for placing the list's closing tag:

m{</li>[^<]*<(?!li)}i
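Here is a small Python sketch of that last idea, again under the assumption that every li has a closing tag. Note that it uses `[^<]*` between the closing tag and the next tag, since a lazy `.*?` could run past a following `<li>` and stop at the next `</li>`. The name list_close_offsets and the sample input are made up for illustration.

```python
import re

# A closing </li>, then text only (no tags), then a tag that is not an li.
LAST_ITEM_END = re.compile(r"</li>[^<]*<(?!li)", re.IGNORECASE)

SAMPLE = (
    "<p>paragraph goes here</p>\n"
    "<li>goes here</li>\n"
    "<li>list item 2</li>\n"
    "<p>another paragraph</p>\n"
    "<li>second list</li>\n"
    "<p>end</p>\n"
)


def list_close_offsets(html):
    """Return offsets just past each </li> not followed by another <li>."""
    return [m.start() + len("</li>") for m in LAST_ITEM_END.finditer(html)]


if __name__ == "__main__":
    print(list_close_offsets(SAMPLE))
```

Each offset is a candidate spot for the list's closing tag. A list that runs to the very end of the input has no following tag and is not matched.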

Anonymous 2009-05-07 01:37:47
