HTML:

<dt>
    <a href="#profile-experience" >Past</a>
</dt>
<dd>
    <ul class="past">
        <li>
            President, CEO &amp; Founder <span class="at">at</span> China Connection
        </li>
        <li>
            Professional Speaker and Trainer <span class="at">at</span> Edgemont Enterprises
        </li>
        <li>
            Nurse &amp; Clinic Manager <span class="at">at</span> <span>USAF</span>
        </li>
    </ul>
</dd>

I want to match the <li> nodes. I wrote this regex:

<dt>.+?Past+?</dt>\s+?<dd>\s+?<ul class=""past"">\s+?(?:<li>\s*?([\W\w]+?)+?\s*?</li>)+\s+?</ul>

But it does not work.

+2  A: 

Do not parse HTML with a regex as if it were just a big pile of text. Using a DOM parser is the proper way.
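As a rough illustration of the parser route, here is a minimal sketch in Python (an assumption; the question does not name a language in its body) using the standard library's `html.parser`. It is an event-based parser rather than a full DOM, and the class name `PastItems` and helper `past_items` are made up for this example. It does not handle nested lists.

```python
from html.parser import HTMLParser

HTML = """<dt>
    <a href="#profile-experience" >Past</a>
</dt>
<dd>
    <ul class="past">
        <li>
            President, CEO &amp; Founder <span class="at">at</span> China Connection
        </li>
        <li>
            Professional Speaker and Trainer <span class="at">at</span> Edgemont Enterprises
        </li>
        <li>
            Nurse &amp; Clinic Manager <span class="at">at</span> <span>USAF</span>
        </li>
    </ul>
</dd>"""


class PastItems(HTMLParser):
    """Collect the text of every <li> inside <ul class="past"> (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turns &amp; into & for us
        self.in_list = False  # True while inside <ul class="past">
        self.in_item = False  # True while inside an <li> of that list
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "ul" and ("class", "past") in attrs:
            self.in_list = True
        elif tag == "li" and self.in_list:
            self.in_item = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "ul":
            self.in_list = False
        elif tag == "li":
            self.in_item = False

    def handle_data(self, data):
        # Text inside nested <span> tags also arrives here, so it is kept.
        if self.in_item:
            self.items[-1] += data


def past_items(html):
    parser = PastItems()
    parser.feed(html)
    parser.close()
    # Collapse the indentation and line breaks inside each item.
    return [" ".join(text.split()) for text in parser.items]


print(past_items(HTML))
```

The parser decodes entities and ignores the inner `<span>` markup, which is the part a single regex tends to get wrong.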

teukkam
+2  A: 

Don't use regular expressions to parse HTML...

Alex Martelli
A: 

Please learn to use jQuery for this sort of thing.

Scott Evernden
I don't see any suggestion in that question that JavaScript is being used, and even if there was, "use jQuery" is a rubbish answer which would need to be more specific.
David Dorward
hmmmm .. rubbish eh ? .. fascinating
Scott Evernden
"My engine is giving off steam!" "Use a spanner".
David Dorward
Please, you are kidding me. He asked exactly 'I want match the <li> node.' That's precisely what jQuery is designed to do: match nodes. Look at all the other answers indicating he should process the DOM rather than use a regex. What's jQuery designed for, eh?
Scott Evernden
+1  A: 

Don't use a regular expression to match an HTML document. It is better to parse it as a DOM tree, or to walk it with a simple state machine.

I'm assuming you're trying to get HTML list items. Since you haven't said which language you use, here's a little pseudo code to get you going:

Pseudo code:

while (iterating through the text)

    if (<li> matched)

        find the position of the next </li>
        store the substring between <li> and </li> in a variable

There are of course numerous third-party libraries that do this sort of thing. Depending on your development environment, you might already have a function that does this (e.g. in JavaScript).
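The pseudo code above could be sketched in Python roughly as follows. Python is an assumption here, and `list_items` is a hypothetical name. The sketch only recognises a bare `<li>` tag (no attributes) and does not handle nested lists.

```python
def list_items(text):
    """Return the raw text between each <li> and the next </li>."""
    items = []
    pos = 0
    while True:
        start = text.find("<li>", pos)
        if start == -1:
            break  # no more list items
        start += len("<li>")
        end = text.find("</li>", start)
        if end == -1:
            break  # unclosed <li>: stop rather than guess
        items.append(text[start:end].strip())
        pos = end + len("</li>")
    return items


sample = "<ul><li> one </li><li>two <b>bold</b></li></ul>"
print(list_items(sample))  # ['one', 'two <b>bold</b>']
```

Any markup inside an item is returned untouched, so a second pass is still needed to strip inner tags or decode entities such as `&amp;`.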

Spoike
It's just a string, in .NET/C#.
Dreampuf
Thanks, I would like to do it like this.
Dreampuf
A: 

Which language do you use?

If you use Python, you should try lxml: http://codespeak.net/lxml/. With lxml, you can search for the ul node with class "past", then retrieve its children, which are the li nodes, and get the text of each of them.
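A dependency-free sketch of that idea is below. It swaps in the standard library's `xml.etree.ElementTree`, whose API lxml follows, instead of lxml itself. This only works because the fragment from the question happens to be well-formed XML; real-world HTML usually needs a tolerant parser such as lxml's. The wrapper element `root` and the function name `past_items` are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """<dt>
    <a href="#profile-experience" >Past</a>
</dt>
<dd>
    <ul class="past">
        <li>
            President, CEO &amp; Founder <span class="at">at</span> China Connection
        </li>
        <li>
            Professional Speaker and Trainer <span class="at">at</span> Edgemont Enterprises
        </li>
        <li>
            Nurse &amp; Clinic Manager <span class="at">at</span> <span>USAF</span>
        </li>
    </ul>
</dd>"""


def past_items(fragment):
    # The fragment has two top-level elements (<dt> and <dd>), so wrap it
    # in a single root element before handing it to the XML parser.
    root = ET.fromstring("<root>" + fragment + "</root>")
    items = []
    # Find the <li> children of the <ul> whose class attribute is "past".
    for li in root.findall(".//ul[@class='past']/li"):
        text = "".join(li.itertext())        # includes text of nested <span>s
        items.append(" ".join(text.split()))  # collapse whitespace
    return items


for item in past_items(FRAGMENT):
    print(item)
```

With lxml installed, the same path expression should work after parsing the page with `lxml.html.fromstring`, which also copes with markup that is not valid XML.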

Thanks, but I want to use a regex.
Dreampuf
OK. Do it in two steps. First, extract the text inside the **ul** tags. Then extract each **li** from that text. If you use Python, the code is here: http://pastebin.com/HesVF7zJ
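The two steps might look like the sketch below. This is not the code behind the pastebin link, just one possible reading of the description, and `past_items_regex` is a hypothetical name. It assumes the `ul` tag is written exactly as `<ul class="past">` and that the list contains no nested `ul` or `li` elements.

```python
import re

HTML = """<dd>
    <ul class="past">
        <li>
            President, CEO &amp; Founder <span class="at">at</span> China Connection
        </li>
        <li>
            Professional Speaker and Trainer <span class="at">at</span> Edgemont Enterprises
        </li>
        <li>
            Nurse &amp; Clinic Manager <span class="at">at</span> <span>USAF</span>
        </li>
    </ul>
</dd>"""


def past_items_regex(html):
    # Step 1: grab everything between <ul class="past"> and the next </ul>.
    ul = re.search(r'<ul class="past">(.*?)</ul>', html, re.DOTALL)
    if ul is None:
        return []
    # Step 2: inside that text, capture the body of each <li>, trimming the
    # surrounding whitespace. re.DOTALL lets "." match line breaks.
    return re.findall(r"<li>\s*(.*?)\s*</li>", ul.group(1), re.DOTALL)


for item in past_items_regex(HTML):
    print(item)
```

Each captured item still contains the inner `<span>` markup and the `&amp;` entities, so further cleanup is needed if only the visible text is wanted. Splitting the work keeps each pattern short, which avoids the nested quantifiers of the single-pattern attempt in the question.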
A: 

If you are trying to extract from or manipulate this HTML, then XPath, XSL, or CSS selectors in jQuery might be easier and more maintainable than a regex. What exactly is your goal, and in what framework are you operating?

Peter DeWeese