How can I use a regular expression to extract groups of HTML that are formatted like this:

.

.
    .irrelevant html...
    <b>Question 6</b><br>

lots of text
<p>

lots of text
<p>
<br>

<b>Answer 6</b><br>
lots of text 
<p>

lots of text 
<p>

lots of text 
<p>

more text
<p>
<HR>

<IMG SRC="/images/image.jpg" alt="alt text" width=480 height=360 hspace=2 vspace=2> 
<p>

<i>caption text</i>

There can be a variable number of Question-Answer pairs, and the image code can appear anywhere (either between the Question and the Answer, or after the Answer)...

The only information I want to extract is the Question number, the text without the paragraph HTML tags, and the image's src, alt, and caption.

+1  A: 

I think you should look at some of the options in the question "Is there an Application to Create Regular Expression Out of Text by Selecting Wanted Area?"

ReguLazy looks like a good fit.
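
Whichever tool generates the pattern, the result might look roughly like the Python sketch below. It is only an illustration under some assumptions: every name in it (`BLOCK_RE`, `IMG_RE`, `TAG_RE`, `extract`) is made up for this example, the `src` attribute is assumed to come before `alt` with both double-quoted, and the caption is assumed to be the `<i>...</i>` that directly follows the image (optionally after a `<p>`), as in the sample from the question.

```python
import re

# Shortened version of the sample from the question.
SAMPLE = """
<b>Question 6</b><br>
lots of text
<p>
<br>
<b>Answer 6</b><br>
more text
<p>
<HR>
<IMG SRC="/images/image.jpg" alt="alt text" width=480 height=360 hspace=2 vspace=2>
<p>
<i>caption text</i>
"""

# A heading such as <b>Question 6</b><br>, followed by everything up to the
# next heading or the end of the input.
BLOCK_RE = re.compile(
    r"<b>(Question|Answer)\s+(\d+)</b>\s*<br>(.*?)"
    r"(?=<b>(?:Question|Answer)\s+\d+</b>|\Z)",
    re.I | re.S,
)

# Assumption: src comes before alt, both double-quoted; the caption is optional.
IMG_RE = re.compile(
    r'<img\s+[^>]*?src="([^"]*)"[^>]*?alt="([^"]*)"[^>]*>'
    r"\s*(?:<p>\s*)?(?:<i>(.*?)</i>)?",
    re.I | re.S,
)

# Any remaining tag (<p>, <br>, <HR>, ...).
TAG_RE = re.compile(r"<[^>]+>")


def extract(html):
    """Return one dict per Question/Answer block found in html."""
    results = []
    for kind, number, body in BLOCK_RE.findall(html):
        images = [
            {"src": src, "alt": alt, "caption": caption.strip()}
            for src, alt, caption in IMG_RE.findall(body)
        ]
        text = IMG_RE.sub(" ", body)      # drop the image and its caption
        text = TAG_RE.sub(" ", text)      # drop the remaining tags
        text = " ".join(text.split())     # collapse whitespace
        results.append({
            "kind": kind.capitalize(),
            "number": int(number),
            "text": text,
            "images": images,
        })
    return results


for block in extract(SAMPLE):
    print(block)
```

An image is attached to whichever block it appears in, so it works whether the image sits between the Question and the Answer or after the Answer. Markup that departs from the assumptions above (single-quoted attributes, `alt` before `src`) would need a looser pattern.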

Sijin
+1  A: 

You might want to try using something like Watir. You can then programmatically search through the DOM and find what you need.
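
Watir itself drives a real browser from Ruby. As a rough illustration of the same idea (let a parser deal with the tags instead of a regular expression), here is a minimal sketch that swaps in Python's standard-library `html.parser`, which is an event-based parser rather than a DOM. The names `QAParser` and `parse_qa` are hypothetical, and the sketch assumes headings are exactly `<b>Question N</b>` or `<b>Answer N</b>` and that the first `<i>` text after an image is its caption.

```python
from html.parser import HTMLParser

# Shortened version of the sample from the question.
SAMPLE = """
.irrelevant html...
<b>Question 6</b><br>
lots of text
<p>
<br>
<b>Answer 6</b><br>
more text
<p>
<HR>
<IMG SRC="/images/image.jpg" alt="alt text" width=480 height=360 hspace=2 vspace=2>
<p>
<i>caption text</i>
"""


class QAParser(HTMLParser):
    """Collects Question/Answer sections and images (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sections = []   # one dict per Question/Answer heading
        self.images = []     # one dict per <img> tag
        self._in_b = False
        self._in_i = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "b":
            self._in_b = True
        elif tag == "i":
            self._in_i = True
        elif tag == "img":
            attrs = dict(attrs)
            self.images.append(
                {"src": attrs.get("src"), "alt": attrs.get("alt"), "caption": ""}
            )

    def handle_endtag(self, tag):
        if tag == "b":
            self._in_b = False
        elif tag == "i":
            self._in_i = False

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text:
            return
        if self._in_b:
            kind, _, number = text.partition(" ")
            if kind in ("Question", "Answer") and number.isdigit():
                self.sections.append(
                    {"kind": kind, "number": int(number), "text": []}
                )
                return
        if self._in_i and self.images and not self.images[-1]["caption"]:
            # Assumption: first italic text after an image is its caption.
            self.images[-1]["caption"] = text
        elif self.sections:
            # Text before the first heading is ignored as irrelevant.
            self.sections[-1]["text"].append(text)


def parse_qa(html):
    """Return (sections, images) with each section's text joined into one string."""
    parser = QAParser()
    parser.feed(html)
    parser.close()
    sections = [dict(s, text=" ".join(s["text"])) for s in parser.sections]
    return sections, parser.images


sections, images = parse_qa(SAMPLE)
print(sections)
print(images)
```

Because the parser normalizes tag case and copes with unquoted attributes such as `width=480`, this approach tends to be less brittle than a pattern when the markup varies.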

Joshua Belden