Let's presume we have something like this:

<div1>
    <h1>text1</h1>
    <h1>text2</h1>
</div1>
<div2>
    <h1>text3</h1>
</div2>

Using a regular expression, I need to capture text1 and text2 but not text3.

How can I do this?

Thanks in advance.

EDIT: This is just an example. The text I'm parsing could be just plain text. The main thing I want to accomplish is to list all strings from a specific section of a document. I gave this HTML code as an example because it closely resembles what I need to get.

(?siU)<h1>(.*)</h1> would match all three strings, but how do I get only the first two?

EDIT2: Here is another rather dumb example. :)

Section1

This is a "very" nice sentence.
It has "just" a few words.

Section2

This is "only" an example.

The End

I need the quoted words from the first section but not from the second.

Yet again, (?siU)"(.*)" returns quoted words from the whole text, and I need only those between the words Section1 and Section2.

This is for the "Rainmeter" application, which apparently uses Perl regex syntax.

I'm sorry, but I can't explain it better. :)

+2  A: 

Use a DOM library: call getElementsByTagName('div') and you'll get a nodeList back. Reference the first item with ->item(0), then call getElementsByTagName('h1') using that div as the context node, and grab each heading's text with the ->nodeValue property.
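As a rough illustration of the steps above, here is a minimal Python sketch using the standard library's xml.dom.minidom. It assumes plain `div` tags (not `div1`/`div2`) and wraps the sample in a hypothetical `<root>` element, because minidom requires a single root; neither detail comes from the question.

```python
from xml.dom import minidom

# Assumption: plain <div> tags, wrapped in a hypothetical <root> element
# so that the sample is well-formed XML.
markup = """<root>
<div>
    <h1>text1</h1>
    <h1>text2</h1>
</div>
<div>
    <h1>text3</h1>
</div>
</root>"""

doc = minidom.parseString(markup)

# The first <div> becomes the context node for the second lookup.
first_div = doc.getElementsByTagName("div")[0]

# Each <h1> here holds a single text node; read its value.
headings = [h1.firstChild.nodeValue
            for h1 in first_div.getElementsByTagName("h1")]

print(headings)  # ['text1', 'text2']
```

This collects every `h1` inside the first `div`, however many there are, rather than a fixed number of them.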

meder
Ah, but he did not use `div` tags. He used `div1` and `div2` (etc.?). :)
Brock Adams
I take it he meant to do `div` but provided the numbers to indicate first, second. And he could also just do `getElementsByTagName` on h1 and grab the first 2 nodeValues in the nodeList.
meder
Since the number of `h1`s varies and I need all of them, grabbing only the first two isn't a solution. As for the `div1` and `div2` confusion, look at the second example to see what I need. :)
mmatz
+1  A: 

For the general case of the two examples provided -- for use in Rainmeter regex -- you can use:

(?siU)<h1>(.*)</h1>(?=.+<div2>) for the first sample and

(?siU)"(.*)"(?=.+Section2) for the second.
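To see how these two patterns behave on the samples from the question, here is a small Python sketch. Python's re module has no U (ungreedy) flag, so the quantifiers are written as lazy (`.*?`, `.+?`) instead; that translation is an assumption made for this sketch, and it has not been checked against Rainmeter itself.

```python
import re

html_sample = """<div1>
    <h1>text1</h1>
    <h1>text2</h1>
</div1>
<div2>
    <h1>text3</h1>
</div2>
"""

text_sample = """Section1

This is a "very" nice sentence.
It has "just" a few words.

Section2

This is "only" an example.

The End
"""

# (?si): dot matches newlines, case-insensitive. The lookahead only
# succeeds while the end marker still lies somewhere ahead of the match.
html_hits = re.findall(r'(?si)<h1>(.*?)</h1>(?=.+?<div2>)', html_sample)
text_hits = re.findall(r'(?si)"(.*?)"(?=.+?Section2)', text_sample)

print(html_hits)  # ['text1', 'text2']
print(text_hits)  # ['very', 'just']
```

The lookahead consumes no characters, so each match stays as short as it would be without it; it only filters out matches that have no end marker after them.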

Note that Rainmeter seems to escape things for you, but you might need to change " to \" in the patterns above.

Both use a positive lookahead, but beware: both solutions will fail in the case of nested tags/structures or if there are multiple Section1's and Section2's. Regex is not the best tool for this kind of parsing.

But maybe this is good enough for your current needs?
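If repeated sections do turn up, a sketch like the following shows the failure with a lookahead-based pattern and one possible workaround: a two-step extraction that first isolates each Section1...Section2 block and then pulls the quoted words out of it. This is a different technique from the single-pattern approach, written in Python rather than for Rainmeter, and the sample text and the helper name `quoted_in_sections` are made up for illustration.

```python
import re

# Hypothetical input with repeated sections.
text = """Section1
A "first" word.
Section2
An "unwanted" word.
Section1
A "second" word.
Section2
A "trailing" word.
"""

# Lookahead approach: "unwanted" slips through, because a later
# Section2 still follows it.
lookahead_hits = re.findall(r'(?s)"(.*?)"(?=.+?Section2)', text)

def quoted_in_sections(source, start="Section1", end="Section2"):
    """Return quoted words found between each start/end marker pair."""
    block_pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    words = []
    # Step 1: isolate each block. Step 2: extract quoted words from it.
    for block in re.findall(block_pattern, source, flags=re.S):
        words.extend(re.findall(r'"(.*?)"', block))
    return words

print(lookahead_hits)            # ['first', 'unwanted', 'second']
print(quoted_in_sections(text))  # ['first', 'second']
```

Splitting the work in two keeps each pattern simple, at the cost of needing a host language that can loop over the intermediate matches.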

Brock Adams
The thing is, there are nested tags, and your solution doesn't work. But by modifying it I managed to solve my problem. `(?siU)<h1>(.*)</h1>.*(?=.+<div2>)` will work even if there are nested tags/structures. Thank you very much. I wouldn't have been able to do it without your help. :D
mmatz
@mmatz: Glad to help.
Brock Adams