good morning! i am using c# (.net framework 3.5 sp1) and want to parse the following piece of html via regex:
<h1>My caption</h1>
<p>Here will be some text</p>
<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>
<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>
<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>
i need the following output:
- group 1: content of the h1
- group 2: the text following the h1
- group 3-n: content of subcaptions + text
what i have atm:
<hr.*?/>
<h2.*?>(.*?)</h2>
([\W\S]*?)
<hr.*?/>
this will give me only every odd subcaption + content (e.g. 1, 3, ...), because the trailing <hr/> is consumed by the match and is then no longer available as the leading <hr/> of the next match. for parsing the h1 caption i have another pattern, <h1.*?>(.*?)</h1>, which only gives me the caption but not the content - i'm fine with that atm.
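for reference, here is a minimal sketch of one variation i am looking at. it is an assumption on my side, not a verified fix for real pages: the trailing <hr.*?/> is wrapped in a lookahead so it is not consumed, and \z is added as an alternative so the last section (which has no trailing hr) can still match. the \s* between the tags and the names SectionRegexDemo and Sections are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class SectionRegexDemo
{
    // Same pieces as the pattern above, but the trailing <hr .../> sits in a
    // lookahead (?=...) so it stays available as the start of the next match.
    // "|\z" lets the final section match, since nothing follows it.
    // The \s* between <hr/> and <h2> is an assumption about the whitespace.
    const string Pattern =
        @"<hr.*?/>\s*<h2.*?>(.*?)</h2>([\W\S]*?)(?=<hr.*?/>|\z)";

    // Returns one { caption, content } pair per <hr/><h2>...</h2> section.
    public static List<string[]> Sections(string html)
    {
        var result = new List<string[]>();
        foreach (Match m in Regex.Matches(html, Pattern))
        {
            result.Add(new[] { m.Groups[1].Value, m.Groups[2].Value.Trim() });
        }
        return result;
    }

    static void Main()
    {
        // The sample html from the question, joined with \n.
        string html =
            "<h1>My caption</h1>\n<p>Here will be some text</p>\n" +
            "<hr class=\"cs\" />\n<h2 id=\"x\">CaptionX</h2>\n<p>Some text</p>\n" +
            "<hr class=\"cs\" />\n<h2 id=\"x\">CaptionX</h2>\n<p>Some text</p>\n" +
            "<hr class=\"cs\" />\n<h2 id=\"x\">CaptionX</h2>\n<p>Some text</p>";

        foreach (string[] section in Sections(html))
        {
            Console.WriteLine(section[0] + " => " + section[1]);
        }
    }
}
```

on the sample above this returns all three subcaption/content pairs, but it still relies on every hr being written as a self-closing <hr ... />, so it stays as fragile as any regex over html.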
does anybody have a hint/solution for me, or an alternative approach (e.g. parsing the html via a reader and assigning the groups that way)? any help would be appreciated - thanks in advance!
edit:
as some brought up HTMLAgilityPack, i was curious about this nice tool. i managed to get the content of the <h1> tag.
but ... my problem is parsing the rest. the cause: the tags for the content may vary - from <p> to <div> and <ul> ...
atm this seems to come down to more or less iterating over the whole document and parsing it tag by tag ...?
any hints?
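
to make the "tag by tag" idea concrete, this is a rough sketch of what i imagine. it assumes the h1, the hr separators, the h2 captions and the content tags are all siblings under the same parent, which is true for my sample but not guaranteed in general. SectionWalker and Split are hypothetical names; the HtmlAgilityPack members used (LoadHtml, SelectSingleNode, NextSibling, Name, InnerText, OuterHtml) are the only library calls.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using HtmlAgilityPack;

static class SectionWalker
{
    // Walks the siblings starting at the first <h1>. Every <h1>/<h2> opens a
    // new section, <hr> separators are skipped, and any other node (<p>,
    // <div>, <ul>, text, ...) is appended to the current section's content.
    public static List<KeyValuePair<string, string>> Split(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var sections = new List<KeyValuePair<string, string>>();
        string caption = null;
        var body = new StringBuilder();

        HtmlNode start = doc.DocumentNode.SelectSingleNode("//h1");
        for (HtmlNode node = start; node != null; node = node.NextSibling)
        {
            if (node.Name == "h1" || node.Name == "h2")
            {
                // Close the previous section before starting a new one.
                if (caption != null)
                {
                    sections.Add(new KeyValuePair<string, string>(
                        caption, body.ToString().Trim()));
                }
                caption = node.InnerText;
                body.Length = 0; // StringBuilder.Clear() is not in 3.5
            }
            else if (node.Name != "hr")
            {
                body.Append(node.OuterHtml);
            }
        }

        // Flush the last section, which has no following caption.
        if (caption != null)
        {
            sections.Add(new KeyValuePair<string, string>(
                caption, body.ToString().Trim()));
        }
        return sections;
    }
}
```

the first entry would be the h1 caption with its text (groups 1 and 2), the remaining entries the subcaptions with their content, whatever tags that content uses.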