good morning! i am using C# (.NET Framework 3.5 SP1) and want to parse the following piece of html via regex:

<h1>My caption</h1>
<p>Here will be some text</p>

<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>

<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>

<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>

i need the following output:

  • group 1: content of the h1
  • group 2: the text following the h1
  • groups 3-n: content of each subcaption (h2) + its text

what i have atm:

<hr.*?/>
<h2.*?>(.*?)</h2>
([\W\S]*?)
<hr.*?/>

this gives me only every odd subcaption + content (e.g. 1, 3, ...), because the trailing <hr.*?/> consumes the separator that the next match would need to start with. for parsing the h1 caption i have another pattern (<h1.*?>(.*?)</h1>), which only gives me the caption but not the content - i'm fine with that atm.
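As an illustration of the problem described above (not the asker's code): one way to stop the pattern from swallowing the separator is to check for the next <hr> with a lookahead, `(?=...)`, which tests for it without consuming it. The sketch below assumes the headings are only h1/h2 and that each section ends at the next `<hr ... />` or at the end of the input; the class name `SectionRegexSketch` and the method `ExtractSections` are made up for this example. It is still regex over HTML, so it will break on nested or irregular markup.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SectionRegexSketch
{
    // Heading (h1 or h2), then a lazy body that stops right before the next
    // <hr ... /> or at the end of the input. The lookahead does not consume
    // the <hr>, so no section is skipped.
    private static readonly Regex SectionPattern = new Regex(
        @"<h[12][^>]*>(?<caption>.*?)</h[12]>(?<body>.*?)(?=<hr[^>]*/>|\z)",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Returns (caption, body) pairs in document order.
    public static List<KeyValuePair<string, string>> ExtractSections(string html)
    {
        var sections = new List<KeyValuePair<string, string>>();
        foreach (Match m in SectionPattern.Matches(html))
        {
            sections.Add(new KeyValuePair<string, string>(
                m.Groups["caption"].Value.Trim(),
                m.Groups["body"].Value.Trim()));
        }
        return sections;
    }
}
```

With this shape the h1 and its text come back as the first pair and every h2 section follows, so the separate h1 pattern would not be needed. `RegexOptions.Singleline` is what lets `.` cross line breaks, which replaces the `[\W\S]` trick.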

does anybody have a hint/solution for me, or an alternative approach (e.g. parsing the html via a reader and assigning the parts that way)? any help would be appreciated - thanks in advance!

edit:
as some suggested the HTML Agility Pack, i got curious about this nice tool and managed to get the content of the <h1> tag.
but ... my problem is parsing the rest, because the tags used for the content may vary - from <p> to <div> and <ul>. atm it seems i more or less have to iterate over the whole document and parse it tag by tag ...? any hints?

+6  A: 

You will really need an HTML parser for this.

S.Mark
+5  A: 

Don't use regex to parse HTML. Consider using the HTML Agility Pack.

Mark Byers
it's a bit hard for me to parse this piece with HTMLAgilityPack, as i do not know which elements the content areas use (sometimes `<ul>`, sometimes `<p>`, and sometimes simply `<div>`). can you give me some hooks? :)
Andreas Niedermair
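
One possible hook, as a rough sketch rather than a definitive answer: since the content tags vary, it may be easier not to look for them at all and instead walk the sibling nodes once, starting a new section at every h1/h2 and skipping the hr separators. Everything in between is collected as-is, whether it is `<p>`, `<div>` or `<ul>`. The sketch assumes the HtmlAgilityPack assembly is referenced and that headings and content are direct children of the same container (the `<body>` if present, otherwise the document root); `SectionWalkSketch` and `ExtractSections` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using HtmlAgilityPack; // third-party library, must be referenced separately

public static class SectionWalkSketch
{
    // Returns (caption, body html) pairs in document order.
    public static List<KeyValuePair<string, string>> ExtractSections(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Assumption: sections live directly under <body>, or under the
        // document root when the input is only a fragment.
        HtmlNode container = doc.DocumentNode.SelectSingleNode("//body") ?? doc.DocumentNode;

        var sections = new List<KeyValuePair<string, string>>();
        string caption = null;
        var body = new StringBuilder();

        foreach (HtmlNode node in container.ChildNodes)
        {
            if (node.Name == "h1" || node.Name == "h2")
            {
                // A heading closes the previous section and opens a new one.
                if (caption != null)
                    sections.Add(new KeyValuePair<string, string>(caption, body.ToString().Trim()));
                caption = node.InnerText.Trim();
                body = new StringBuilder();
            }
            else if (node.Name == "hr")
            {
                continue; // separator only, carries no content
            }
            else if (caption != null)
            {
                // Any other node (p, div, ul, plain text) belongs to the current section.
                body.Append(node.OuterHtml);
            }
        }

        if (caption != null)
            sections.Add(new KeyValuePair<string, string>(caption, body.ToString().Trim()));
        return sections;
    }
}
```

If the plain text is wanted instead of the markup, appending `node.InnerText` rather than `node.OuterHtml` would be the obvious variation.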