How do I parse HTML using regular expressions in C#?

For example, given HTML code

<s2> t1 </s2>  <img src='1.gif' />  <span> span1 <span/>

I am trying to obtain

1. <s2>
2. t1
3. </s2>
4. <img src='1.gif' />
5. <span>
6. span1
7. <span/>

How do I do this using regular expressions in C#?

In my case, the HTML input is not well-formed XML (it is not XHTML), so I cannot use an XML parser for this.

+6  A: 

Regular expressions are a very poor way to parse HTML. If you can guarantee that your input will be well-formed XML (i.e. XHTML), you can use XmlReader to read the elements and then print them out however you like.

bobbymcr
In my case, the input is NOT well-formed XML.
Mike108
Then you're in for a very complex problem in general: HTML parsing, with all of its implied elements, optional end tags, and so on, is no fun. However, you might be able to leverage an existing library such as the Html Agility Pack: http://www.codeplex.com/htmlagilitypack
bobbymcr
No, regular expressions are *not* "a poor way to parse HTML", because that would imply that regular expressions can parse HTML *at all*, which is not the case. It is mathematically proven that regular expressions *cannot* parse HTML. In fact, pretty much every college student has to prove this at some point during a homework assignment or exam or something.
Jörg W Mittag
Heh, fair enough.
bobbymcr
+3  A: 

You might want to try the Html Agility Pack, http://www.codeplex.com/htmlagilitypack. It even handles malformed HTML.

nickyt
A: 

You might want to simply use string functions, treating < and > as the delimiters for parsing.
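
A minimal sketch of that idea, assuming every < starts a tag and the next > ends it. The class and method names (AngleBracketSplitter, Split) are made up for illustration, and the code makes no attempt to handle comments, scripts, or > characters inside attribute values.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: splits markup into "<...>" pieces and the text between them.
public static class AngleBracketSplitter
{
    public static List<string> Split(string html)
    {
        var tokens = new List<string>();
        int pos = 0;
        while (pos < html.Length)
        {
            int open = html.IndexOf('<', pos);
            if (open < 0) open = html.Length; // no more tags: the rest is text

            // Text between the previous piece and the next '<'.
            AddIfNotBlank(tokens, html.Substring(pos, open - pos));
            if (open == html.Length) break;

            int close = html.IndexOf('>', open);
            if (close < 0)
            {
                // Unterminated '<': keep the remainder as a single piece.
                AddIfNotBlank(tokens, html.Substring(open));
                break;
            }

            tokens.Add(html.Substring(open, close - open + 1));
            pos = close + 1;
        }
        return tokens;
    }

    // Whitespace-only runs between tags are dropped; other text is trimmed.
    private static void AddIfNotBlank(List<string> tokens, string piece)
    {
        string trimmed = piece.Trim();
        if (trimmed.Length > 0) tokens.Add(trimmed);
    }
}
```

Calling AngleBracketSplitter.Split on the sample input from the question should yield the seven pieces listed there, with surrounding whitespace trimmed from the text pieces.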

junmats
+4  A: 

This has already been answered literally dozens of times, but it bears repeating: regular expressions can only parse regular languages; that is why they are called regular expressions. HTML is not a regular language (as probably every college student in the last decade has proved at least once), and therefore cannot be parsed by regular expressions.
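
A small illustration of the nesting problem, not a proof: a simple pattern for one element has to choose between stopping at the first closing tag and stopping at the last one, and neither choice tracks how deeply the tags are nested. The class and method names below are hypothetical.

```csharp
using System.Text.RegularExpressions;

// Hypothetical demo: neither a lazy nor a greedy pattern pairs <div> tags correctly.
public static class NestingDemo
{
    // Lazy quantifier: stops at the first </div> it can reach.
    public static string FirstDivLazy(string html)
    {
        return Regex.Match(html, "<div>.*?</div>").Value;
    }

    // Greedy quantifier: runs on to the last </div> in the input.
    public static string FirstDivGreedy(string html)
    {
        return Regex.Match(html, "<div>.*</div>").Value;
    }
}
```

For the nested input <div>a<div>b</div>c</div>, FirstDivLazy returns <div>a<div>b</div>, which cuts the outer element short. For the sibling input <div>a</div><div>b</div>, FirstDivGreedy returns the whole string, which merges two separate elements into one match.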

Jörg W Mittag
A: 

I used this regex in C#, and it works. Thanks for all your answers.

<([^<]*)>|([^<]*)
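
One way this pattern might be applied, as a sketch rather than a record of how it was actually used: Regex.Matches walks the input, and whitespace-only or empty matches are skipped (the second alternative can match an empty string). The class and method names are made up for illustration.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the pattern above.
public static class RegexSplitter
{
    public static List<string> Split(string html)
    {
        var tokens = new List<string>();
        foreach (Match m in Regex.Matches(html, @"<([^<]*)>|([^<]*)"))
        {
            // Trim text pieces and drop matches that are empty or only whitespace.
            string piece = m.Value.Trim();
            if (piece.Length > 0) tokens.Add(piece);
        }
        return tokens;
    }
}
```

On the sample input from the question, RegexSplitter.Split should return the seven pieces listed there. Because [^<]* also matches >, a text run that contains a literal > (for example <b> 1 > 0 </b>) is folded into the preceding tag match, so this only suits input as simple as the sample.
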
Mike108
It works with the data you've tested it with. If that's all the data you ever need to process with it, then fine.
Robert Rossney
If not: now you've got two problems.
Peter Hoffmann
<!-- <b>Your regex will not work with HTML comments</b> -->
DrJokepu