views:

66

answers:

3

Hello!

I'm trying to figure out a way to strip all HTML tags from records in a database and then create XML from the result.

Any ideas?

It's built on ASP.NET 2.0 with SQL Server.

+1  A: 

Check this question: "Using C# regular expressions to remove HTML tags". What exactly do you mean by creating XML?

Shoban
Well, we need to deliver an XML feed of all our products to a vendor, and they want us to strip out all the HTML. So I'm wondering if there is an easy way to do that?
jrutter
A: 

Why not parse the page into a DOM tree and then walk the elements, pulling out the values you need along with any attributes you consider necessary?

If you wrote the HTML files yourself, they should be well-formed, so this would be easy.
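As a rough illustration of the parse-then-extract idea, here is a minimal sketch in Python rather than the asker's ASP.NET stack. It uses the standard library's event-based `html.parser` instead of building a full DOM tree, which is a swapped-in simplification; the names `TextExtractor` and `strip_html` are made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags, scripts, and styles."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # Start ignoring text while inside <script> or <style>.
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that is outside skipped elements.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace into single spaces.
        return " ".join("".join(self._chunks).split())


def strip_html(fragment):
    """Return the text content of an HTML fragment (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(fragment)
    parser.close()
    return parser.text()
```

Because the parser decodes entities such as `&amp;` along the way, the result is plain text rather than escaped markup. Note that adjacent block elements are joined without a separator in this sketch, so you may want to insert a space or newline on tags like `p` and `br`.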

James Black
I like this answer. You need to implement DOM objects and a parser; after all, HTML is some sort of XML. You basically need to convert the HTML tags to XML tags, so while you are parsing it you can replace the HTML tags with XML tags.
A: 

Don't strip the HTML in the database or with SQL. Instead, strip it out at the last mile, in your application code, with a scraper.

Google this: "HTML Scraper". HTML screen scraping tools read HTML content and output the content, less the HTML. Or, alternatively, Stack Overflow this: "Screen-scraping HTML".
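To make the "last mile" idea concrete, here is a small sketch in Python (again only for illustration, not ASP.NET) that strips markup in application code and then builds the XML feed. The row layout and the element names `products`, `product`, `name`, and `description` are assumptions, since the thread does not say what the vendor's format looks like.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Keeps text nodes and drops every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def to_plain_text(html_fragment):
    """Strip tags from a fragment and normalize whitespace (hypothetical helper)."""
    parser = _TextOnly()
    parser.feed(html_fragment)
    parser.close()
    return " ".join("".join(parser.parts).split())


def build_feed(rows):
    """Build an XML string from (id, name, description_html) rows.

    The rows stand in for whatever the database query returns.
    """
    root = ET.Element("products")
    for product_id, name, description_html in rows:
        product = ET.SubElement(root, "product", id=str(product_id))
        ET.SubElement(product, "name").text = to_plain_text(name)
        ET.SubElement(product, "description").text = to_plain_text(description_html)
    # ElementTree escapes characters such as & and < when serializing.
    return ET.tostring(root, encoding="unicode")
```

Letting an XML library serialize the text, instead of concatenating strings, means any `&` or `<` left in the stripped text is escaped correctly in the feed.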

Mike Atlas
Don't tell him to Google this (even if that's what he should have done); point him straight here at Stack Overflow for that ;) http://stackoverflow.com/search?q=html+scraper
voyager