ansaurus

Question

Answer 1

+4 A:

IMHO the easiest way is to use regular expressions. Something like:

string txt = Regex.Replace(htmlString, @"<(.|\n)*?>", string.Empty);

Depending on which tags and characters you want to remove you will modify the regex, of course. You will find a lot of material on this and other methods if you do a web search for 'strip html C#'.

The SO question "Render or convert Html to ‘formatted’ Text (.NET)" might help you, too.

f3lix 2009-03-30 10:30:15
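As a rough illustration of the regex approach in this answer, here is a minimal sketch. The class and method names (`HtmlStripSketch`, `StripTags`) are made up for the example, and the pattern `<[^>]*>` is swapped in as a simpler form that matches the same spans as `<(.|\n)*?>`. It assumes reasonably well-formed markup and is not a full HTML parser.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class HtmlStripSketch
{
    // Matches everything from '<' up to the next '>'. This covers the same
    // spans as @"<(.|\n)*?>" but avoids per-character alternation.
    private static readonly Regex TagPattern = new Regex(@"<[^>]*>");

    // Removes every tag-like span; the text between tags is kept as-is.
    public static string StripTags(string html)
    {
        return TagPattern.Replace(html ?? string.Empty, string.Empty);
    }
}
```

Calling `HtmlStripSketch.StripTags("<div><p>Hello</p> world</div>")` leaves only the text between the tags. Whitespace and character entities such as `&amp;` are left untouched.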

This answer worked, but it basically removes all the characters and then puts a blank space in their place, which, once split into an array, gives a lot of white space in the database. How do I solve that? Also, is there any way of adding a parameter to this to remove characters like \n and \t?

Ash 2009-03-30 10:35:55

Not sure why you're seeing *extra* blank space - string.Empty would be replacing the tags with "", not " ". It's possible that you're not stripping out the excess whitespace (tabs "\t", newlines "\n", etc.) in the RSS - you might want to look at doing a further replace for those, or adding them to the regex.

Zhaph - Ben Duguid 2009-03-30 10:42:28
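The "further replace" for tabs and newlines suggested in the comment above could look something like this. It is only a sketch: the names `WhitespaceSketch`, `Collapse` and `SplitWords` are invented for the example, and it assumes you want runs of whitespace reduced to a single space before storing or splitting the text.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class WhitespaceSketch
{
    // Replaces each run of spaces, tabs and newlines with a single space
    // and trims the ends.
    public static string Collapse(string text)
    {
        return Regex.Replace(text ?? string.Empty, @"\s+", " ").Trim();
    }

    // Splits on whitespace and drops the empty entries that would
    // otherwise end up as blank rows in the database.
    public static string[] SplitWords(string text)
    {
        return (text ?? string.Empty).Split(
            new[] { ' ', '\t', '\r', '\n' },
            StringSplitOptions.RemoveEmptyEntries);
    }
}
```

If the text is split into an array anyway, `StringSplitOptions.RemoveEmptyEntries` alone may be enough, since it discards the empty strings produced by consecutive separators.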

Answer 2

+2 A:

Stripping HTML tags from a given string is a common requirement and you can probably find many resources online that do it for you.

The accepted method, however, is to use a regular-expression-based search and replace. This article provides a good sample along with benchmarks. Another point worth mentioning is that you would need separate regex-based lookups for the different kinds of unwanted characters you are seeing. (Perhaps showing us an example of the HTML you receive would help.)

Note that your requirements may vary based on which tags you want to remove. In your question, you only mention DIV tags. If that is the only tag you need to replace, a simple string search and replace should suffice.

AmitK 2009-03-30 10:37:47
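For the DIV-only case mentioned in this answer, a plain string replace might look like the sketch below. The names are hypothetical. The first method assumes the tags appear exactly as `<div>` and `</div>` (lowercase, no attributes). The second is a fallback, not from the answer itself, for tags that carry attributes or different casing.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class DivOnlySketch
{
    // Plain search and replace: only the exact strings "<div>" and "</div>"
    // are removed. The content between them is kept.
    public static string RemoveBareDivTags(string html)
    {
        return (html ?? string.Empty)
            .Replace("<div>", string.Empty)
            .Replace("</div>", string.Empty);
    }

    // Fallback for opening tags with attributes or mixed case,
    // e.g. <DIV class="x">. Still removes only the tags, not the content.
    public static string RemoveDivTags(string html)
    {
        return Regex.Replace(html ?? string.Empty, @"</?div\b[^>]*>",
            string.Empty, RegexOptions.IgnoreCase);
    }
}
```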

Coool, thanks guys!

Ash 2009-03-30 10:40:36

Lol, I cannot even vote for anyone yet!

AmitK 2009-03-30 10:48:21

+1 so you can start voting sooner ;)

Christian Witts 2009-03-30 11:57:36

Answer 3

+1 A:

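A regular expression such as this:

<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>

would match every HTML element, from its opening tag through the matching closing tag (use a case-insensitive match, since the pattern only lists uppercase letters).

Use this to remove them from your data.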

Jon Winstanley 2009-03-30 10:41:49

Is there a certain order you must put the characters in when writing a regular expression? Is the answer towards the top a lighter expression? Or does it not remove all characters?

Ash 2009-03-30 10:46:21

To be honest, the regex I mentioned here will remove all content within tags as well. This may not be what you want.

Jon Winstanley 2009-03-30 13:38:28
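To make the difference between the two expressions in this thread concrete, here is a small comparison sketch. The class and method names are invented for the example. `RemoveElements` uses the paired-tag pattern from this answer with `RegexOptions.IgnoreCase` and `RegexOptions.Singleline` added as assumptions, and `RemoveTagsOnly` uses a tag-only pattern.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class ElementVsTagSketch
{
    // Paired-tag pattern: matches an opening tag, everything after it, and
    // the first closing tag with the same name, so the content goes too.
    public static string RemoveElements(string html)
    {
        return Regex.Replace(html ?? string.Empty,
            @"<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>", string.Empty,
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }

    // Tag-only pattern: removes the markup but keeps the text in between.
    public static string RemoveTagsOnly(string html)
    {
        return Regex.Replace(html ?? string.Empty, @"<[^>]*>", string.Empty);
    }
}
```

For the input `<p>keep <b>me</b></p> tail`, `RemoveElements` leaves only ` tail`, while `RemoveTagsOnly` leaves `keep me tail`. Which one is "lighter" matters less than which of these two results you actually want.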

Answer 4

+4 A:

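If you want to remove the DIV tags WITH content as well:

string start = "<div>";
string end = "</div>";
// Match lazily up to the first closing tag. A negated character class built from
// the end tag would reject any content containing '<', '/', 'd', 'i', 'v' or '>'.
string txt = Regex.Replace(htmlString, Regex.Escape(start) + "(?<data>.*?)" + Regex.Escape(end), string.Empty, RegexOptions.Singleline);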

Input: <xml><div>junk</div>XXX<div>junk2</div></xml>

Output: <xml>XXX</xml>

Wolf5 2009-03-30 10:46:57
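The start/end idea in this answer can be loosened a little so that it also copes with attributes on the opening tag and with content that spans several lines. The following is a sketch under those assumptions. `DivBlockSketch` and `RemoveDivBlocks` are made-up names, and nested DIVs are not handled, because the match stops at the first closing tag.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class DivBlockSketch
{
    // Removes each <div ...> ... </div> block together with its content.
    // IgnoreCase accepts <DIV>; Singleline lets '.' match newlines.
    public static string RemoveDivBlocks(string html)
    {
        return Regex.Replace(html ?? string.Empty,
            @"<div\b[^>]*>.*?</div\s*>", string.Empty,
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }
}
```

With the input from this answer, `<xml><div>junk</div>XXX<div>junk2</div></xml>`, the call returns `<xml>XXX</xml>`. For nested or malformed markup, an HTML parser is likely a safer choice than a regex.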

Ohhh okay, I see, so you're defining the start and end tags and erasing all of it, basically! That's awesome, exactly what I needed, thanks!

Ash 2009-03-30 10:56:52

Removing <div>'s from text file?