answers:

4

Hey,

I've made a small program in C#.NET which doesn't really serve much of a purpose: it tells you the chance of your DOOM based on today's news, lol. It takes an RSS feed on load from the BBC website and then looks for keywords which either increment or decrease the percentage chance of DOOM.

Crazy little project, but maybe one day the classes will come in handy again for something more important.

I receive the RSS in an XML format, but it contains a lot of div tags and formatting characters which I don't really want in the database of keywords.

What is the best way of removing these unwanted characters and divs?

Thanks,

Ash

+4  A: 

IMHO the easiest way is to use regular expressions. Something like:

string txt = Regex.Replace(htmlString, @"<(.|\n)*?>", string.Empty);

Depending on which tags and characters you want to remove you will modify the regex, of course. You will find a lot of material on this and other methods if you do a web search for 'strip html C#'.

SO question Render or convert Html to ‘formatted’ Text (.NET) might help you, too.

f3lix
This answer worked, but it seems to remove all the tags and put a blank space in their place, which, once the text is split into an array, gives a lot of whitespace in the database. How do I solve that? Also, is there any way of adding a parameter to this to remove characters like \n and \t?
Ash
Not sure why you're seeing *extra* blank space - string.Empty would be replacing the tags with "", not " ". It's possible that you're not stripping out the excess whitespace (tabs "\t", newlines "\n", etc.) in the RSS - you might want to look at doing a further replace for those, or adding them to the pattern.
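A minimal sketch of that further replace, in case it helps. The class and method names (KeywordCleaner, StripTags, NormalizeWhitespace, ToKeywords) are made up for illustration and are not from the original program. It assumes the feed text is already in a string and that keywords are separated by whitespace.

```csharp
using System;
using System.Text.RegularExpressions;

public static class KeywordCleaner
{
    // Replaces anything that looks like a tag with a single space, so that
    // words on either side of a tag do not run together.
    public static string StripTags(string html)
    {
        return Regex.Replace(html, @"<[^>]*>", " ");
    }

    // Collapses every run of whitespace (spaces, "\t", "\n", "\r")
    // into one space and trims both ends.
    public static string NormalizeWhitespace(string text)
    {
        return Regex.Replace(text, @"\s+", " ").Trim();
    }

    // Strips tags, tidies whitespace, then splits into words.
    // RemoveEmptyEntries guards against empty array elements.
    public static string[] ToKeywords(string html)
    {
        string clean = NormalizeWhitespace(StripTags(html));
        return clean.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

Calling KeywordCleaner.ToKeywords on a description string should give an array with no empty entries. Tags are swapped for a space here rather than string.Empty so that text like "fall</div><p>sharply" does not end up as one word.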
Zhaph - Ben Duguid
+2  A: 

Stripping HTML tags from a given string is a common requirement and you can probably find many resources online that do it for you.

The accepted method, however, is to use a Regular expression based Search and Replace. This article provides a good sample along with benchmarks. Another point worth mentioning is that you would require separate Regex based lookups for the different kinds of unwanted characters you are seeing. (Perhaps showing us an example of the HTML you receive would help)

Note that your requirements may vary based on which tags you want to remove. In your question, you only mention DIV tags. If that is the only tag you need to replace, a simple string search and replace should suffice.
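As a rough sketch of the plain search-and-replace route (the DivRemover class and method name are hypothetical), assuming the div tags carry no attributes and are written in lowercase:

```csharp
using System;

public static class DivRemover
{
    // Removes only the literal "<div>" and "</div>" markers and keeps
    // the text between them.
    // Assumption: no attributes (e.g. <div class="x">) and lowercase tags;
    // anything else is left untouched by this approach.
    public static string RemovePlainDivTags(string input)
    {
        return input.Replace("<div>", string.Empty)
                    .Replace("</div>", string.Empty);
    }
}
```

If the feed turns out to contain divs with attributes, this will miss them, and a regex is probably the more practical choice.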

AmitK
Coool, thanks guys!
Ash
Lol, I cannot even vote for anyone yet!
AmitK
+1 so you can start voting sooner ;)
Christian Witts
+1  A: 

A regular expression such as this:

<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>

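Would match each HTML element, from its opening tag through to the matching closing tag (use a case-insensitive match, since the pattern only lists uppercase letters).

Use this to remove them from your data.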

Jon Winstanley
Is there a certain order you must put the characters in when writing a regular expression? Is the answer towards the top a lighter expression? Or does it not remove all characters?
Ash
To be honest, the regex I mentioned here will remove all content within tags as well. This may not be what you want.
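To make the difference concrete, here is a small sketch comparing the two approaches. The TagRegexDemo class and its method names are invented for this example. The second method uses the pattern from this answer with IgnoreCase and Singleline added, which is an assumption about how it would be applied.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagRegexDemo
{
    // Matches one tag at a time, so the text between tags survives.
    public static string RemoveTagsOnly(string html)
    {
        return Regex.Replace(html, @"<[^>]*>", string.Empty);
    }

    // Matches an opening tag, everything after it, and the matching
    // closing tag (\1 refers back to the captured tag name), so the
    // content inside the element is removed as well.
    public static string RemoveElements(string html)
    {
        return Regex.Replace(
            html,
            @"<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>",
            string.Empty,
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }
}
```

For the input "<p>Flood <b>alert</b></p>", RemoveTagsOnly keeps "Flood alert", while RemoveElements returns an empty string because the outer p element swallows everything.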
Jon Winstanley
+4  A: 

If you want to remove the DIV tags WITH content as well:

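string start = "<div>";
string end = "</div>";
// Lazy match up to the first closing tag. A character class such as [^</div>] would reject the individual letters d, i and v inside the content.
string txt = Regex.Replace(htmlString, Regex.Escape(start) + "(?<data>.*?)" + Regex.Escape(end), string.Empty, RegexOptions.Singleline);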

Input: <xml><div>junk</div>XXX<div>junk2</div></xml>

Output: <xml>XXX</xml>
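If the divs in the feed carry attributes, a variation along these lines might work. This is only a sketch: the DivElementRemover class and method name are made up, and it assumes divs are not nested, since a regex like this stops at the first closing tag it finds.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DivElementRemover
{
    // Removes each <div ...>...</div> block, including its content.
    // [^>]* allows attributes on the opening tag; .*? is lazy, so each
    // match ends at the nearest </div>.
    // Assumption: divs are not nested inside one another.
    public static string RemoveDivElements(string input)
    {
        return Regex.Replace(
            input,
            @"<div\b[^>]*>.*?</div>",
            string.Empty,
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }
}
```

With the input <xml><div class="x">junk</div>XXX<div>junk2</div></xml>, this should leave <xml>XXX</xml>.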

Wolf5
Ohhh okay, I see, so you're defining the start and end tags and erasing all of it, basically! That's awesome, exactly what I needed, thanks!
Ash