Hi! I am working on indexing feeds from the Internet. I would like to remove the HTML code that appears in some of them. I have written regular expressions for the cases I have seen, but I would like a way to remove all of it automatically, because I don't know whether I have seen every kind of HTML code that can appear in my feeds. Is there any way to do this? Here is an example of the kind of thing I would like to remove: /0831/oly_g_liukin_576.jpg" height="49" width="41" /> BEIJING - AUGUST 15: Nastia Liukin of the...

Thank you!

A: 

In C#, a function that removes HTML tags could look something like this:

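public static String RemoveHtmlTagsFromString(String source)
{
   // Output buffer; the result can never be longer than the input.
   char[] array = new char[source.Length];
   int arrayIndex = 0;
   // True while the scan is between a '<' and the next '>'.
   bool inside = false;

   foreach (char let in source)
   {
       // A '<' opens a tag: skip it and everything up to the closing '>'.
       if (let == '<')
       {
           inside = true;
           continue;
       }

       // A '>' closes the tag: skip it and resume copying.
       if (let == '>')
       {
           inside = false;
           continue;
       }

       // Copy only the characters that lie outside tags.
       if (!inside)
       {
           array[arrayIndex] = let;
           arrayIndex++;
       }
   }
   // Return only the part of the buffer that was actually filled.
   return new string(array, 0, arrayIndex);
}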
Lukas Šalkauskas
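For anyone working in Java rather than C#, the same scanning idea could be ported roughly as follows. This is only a sketch of the approach in the answer above, not a full HTML parser; the class name TagStripper and the method name removeHtmlTags are made up for illustration.

```java
// Hypothetical helper (names are illustrative, not from the original answer).
final class TagStripper {
    private TagStripper() {}

    /** Copies every character that lies outside a '<' ... '>' pair. */
    static String removeHtmlTags(String source) {
        StringBuilder out = new StringBuilder(source.length());
        boolean inside = false; // true while between '<' and the next '>'

        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            if (c == '<') {
                inside = true;      // tag starts: stop copying
            } else if (c == '>') {
                inside = false;     // tag ends: resume copying
            } else if (!inside) {
                out.append(c);      // plain text outside any tag
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints "BEIJING - AUGUST 15"
        System.out.println(removeHtmlTags("<p>BEIJING - <b>AUGUST 15</b></p>"));
    }
}
```

Note that this only recognises text between '<' and '>'. A truncated fragment with no opening '<', like the one in the question, would lose its '>' but keep the rest, and entities such as &amp; are left undecoded.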
I am working in Java, but in any case, I already had a regular expression for <tag>. I was looking for something more efficient for removing things like the example I wrote, that is, removing all the 'not beautiful' code in the feed. Thanks!
Blanca
@Blanca, what do you mean by the "'not beautiful' code"? I don't think that C, or even Java, has any inherent concept of how to recognise beauty, or ugliness.
David Thomas
Of course not. By 'not beautiful' code I mean code that is not part of the feed content but accompanies it, like /0831/oly_g_liukin_576.jpg" height="49" width="41" />. This kind of code doesn't show up in a normal feed, but it is included in mine.
Blanca
Do you want to remove the tag completely, or just the attributes? E.g. <tag attr="foo">text</tag> -> text / -> attr="foo" text (or something similar).
ponzao
completely, if it is possible!
Blanca
Of course it is. You can do something like what Lukas suggested, or you could use, for instance, StAX (http://stax.codehaus.org/) and stream just the characters, not the start or end elements. The document probably won't be well-formed, though, so I would go with Lukas' suggestion.
ponzao
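To illustrate the StAX idea from the comment above, here is a minimal sketch that uses the javax.xml.stream API bundled with the JDK and keeps only the character events. The class name StaxTextExtractor and the method name extractText are hypothetical. It assumes the input is well-formed XML, which, as noted, is probably not the case for these feeds.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper (names are illustrative, not from the thread).
final class StaxTextExtractor {
    private StaxTextExtractor() {}

    /** Returns the text content of a well-formed XML string, without the tags. */
    static String extractText(String xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Feeds come from untrusted sources, so DTDs and external entities are turned off.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);

        StringBuilder out = new StringBuilder();
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    int event = reader.next();
                    // Keep text events; skip start/end elements, comments, etc.
                    if (event == XMLStreamConstants.CHARACTERS
                            || event == XMLStreamConstants.CDATA) {
                        out.append(reader.getText());
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Input that is not well-formed XML ends up here.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints "BEIJING - AUGUST 15"
        System.out.println(extractText("<item><b>BEIJING</b> - AUGUST 15</item>"));
    }
}
```

Unlike the character scanner, this rejects malformed input instead of passing it through, so for feeds containing stray tag fragments the simple scanner is probably the more forgiving choice.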