Hi, I am trying to parse an RSS channel with the simple-rss lib.

Unfortunately, I get a lot of garbage in the node:

 <description>&lt;p&gt;
some description

&lt;/p&gt;
 &lt;a href="http://url.com/trac/xxx/wiki/foo?action=diff&amp;amp;amp;version=28"&amp;gt;(diff)&amp;lt;/a&amp;gt;&lt;/description&gt;

I need to retrieve the text ("some description") and, optionally, the URL.

What is the best way to do it? A regexp? (If that is the answer, could you give me an example, please?)

+3  A: 

That's not garbage. It is just an HTML-escaped string. I am assuming that by "the url" you mean the one inside the HTML tags (<a></a>). The following code should work.

require 'cgi'
description = "&lt;/p&gt; &lt;a href=\"http://url.com/trac/xxx/wiki/foo?action=diff&amp;amp;amp;version=28\"&amp;gt;(diff)&amp;lt;/a&amp;gt;"
CGI.unescapeHTML(description) # => </p> <a href="http://url.com/trac/xxx/wiki/foo?action=diff&amp;amp;version=28"&gt;(diff)&lt;/a&gt;
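The output above still contains entities such as &amp;amp; and &gt; because the sample was escaped more than once, and each call to CGI.unescapeHTML removes only one layer. A minimal sketch of one way to deal with that is to unescape repeatedly until the string stops changing. The helper name fully_unescape and the cap of five passes are assumptions made for illustration; they are not part of CGI.

```ruby
require 'cgi'

# Hypothetical helper (not part of CGI): unescape repeatedly until the
# string stops changing, giving up after max_passes passes.
def fully_unescape(text, max_passes = 5)
  max_passes.times do
    unescaped = CGI.unescapeHTML(text)
    return unescaped if unescaped == text # nothing left to decode
    text = unescaped
  end
  text
end

description = "&lt;/p&gt; &lt;a href=\"http://url.com/trac/xxx/wiki/foo?action=diff&amp;amp;amp;version=28\"&amp;gt;(diff)&amp;lt;/a&amp;gt;"
puts fully_unescape(description)
# => </p> <a href="http://url.com/trac/xxx/wiki/foo?action=diff&version=28">(diff)</a>
```

Be careful with this if the text may legitimately contain escaped entities (for example, an article about HTML), since repeated unescaping would decode those too.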

If you don't want the HTML tags, there are various ways to obtain just the URL. A simple regex for the URL should work, which I leave to you to figure out. (Hint: Google)
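For illustration, here is a rough sketch of such a regex, assuming the description has already been unescaped into ordinary HTML and that the markup looks like the sample in the question (a single <p> and a double-quoted href). The sample string and variable names are made up for this example; a regex like this is fine for predictable markup but is not a general HTML parser.

```ruby
require 'cgi'

# Sample data: the description after unescaping, shaped like the question's node.
html = '<p>some description</p> <a href="http://url.com/trac/xxx/wiki/foo?action=diff&amp;version=28">(diff)</a>'

# Capture the value of the first href="..." attribute, then decode any
# entities left inside it (e.g. &amp; in the query string).
href = html[/href="([^"]*)"/, 1]
url  = href && CGI.unescapeHTML(href)

# Capture the text of the first <p>...</p>; /m lets . match newlines.
text = html[/<p>(.*?)<\/p>/m, 1]&.strip

puts text # => some description
puts url  # => http://url.com/trac/xxx/wiki/foo?action=diff&version=28
```

Both url and text are nil when the pattern does not match, so check for that before using them.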

Chirantan
Maciek Sawicki
Whatever suits your needs. It depends on the size of the XML. If it is very large, I would suggest using both together: use an XML parser to narrow down to the node you want to extract the URL from, and then apply a regex to that node. But again, whatever suits your needs.
Chirantan
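The combined approach from the comment above could look roughly like this. It is only a sketch: it uses REXML instead of simple-rss (REXML ships with Ruby, though newer versions may need it installed as a gem), and the cut-down feed and the //item/description path are assumptions about what the real feed looks like.

```ruby
require 'rexml/document'
require 'cgi'

# Hypothetical, cut-down feed; a real one has many more elements.
xml = <<~XML
  <rss version="2.0">
    <channel>
      <item>
        <description>&lt;p&gt;some description&lt;/p&gt; &lt;a href="http://url.com/trac/xxx/wiki/foo?action=diff&amp;amp;version=28"&gt;(diff)&lt;/a&gt;</description>
      </item>
    </channel>
  </rss>
XML

doc = REXML::Document.new(xml)

# Step 1: the XML parser narrows things down to the description nodes;
# reading node.text also decodes one layer of escaping.
descriptions = REXML::XPath.match(doc, '//item/description')

# Step 2: a regex pulls the href out of the HTML embedded in each node.
urls = descriptions.filter_map do |node|
  href = node.text.to_s[/href="([^"]*)"/, 1]
  CGI.unescapeHTML(href) if href
end

puts urls
# => http://url.com/trac/xxx/wiki/foo?action=diff&version=28
```

Descriptions without a link are simply skipped by filter_map (Ruby 2.7+).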