Hi, I am a novice Ruby programmer, just learning to write hello-world programs. I need help parsing text in Ruby. Given a string like this:

@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3

I would like to eliminate all the hyperlinks and get the plain text, like this:

@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands

Any quick help with this would be appreciated.

Thanks

+1  A: 
foo = "@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3"
# Remove anything that looks like an http:// URL, then trim the leftover trailing space
r = foo.gsub(/http:\/\/[\w\.:\/]+/, '').strip
puts r
# @BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands
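
The pattern above only matches http:// links made of word characters, dots, colons, and slashes. As a hedged variation, here is a minimal sketch that also covers https:// links and URLs containing query strings or hyphens, assuming a URL simply runs until the next whitespace character. The helper name strip_links is made up for illustration and is not part of Ruby or of the answer above.

```ruby
# strip_links is a hypothetical helper, not a built-in method.
# Assumption: a URL starts with http:// or https:// and ends at the next whitespace.
def strip_links(text)
  text.gsub(%r{https?://\S+}, '') # drop every http(s) URL
      .squeeze(' ')               # collapse double spaces left where a URL was removed
      .strip                      # trim leading/trailing whitespace
end

tweet = "@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3"
puts strip_links(tweet)
```

One trade-off of matching up to the next whitespace is that punctuation glued to the end of a link (for example a trailing comma) is removed along with it.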
hobodave
A: 

It can be done in a quick-and-dirty way or in a sophisticated way. I am showing the sophisticated way:

require 'rubygems'
require 'hpricot' # you may need to install this gem
require 'open-uri'

## First, get the URL of the embedded/framed HTML file
start_url = 'http://news.bnonews.com/u4z3'
doc = Hpricot(open(start_url))
news_html_url = doc.at('//link[@href]').to_s.match(/(http[^"]+)/) 

## Now get the news text; it is in the 3rd <p> tag of the framed HTML file
doc2 = Hpricot(open(news_html_url.to_s))
news_text = doc2.at('//p[3]').to_plain_text
puts news_text

Try to understand what the code is doing at each step, and apply that knowledge in your future projects. These pages may help:

http://wiki.github.com/why/hpricot/an-hpricot-showcase

http://code.whytheluckystiff.net/doc/hpricot/

vulcan_hacker
It doesn't appear you read the question at all.
hobodave
@hobodave: I tried again, and this time it appears I did misunderstand the question. I assumed bad English was involved and that he wanted to get the text from that link. I am sorry for that. It is a pretty simple problem, then.
vulcan_hacker