I need to parse thousands of feeds and performance is an essential requirement. Do you have any suggestions?
Thanks in advance!
Not sure about the performance, but a similar question was answered at http://stackoverflow.com/questions/214590/parsing-atom-rss-in-ruby-rails
You might also look into Hpricot, which parses XML but assumes that it's well-formed and doesn't do any validation.
http://wiki.github.com/why/hpricot http://wiki.github.com/why/hpricot/hpricot-xml
I haven't tried it, but I recently read about Feedzirra, which claims to be built for performance:
Feedzirra is a feed library that is designed to get and update many feeds as quickly as possible. This includes using libcurl-multi through the taf2-curb gem for faster http gets, and libxml through nokogiri and sax-machine for faster parsing.
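To give a feel for why SAX-style parsing helps at this scale, here is a rough sketch of the streaming idea. It is not Feedzirra's code and does not use Nokogiri or sax-machine. It uses REXML's stream listener instead, purely as a stand-in so the example has no extra dependencies. The names `TitleListener` and `entry_titles` are made up for illustration, and namespaced tags such as `atom:title` are not handled.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Collects the text of every <title> that sits directly inside an <item> (RSS)
# or <entry> (Atom), without building a full document tree in memory.
class TitleListener
  include REXML::StreamListener

  attr_reader :titles

  def initialize
    @titles = []
    @stack  = []   # names of the elements that are currently open
    @buffer = nil  # non-nil only while inside an entry's <title>
  end

  def tag_start(name, _attrs)
    # Start buffering only for titles whose parent is an item/entry,
    # so the channel-level <title> is skipped.
    @buffer = +'' if name == 'title' && %w[item entry].include?(@stack.last)
    @stack.push(name)
  end

  def text(data)
    @buffer << data if @buffer
  end

  def cdata(data)
    @buffer << data if @buffer
  end

  def tag_end(name)
    @stack.pop
    return unless name == 'title' && @buffer

    @titles << @buffer.strip
    @buffer = nil
  end
end

# Hypothetical helper: returns the entry titles found in an XML string or IO.
def entry_titles(xml)
  listener = TitleListener.new
  REXML::Document.parse_stream(xml, listener)
  listener.titles
end
```

The listener only keeps the strings it cares about, so memory use stays roughly flat regardless of feed size. A libxml-backed SAX parser follows the same callback pattern but should be considerably faster than REXML.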
You can use RFeedParser, a Ruby port of the well-known Python Universal Feed Parser. It's based on Hpricot, and it's really fast and easy to use.
http://rfeedparser.rubyforge.org/
An example:
    require 'rubygems'
    require 'rfeedparser'
    require 'open-uri'

    # Fetch the feed over HTTP and parse it
    feed = FeedParser.parse(open('http://feeds.feedburner.com/engadget'))

    # Print the title of every entry
    feed.entries.each do |entry|
      puts entry.title
    end
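With thousands of feeds, most of the wall-clock time is likely to be spent waiting on the network, so it may help to run a per-feed step like the one above across a small pool of threads. The following is only a sketch under that assumption. `process_feeds` is a hypothetical helper, not part of RFeedParser, and the block you pass in decides how each URL is fetched and parsed.

```ruby
# Runs the given block once per URL using a fixed number of worker threads.
# Returns a hash of url => block result, or url => exception if the block raised.
def process_feeds(urls, workers: 8, &work)
  queue = Queue.new
  urls.each { |url| queue << url }
  queue.close                      # pop returns nil once the queue is drained

  results = {}
  lock = Mutex.new

  threads = Array.new(workers) do
    Thread.new do
      while (url = queue.pop)
        outcome = begin
          work.call(url)
        rescue StandardError => e
          e                        # one bad feed should not stop the others
        end
        lock.synchronize { results[url] = outcome }
      end
    end
  end

  threads.each(&:join)
  results
end

# Possible usage with the parser shown above (not run here):
#   titles = process_feeds(urls) do |url|
#     FeedParser.parse(open(url)).entries.map { |entry| entry.title }
#   end
```

In MRI, threads overlap I/O waits but do not run Ruby code in parallel, so this helps with fetching more than with parsing. Whether it is enough depends on how CPU-heavy the parsing turns out to be.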
When all you have is a hammer, everything looks like a nail. Consider a solution other than Ruby for this. Though I love Ruby and Rails and would not part with them for web development or perhaps for a domain-specific language, I prefer that heavy data lifting of the type you describe be performed in Java, or perhaps Python or even C++.
Given that the destination of this parsed data is likely a database, the database can act as the common point between the Rails portion of your solution and the portion written in the other language. Then you're using the best tool to solve each of your problems, and the result is likely easier to work on and truly meets your requirements.
If speed is truly of the essence, why add an additional constraint and say, "Oh, it's only of the essence as long as I get to use Ruby"?