I need to fetch an HTML page together with all the resources it references (stylesheets, JavaScript files, images) and store the data in a database. I could implement this by simply fetching the files listed in the src attributes, but perhaps someone can suggest a helper gem for this.

Also, is there a way to package all these files into one (like a web archive) that can be opened by most browsers?

Thanks

A: 

Check out Mechanize

Aaron Hinni
Thanks, the gem is very useful.
taro
+2  A: 

You could use mechanize to do this job:

require "rubygems"
require "mechanize"

url = "http://stackoverflow.com/"
agent = Mechanize.new
page = agent.get(url)


# Images: every <img> that has a src attribute
page.search('img[src]').each do |image|
  image_file = agent.get(image["src"])
  # Store image_file.body in the database ...
end

# Stylesheets: <link> elements point at their file with href, not src
page.search('link[rel="stylesheet"]').each do |css|
  href = css["href"]
  css_file = agent.get(href) if href
  # Store css_file.body in the database ...
end

# External scripts: select on src so inline scripts and scripts without a type attribute are handled correctly
page.search('script[src]').each do |script|
  script_file = agent.get(script["src"])
  # Store script_file.body in the database ...
end

You still have to handle exceptions and resolve relative src and href attributes into absolute URLs, but this should do the job. Note that this solution does not fetch images that are referenced from within the stylesheets.
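The two gaps mentioned above (relative references and images referenced from stylesheets) could be handled roughly as follows. This is only a sketch using Ruby's standard uri library: the helper names absolutize and css_resource_urls are made up for illustration, and the regular expression is a simplification that covers common url(...) forms rather than the full CSS grammar.

```ruby
require "uri"

# Hypothetical helper: turn a possibly relative reference into an absolute URL,
# resolved against the URL of the document that contained it.
def absolutize(base_url, ref)
  URI.join(base_url, ref).to_s
end

# Hypothetical helper: list the url(...) targets found in a stylesheet,
# resolved against the stylesheet's own URL (not the page URL).
# Inline data: URIs are skipped because there is nothing to download.
def css_resource_urls(css_text, css_url)
  css_text.scan(/url\(\s*["']?([^"')]+?)["']?\s*\)/i)
          .flatten
          .reject { |ref| ref.start_with?("data:") }
          .map { |ref| absolutize(css_url, ref.strip) }
          .uniq
end
```

With Mechanize, the base URL would presumably come from page.uri.to_s for the HTML page and from css_file.uri.to_s for a stylesheet, with css_file.body passed as css_text. URI.join raises URI::InvalidURIError on malformed references, so that is one of the exceptions to rescue.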

Michel de Graaf
Thanks for the nice sample.
taro