views: 93
answers: 7

I'm building a Rails site that, among other things, allows users to build their own recipe repository. Recipes are entered either manually or via a link to another site (think epicurious, cooks.com, etc). I'm writing scripts that will scrape a recipe from these sites given a link from a user, and so far (legal issues notwithstanding) that part isn't giving me any trouble.

However, I'm not sure where to put the code that I'm writing for these scraper scripts. My first thought was to put it in the recipes model, but it seems a bit too involved to go there; would a library or a helper be more appropriate?

Also, as I mentioned, I'm building several different scrapers for different food websites. It seems to me that the elegant way to do this would be to define an interface (or abstract base class) that determines a set of methods for constructing a recipe object given a link, but I'm not sure what the best approach would be here, either. How might I build out these OO relationships, and where should the code go?

A: 

Often, utility classes that aren't really part of the MVC design are put into the lib folder. I've also seen people put them into the models folder, but lib is really the "correct" place.

You could then create an instance of the recipe scraper within the controller as required, feeding the data into the model.
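As a minimal sketch of that flow (the RecipeScraper class, its attributes method, and the action name are hypothetical, not defined anywhere above):

# app/controllers/recipes_controller.rb
class RecipesController < ApplicationController
  def create_from_url
    scraper = RecipeScraper.new(params[:url])    # class lives in lib/
    @recipe = Recipe.create!(scraper.attributes) # feed scraped data into the model
    redirect_to @recipe
  end
end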

Tim Sullivan
+2  A: 

You've got two sides of this thing that are obvious. The first is how you will store the recipes, which will be models. Obviously Models will not be scraping other sites, since they have a single responsibility: storing valid data. Your controller(s), which will initiate the scraping and storage process, should not contain the scraping code either (though they will call it).

While in Ruby we don't go for abstract classes or interfaces -- it's duck-typed, so it's enough that your scrapers implement a known method or set of methods -- your scraping engines should all be similar, especially in terms of the public methods they expose.
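For instance, duck typing just means each scraper exposes the same public method; no shared superclass is required. The class and method names below are only illustrative:

class EpicuriousScraper
  def recipe_from(url)
    # epicurious-specific parsing
  end
end

class CooksScraper
  def recipe_from(url)
    # cooks.com-specific parsing
  end
end

# callers only care that the object responds to #recipe_from
recipe = scraper.recipe_from(link)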

You will put your scrapers -- and here's the lame answer -- wherever you want. lib is fine, but making it a plugin might not be a bad idea either. See my question here - with a stunning answer by famous Rails-guy Yehuda Katz - for some other ideas, but in general: there is no right answer. There are some wrong ones, though.

Yar
A: 

Not everything in app/models has to be an ActiveRecord model. Since the scrapers directly pertain to the business logic of your application, they belong in the app directory, not the lib directory. They are also neither controllers, views, nor helpers (helpers are there to help the views and the views alone). So, they belong in app/models. I would namespace them into app/models/scrapers or something of that sort, just for organizational purposes.
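A rough sketch of what that layout could look like (the class and method names here are illustrative, not prescribed by this answer):

# app/models/scrapers/epicurious.rb
module Scrapers
  class Epicurious
    def initialize(url)
      @url = url
    end

    # returns a hash of attributes suitable for Recipe.create
    def attributes
      # site-specific parsing goes here
    end
  end
end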

Mike Dotterer
A: 

I'd set up a rake task that scrapes the sites and creates the new recipes. Once that is working, I'd use a background processor or cron job to run the rake task.
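Something along these lines, assuming a RecipeScraper class and a scraped flag on the recipes table (both are assumptions for illustration):

# lib/tasks/scrape.rake
namespace :recipes do
  desc "Scrape recipes that were submitted as URLs but not yet imported"
  task :scrape => :environment do
    Recipe.find_each do |recipe|
      next if recipe.recipe_url.blank? || recipe.scraped?
      attrs = RecipeScraper.new(recipe.recipe_url).attributes
      recipe.update_attributes(attrs.merge(:scraped => true))
    end
  end
end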

jspooner
A: 

I would create a folder in lib called scrapers, then within that folder create one file per scraper: epicurious.rb, cooks.rb, etc. You could then define a base scraper class that contains any methods common to all scrapers, similar to the following:

lib/scrapers/base.rb

module Scrapers
  class Base
    def shared_1
    end

    def shared_2
    end

    def must_implement1
      raise NotImplementedError
    end

    def must_implement2
      raise NotImplementedError
    end
  end
end

lib/scrapers/epicurious.rb

module Scrapers
  class Epicurious < Base
    def must_implement1
    end

    def must_implement2
    end
  end
end

Then call the relevant class from within your controller using Scrapers::Epicurious.new, or call a class method within Scrapers::Base that picks the relevant implementation based upon a passed argument.
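That class method could look something like this (a sketch only; the host-to-class mapping is an assumption):

# lib/scrapers/base.rb (continued)
require 'uri'

module Scrapers
  class Base
    # returns an instance of the scraper registered for the URL's host
    def self.for(url)
      case URI.parse(url).host
      when /epicurious\.com\z/ then Epicurious.new
      when /cooks\.com\z/      then Cooks.new
      else raise ArgumentError, "No scraper defined for #{url}"
      end
    end
  end
end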

Steve Weet
A: 

The scraping engine should be a standalone plugin or gem. For something quick and dirty, you can put it inside lib; that's the usual convention anyway. It should probably implement a factory class that instantiates the right type of scraper depending on the URL, so for client usage it will be as simple as:

Scraper.scrape(url)
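One possible shape for that factory (the scraper class names and the URL patterns are assumptions for illustration):

# lib/scraper.rb
class Scraper
  SCRAPERS = {
    /epicurious\.com/ => EpicuriousScraper,
    /cooks\.com/      => CooksScraper
  }

  # picks the scraper whose pattern matches the URL and delegates to it
  def self.scrape(url)
    _, klass = SCRAPERS.find { |pattern, _| url =~ pattern }
    raise "No scraper for #{url}" unless klass
    klass.new.scrape(url)
  end
end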

Also, if this is a long-running task, you might want to consider using Resque or delayed_job to offload the work to a separate process.
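For example, with a recent delayed_job the call above can be pushed onto a queue instead of running inside the request:

# runs Scraper.scrape(url) later, in a worker process
Scraper.delay.scrape(url)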

Aaron Qian
A: 

Try focusing on getting the thing working first before moving it to a gem/plugin. Also, forget about interfaces / abstract classes - just write the code that does the thing. The only thing your model should know is whether the recipe is remote and what its URL is. You could put all the scraping code in app/scrapers. Here's an example implementation outline:

class RecipePage
  def initialize(url)
    @url = url
    @parser = get_parser(@url)
  end

  def get_attributes
    raise "trying to scrape unknown site" unless @parser
    @parser.recipe_attributes(get_html)
  end

  private

  def get_html
    # fetch the html from @url; open-uri from the standard library is one option
    require 'open-uri'
    open(@url).read
  end

  def get_parser(url)
    # match the url to a parser class (e.g. the domain camelized);
    # returns nil if you are not handling that particular site yet
    EpicuriousComParser if url =~ /epicurious\.com/
  end
end

class EpicuriousComParser
  def self.recipe_attributes(html)
    # this does the hard job of querying the html to pull out the
    # title, text and image of the recipe, returned as a hash
    {
      :title => "recipe title",
      :text => "recipe text",
      :image => "recipe_image_url",
    }
  end
end

Then, in your model:

class Recipe < ActiveRecord::Base
  after_create :scrape_recipe, :if => :recipe_url

  private 

  def scrape_recipe
    # do that in background - ie in DelayedJob
    recipe_page = RecipePage.new(self.recipe_url)
    self.update_attributes(recipe_page.get_attributes.merge(:scraped => true))
  end
end

Then you can create more parsers, e.g. CooksComParser for cooks.com, along the lines of the sketch below.
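A second parser might look like this; the Nokogiri selectors are placeholders, since the real markup of the target site isn't shown here:

require 'nokogiri'

class CooksComParser
  def self.recipe_attributes(html)
    doc   = Nokogiri::HTML(html)
    title = doc.at_css("h1")              # selectors below are placeholders
    body  = doc.at_css("#recipe_body")
    photo = doc.at_css("img.recipe_photo")
    {
      :title => title && title.text.strip,
      :text  => body  && body.text.strip,
      :image => photo && photo["src"]
    }
  end
end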

Krzysztof