I'm building a test scraping site with Django. For some reason, the following code returns only one image, whereas I'd like it to output every image, every link, and every price. Any help? (Also, if you know how to store this data in a database model so I don't have to scrape the site on every request, I'm all ears, but that may be another question.) Cheers!

Here is the template file:

{% extends "base.html" %}

{% block title %}Boats{% endblock %}

{% block content %}

<img src="{{ fetch_boats }}"/>

{% endblock %}

Here is the views.py file:

#views.py
from django.shortcuts import render_to_response
from django.template.loader import get_template
from django.template import Context
from django.http import Http404, HttpResponse
from fetch_images import fetch_imagery

def fetch_it(request):
    fi = fetch_imagery()
    return render_to_response('fetch_image.html', {'fetch_boats' : fi})

Here is the fetch_images module:

#fetch_images.py
from BeautifulSoup import BeautifulSoup
import re
import urllib2

def fetch_imagery():
    response = urllib2.urlopen("http://www.boattrader.com/search-results/Type")
    html = response.read()

#create a beautiful soup object
    soup = BeautifulSoup(html)

#all boat images have attribute height=165
    images = soup.findAll("img",height="165")
    for image in images:
        return image['src'] #print the url of the image only

# all links to detailed boat information have class lfloat
    links = soup.findAll("a", {"class" : "lfloat"})
    for link in links:
        return link['href']
        #print link.string

# all prices are spans and have the class rfloat
    prices = soup.findAll("span", { "class" : "rfloat" })
    for price in prices:
        return price
        #print price.string

Lastly, if needed the mapped url in urlconf is below:

from django.conf.urls.defaults import *
from mysite.views import fetch_it

urlpatterns = patterns('', ('^fetch_image/$', fetch_it))
+1  A: 

Your fetch_imagery function needs some work - since you're returning (instead of using yield), the first return image['src'] will terminate the function call (I'm assuming here that all those returns are part of the same function definition as shown by your code).
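To make the difference concrete, here is a minimal sketch (written for Python 3, with made-up sample data and hypothetical function names) comparing a `return` inside a loop with building a list or using `yield`:

```python
def first_image_only(images):
    # Mirrors the loop in the question: `return` exits the function
    # on the very first iteration, so only one URL ever comes back.
    for image in images:
        return image["src"]


def all_images_list(images):
    # Collect every URL first, then return the whole list once.
    return [image["src"] for image in images]


def all_images_generator(images):
    # `yield` hands back one URL at a time and resumes on the next request.
    for image in images:
        yield image["src"]


# Made-up stand-in for the tags that soup.findAll(...) would give you.
sample = [{"src": "a.jpg"}, {"src": "b.jpg"}, {"src": "c.jpg"}]

print(first_image_only(sample))            # a.jpg
print(all_images_list(sample))             # ['a.jpg', 'b.jpg', 'c.jpg']
print(list(all_images_generator(sample)))  # ['a.jpg', 'b.jpg', 'c.jpg']
```

Note that everything after the first loop in the original function (the links and prices sections) is never reached for the same reason, as long as at least one image matches.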

Also, my assumption is that you will be returning a list/tuple (or defining a generator method) from fetch_imagery in which case your template needs to look like:

{% block content %}
    {% for image in fetch_boats %}
        <img src="{{ image }}" />
    {% endfor %}
{% endblock %}

This will basically loop over all items (image urls in your case) in your list and will create img tags for each one of them.
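If you also want the links and prices, one option is to collect all three lists and return them together. The following is only a sketch under some assumptions: it is written for Python 3, it uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages, and `BoatParser`, `parse_boats`, and `SAMPLE_HTML` are made-up names and data (the real page markup may differ). The same collect-then-return shape works with the `soup.findAll` results from the question.

```python
from html.parser import HTMLParser


class BoatParser(HTMLParser):
    """Collects image URLs, detail links, and prices from listing HTML."""

    def __init__(self):
        super().__init__()
        self.images, self.links, self.prices = [], [], []
        self._in_price = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "img" and attrs.get("height") == "165":
            self.images.append(attrs.get("src"))
        elif tag == "a" and "lfloat" in classes:
            self.links.append(attrs.get("href"))
        elif tag == "span" and "rfloat" in classes:
            self._in_price = True  # the text that follows is a price

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False

    def handle_data(self, data):
        if self._in_price and data.strip():
            self.prices.append(data.strip())


def parse_boats(html):
    # Gather everything first, then return once at the end.
    parser = BoatParser()
    parser.feed(html)
    parser.close()
    return {"images": parser.images, "links": parser.links, "prices": parser.prices}


# Made-up markup that follows the selectors described in the question.
SAMPLE_HTML = """
<div><a class="lfloat" href="/boat/1"><img src="/img/1.jpg" height="165"></a>
<span class="rfloat">$10,000</span></div>
<div><a class="lfloat" href="/boat/2"><img src="/img/2.jpg" height="165"></a>
<span class="rfloat">$25,500</span></div>
"""

boats = parse_boats(SAMPLE_HTML)
print(boats["images"])
print(boats["links"])
print(boats["prices"])
```

With a dictionary like this passed to the template as `fetch_boats`, the loop would iterate over `fetch_boats.images` (and likewise `fetch_boats.links` and `fetch_boats.prices`).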

Rishabh Manocha
Thanks Rishabh, I hadn't seen the yield statement before (still rather a newbie)... For anyone else, here's a great answer explaining the yield statement: http://stackoverflow.com/questions/231767/can-somebody-explain-me-the-python-yield-statement
Diego
+2  A: 

This is a bit out of scope, but to my mind scraping consumes a lot of CPU time, memory, and bandwidth, so I think it should be done in the background, asynchronously, rather than inside the request.

It's a great idea though :)
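As a rough illustration of doing the scraping in the background, here is a sketch (Python 3, standard library only) in which a worker thread refreshes an in-memory cache and the view only ever reads the cached result. `ScrapeCache` and `fake_fetcher` are hypothetical names, and a real deployment would more likely use a cron job or a task queue writing to a database, since each server process would otherwise hold its own copy of the cache.

```python
import threading
import time


class ScrapeCache:
    """Keeps the latest scrape result in memory so views never scrape inline."""

    def __init__(self, fetcher):
        self._fetcher = fetcher  # any zero-argument callable that returns data
        self._lock = threading.Lock()
        self._data = None
        self._updated_at = None

    def refresh(self):
        data = self._fetcher()  # the slow network work happens outside the lock
        with self._lock:
            self._data = data
            self._updated_at = time.time()

    def get(self):
        # Cheap read for the view; returns None until the first refresh.
        with self._lock:
            return self._data

    def start_background(self, interval_seconds):
        def loop():
            while True:
                try:
                    self.refresh()
                except Exception:
                    pass  # keep serving the previous result if a scrape fails
                time.sleep(interval_seconds)

        thread = threading.Thread(target=loop, daemon=True)
        thread.start()
        return thread


def fake_fetcher():
    # Stand-in for the real scraping function (e.g. fetch_imagery).
    return ["/img/1.jpg", "/img/2.jpg"]


cache = ScrapeCache(fake_fetcher)
cache.refresh()     # one synchronous refresh, for demonstration
print(cache.get())  # ['/img/1.jpg', '/img/2.jpg']
```

In the view you would then call `cache.get()` instead of scraping, and call `cache.start_background(300)` once at startup (the 300-second interval is an arbitrary choice).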

dzen
Out of scope? How is this done asynchronously? The app I'd like to create requires real-time data, as on kayak.com. Is that an asynchronous scraper? Still learning. Thanks!
Diego