Hi Guys!

I need to scrape a simple webpage which has the following text:

Value=29 Time=128769

The values change frequently.

I want to extract the Value (29 in this case) and store it in a database. I want to scrape this page every 6 hours. I am not interested in displaying the value anywhere; I am only interested in the cron job. Hope I made sense.

Please advise me if I can accomplish this using Google's App Engine.

Thank you!

+2  A: 

Please advise me if I can accomplish this using Google's App Engine.

Sure! E.g., in Python, urlfetch.fetch (with the URL as argument) to get the contents, then a simple re.search(r'Value=(\d+)', contents).group(1) (if the contents are as simple as you're showing) to get the value, and a db.put to store it. Do you want the Python details spelled out, or do you prefer Java?
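As a minimal sketch of just the extraction step, assuming the page text is already available as a string (the helper name extract_value is hypothetical, not part of any App Engine API), the regex can be wrapped so that a missing Value= yields None instead of raising:

```python
import re

# Compiled once; matches "Value=" followed by one or more digits.
VALUE_RE = re.compile(r'Value=(\d+)')

def extract_value(text):
    """Return the integer after 'Value=' in text, or None if it is absent."""
    mo = VALUE_RE.search(text)
    return int(mo.group(1)) if mo else None

print(extract_value('Value=29 Time=128769'))  # prints 29
```

Using search rather than match means the pattern is found even if something precedes Value= on the page.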

Edit: urllib / urllib2 would also be feasible (GAE does support them now).

So cron.yaml should be something like:

cron:
- description: refresh "value"
  url: /refvalue
  schedule: every 6 hours

and app.yaml something like:

application: valueref
version: 1
runtime: python
api_version: 1

handlers:
- url: /refvalue
  script: refvalue.py
  login: admin

You may have other entries in either or both, of course, but this is the subset needed to "refresh the value". A possible refvalue.py might be:

import re
import wsgiref.handlers

from google.appengine.ext import db
from google.appengine.ext import webapp
from google.appengine.api import urlfetch

class Value(db.Model):
  thevalue = db.IntegerProperty()
  when = db.DateTimeProperty(auto_now_add=True)

class RefValueHandler(webapp.RequestHandler):
  def get(self):
    resp = urlfetch.fetch('http://whatever.example.com')
    mo = re.search(r'Value=(\d+)', resp.content)
    if mo:
      val = int(mo.group(1))
    else:
      val = None
    valobj = Value(thevalue=val)
    valobj.put()

def main():
  application = webapp.WSGIApplication(
    [('/refvalue', RefValueHandler),], debug=True)
  wsgiref.handlers.CGIHandler().run(application)

if __name__ == '__main__':
  main()

Depending on what else your web app is doing, you'll probably want to move the class Value to a separate file (e.g. models.py, with other models), which you'll then have to import (from this .py file and from others which do something interesting with all of your saved values).

Here I've taken some possible anomalies into account (no Value= found on the target page) but not others (the target page's server does not respond or gives an error). It's hard to know exactly what anomalies you need to consider and what you want to do if they occur: what I'm doing here is very simply recording None as the value at the anomaly's time, but you may want to do more... or less. I'll leave that up to you!-)
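One possible way to cover the anomalies not handled above (the server not responding, or answering with an error) is sketched below. This is only an illustration under stated assumptions: fetch_value is a hypothetical helper, and the fetch function is passed in as an argument so the logic can be tried outside App Engine. It assumes the response object exposes status_code and content, as the result of urlfetch.fetch does, and it treats every failure the same way by returning None.

```python
import re

def fetch_value(fetch, url):
    """Return the integer after 'Value=' on the page, or None on any failure.

    `fetch` is any callable that takes a URL and returns an object with
    `status_code` and `content` attributes.
    """
    try:
        resp = fetch(url)
    except Exception:
        # Assumption: any fetch failure (timeout, DNS, ...) means "no value".
        return None
    if resp.status_code != 200:
        # The server answered, but with an error status.
        return None
    content = resp.content
    if isinstance(content, bytes) and not isinstance(content, str):
        # Python 3 byte strings need decoding before the str regex is applied.
        content = content.decode('utf-8', 'replace')
    mo = re.search(r'Value=(\d+)', content)
    return int(mo.group(1)) if mo else None
```

Inside the handler this would be called as val = fetch_value(urlfetch.fetch, 'http://whatever.example.com'), after which Value(thevalue=val).put() records either the number or None, as before.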

Alex Martelli
I am interested in the Python version, thank you! I was trying to use urllib2 and httplib2. So urlfetch is the way to go, is it? The contents are really that simple: unique variables holding dynamic values. Value=29 Time=128769 a=39 b=129 c=9 d=12 e=29 f=659 g=279 h=5769 i=43 k=128 j=29 l=769 m=29 n=187 So the imports should be re and the urlfetch library, is that right? I am so sorry, I am a Python newbie. I wish Google App Engine supported more languages or, better yet, access to a Linux terminal!
ThinkCode
@NJTechGuy, GAE does support Java (and indirectly through that many languages that can be implemented on top of Java, such as JRuby, Scala, Groovy, Clojure -- see http://groups.google.com/group/google-appengine-java/web/will-it-play-in-app-engine for a list of 8+ languages known to run on App Engine... are you *seriously* saying that those TEN-plus languages just aren't enough and that's truly limiting you?!-). No idea what "access to Linux terminal" means in this context. Anyway, edited the answer to include a complete solution.
Alex Martelli
Oh my god Alex! You are a genius! So many forums out there hate to help newbies. You provided me a full solution! I have yet to test it but I am sure it will work :) I never liked Java :( I preferred C# back then. Looks like I have to gain some ground on Java, I keep hearing a lot about it! I would love Perl compatibility on GAE. I am kinda new to the open source world, and I love it already! With sites like these, I should be on top in no time! Thank you so much :)
ThinkCode
P.S.: Is there any way I could be alerted immediately when an answer is posted/edited? The daily digest works, but an instant alert would be awesome! Just a thought!
ThinkCode
I believe you can set up RSS feeds, but I haven't personally done so; check out meta.stackoverflow.com.
Alex Martelli