I am trying to scrape some websites and republish the data as an RSS feed. How hard is this to set up with Google App Engine? What are the advantages and disadvantages of using GAE? Any recommendations and guidelines are greatly appreciated!

+1  A: 

Google App Engine offers much more functionality (and complexity) than you will need if all you truly want to do is republish some structured data as RSS. Personally, I would use something like Yahoo Pipes for a task like this.

That being said... if you want/need to get your feet wet with GAE, go for it!

Aram Verstegen
+1  A: 

Harder than it would be in most other technologies.

GAE can sort of do scheduled batch stuff like this now, but it's really not intended for that type of thing. Pick pretty much any other language and platform for this particular task, and you'll make your life a lot easier.

Jason Kester
A: 

I have never looked into Yahoo Pipes. I'll definitely do that. Does Pipes allow me to manipulate HTML? I am going to read the Yahoo Pipes docs.

A: 

Working with Google App Engine is pretty straightforward. I would recommend going through the Getting Started guide. It's short and simple and touches on the essential GAE topics. There are more pros and cons than I will list here.

Pros:
In general, App Engine is designed for high-traffic web applications that need to scale. Furthermore, it is designed from a programmer's perspective. Many of the scalability issues (database optimization, server administration, etc.) are dealt with by Google. Having said that, I find it to be a nice platform. It is still being actively developed by Google engineers, and scheduling of tasks (a long-requested feature) is on the current roadmap.

Cons:
Perhaps the biggest downsides right now are, again, the lack of official scheduling support and the quota limits currently set for free accounts. However, you can't complain much if it's free. Currently it only supports Python as a programming language (although a new language [Java, I predict] is coming soon). Furthermore, Python 2.6 (and 3.0 for that matter) is not yet supported. In addition, Django 1.0 is not officially supported in App Engine (although you can package Django 1.0 with your application).

fuentesjr
A: 

Thanks for the replies. So what do you guys recommend I use? I need to go out to the Internet and parse 50 websites (some RSS feeds, some HTML) and create one big RSS feed for many clients to consume.

A: 

I would avoid this problem like the plague.

Web scraping is a world of pain. The websites will change underneath you and you will always be updating your code to deal with it.

If you are comfortable with constant maintenance just to get the text of a div, then I wish you good luck.

Genericrich
A: 

I think BeautifulSoup can run on GAE, so all your scraping needs are handled :D Also, GAE has a URL Fetch API (urlfetch) for retrieving pages. The only problem I think you might have is not having enough time to get the data (the 30-second request limit).
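
To give a feel for the scraping step, here is a minimal sketch that pulls the text out of a single div. It is an illustration only: it uses Python 3's standard-library html.parser in place of BeautifulSoup so that it runs with no dependencies, and the names DivTextExtractor and div_text, as well as the "story" id in the usage line, are made up for this example.

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collect the text inside the first <div> whose id matches target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0       # 0 = outside the target div, >0 = nesting level inside it
        self.done = False    # set once the target div has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div" or self.done:
            return
        if self.depth > 0:
            # A nested div inside the target: track it so we know when we leave.
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def div_text(html, target_id):
    """Return the whitespace-normalised text of the div, or "" if it is absent."""
    parser = DivTextExtractor(target_id)
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())
```

Usage would look like div_text(page_html, "story"). The moment the site renames that id or restructures the page, the function quietly returns an empty string, which is the maintenance burden mentioned in the other answers.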

I am working on a similar project, and I've decided that it's easier to prepare the data on another server and push it to GAE.
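
Wherever the data gets prepared, producing the feed itself is the easy part. A rough sketch, assuming each scraped entry has already been reduced to a dict with "title", "link" and "description" keys (that dict shape and the build_rss name are my own choices for illustration, not anything GAE provides), using Python 3's standard-library ElementTree:

```python
import xml.etree.ElementTree as ET


def build_rss(title, link, description, items):
    """Return an RSS 2.0 document as a string.

    items is an iterable of dicts with "title", "link" and "description" keys.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")

    # The three channel elements that RSS 2.0 requires.
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = description

    for entry in items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = entry["title"]
        ET.SubElement(item, "link").text = entry["link"]
        ET.SubElement(item, "description").text = entry["description"]

    # ElementTree escapes characters such as & and < in the text for us.
    return ET.tostring(rss, encoding="unicode")
```

The resulting string is what you would push to (or serve from) GAE; a real feed would probably also want pubDate and guid on each item.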

Jon Romero
A: 

You might also want to look into Yahoo! Query Language (YQL).

lupefiasco