Hi,

I'm building a web application, and I need to use an architecture that allows me to run it on two servers.

The application scrapes information from other sites, both periodically and on input from the end user. To do this I'm using PHP + cURL to scrape the information, and PHP or Python to parse it and store the results in a MySQL DB.

Then I will use Python to run some algorithms on the data; this will happen both periodically and on input from the end user. I'm going to cache some of the results in the MySQL DB, and sometimes, if a result is specific to the user, skip storing it and just serve it to the user directly.

I'm thinking of using PHP for the website front end on a separate web server, and running the PHP spider, the MySQL DB and the Python code on another server.

As you can see, I'm fairly clueless. I'm familiar with using PHP, MySQL and the basics of Python, but bringing all of this together with something more complex than a cron job is new to me.

How do I go about implementing this? What framework(s) should I use?
Is MVC a good architecture for this? (I'm new to MVC, architectures, etc.)
Is CakePHP a good solution? If so, will I be able to control and monitor the Python code from it?

+1  A: 

I think you already have a clear idea of how to organize your layers.

First of all, you will need a web framework for your front end.
You have many choices here; CakePHP, AFAIK, is a good choice, and it is designed to push you towards the MVC design pattern.
Then you will need to design your database to store what users want spidered.
Your DB will be accessed by your web application to store user requests, by your PHP script to know what to scrape, and finally by your Python batch to confirm to the users that the requested data is available.
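
A minimal sketch of such a table; the names (scrape_requests, the status values, the connection credentials) are hypothetical placeholders to adapt to your own design:

    import MySQLdb  # assumed driver; any MySQL client library works the same way

    # Hypothetical table: one row per scrape request, with a status column that
    # each batch advances (new -> running -> scraped -> done). Names are placeholders.
    SCHEMA = """
    CREATE TABLE IF NOT EXISTS scrape_requests (
        id         INT AUTO_INCREMENT PRIMARY KEY,
        user_id    INT NOT NULL,
        url        VARCHAR(2048) NOT NULL,
        status     ENUM('new', 'running', 'scraped', 'done') DEFAULT 'new',
        raw_html   MEDIUMTEXT,
        result     MEDIUMTEXT,
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    )
    """

    conn = MySQLdb.connect(host="localhost", user="app", passwd="secret", db="scraper")
    conn.cursor().execute(SCHEMA)
    conn.commit()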

A possible over-simplified scenario:

  1. User registers on your site.
  2. User asks for a random page to be grabbed from Wikipedia.
  3. The request is stored through the CakePHP application in the DB.
  4. A cron PHP batch starts and checks the DB for new requests.
  5. The batch finds the new request and scrapes the page from Wikipedia.
  6. The batch updates the DB row with a "scraped" flag.
  7. A cron Python batch starts and checks the DB for newly scraped rows (sketched in the code below).
  8. The batch finds the new scraped flag and parses the Wikipedia page to extract some tags.
  9. The batch updates the DB row with a "done" flag.
  10. The user finds the requested information on his profile.
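
A rough Python sketch of steps 7-9, assuming the hypothetical scrape_requests table above and BeautifulSoup for the parsing (any HTML parser would do):

    import MySQLdb
    from bs4 import BeautifulSoup  # assumed parser; swap in whatever you prefer

    def process_scraped_rows():
        conn = MySQLdb.connect(host="localhost", user="app", passwd="secret", db="scraper")
        cur = conn.cursor()

        # Step 7: pick up rows the PHP batch has already scraped.
        cur.execute("SELECT id, raw_html FROM scrape_requests WHERE status = 'scraped'")
        for row_id, raw_html in cur.fetchall():
            # Step 8: parse the stored page and extract some tags (here: h2 headings).
            soup = BeautifulSoup(raw_html, "html.parser")
            headings = [h.get_text(strip=True) for h in soup.find_all("h2")]

            # Step 9: store the result and mark the request as done.
            cur.execute(
                "UPDATE scrape_requests SET result = %s, status = 'done' WHERE id = %s",
                ("\n".join(headings), row_id),
            )
        conn.commit()
        conn.close()

    if __name__ == "__main__":
        process_scraped_rows()

Run from cron, this closes the loop: the user's profile page only ever reads rows whose status is "done".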
systempuntoout
+1  A: 

How do I go about implementing this?

Too big a question for an answer here. Certainly you don't want two sets of scraping code (one for scheduled runs, one for on-demand runs). In addition, you really don't want to be running a job which will take an indefinite time to complete within the thread generated by a request to your web server. User requests for a scrape should be run via the scheduling mechanism and reported back to users (although if necessary you could use Ajax polling to give the illusion that it's happening in the same thread).
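
A sketch of that enqueue-then-poll pattern, shown in Python purely for illustration (in your setup this part would be PHP on the front end) and reusing the hypothetical scrape_requests table from the other answer: the web request merely records a job and returns its id, and the Ajax poll just reports the job's status.

    import MySQLdb

    def enqueue_scrape(conn, user_id, url):
        """Called from the web request: record the job and return its id immediately."""
        cur = conn.cursor()
        cur.execute(
            "INSERT INTO scrape_requests (user_id, url, status) VALUES (%s, %s, 'new')",
            (user_id, url),
        )
        conn.commit()
        return cur.lastrowid

    def job_status(conn, job_id):
        """Called from the Ajax poll: report how far the job has got."""
        cur = conn.cursor()
        cur.execute("SELECT status FROM scrape_requests WHERE id = %s", (job_id,))
        row = cur.fetchone()
        return row[0] if row else "unknown"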

What frame work(s) should I use?

Frameworks are not magic bullets, and you shouldn't choose a framework based primarily on the nature of the application you are writing. Certainly, if specific, critical functionality is precluded by a particular framework, then you are using the wrong framework; but in my experience that has never been the case: you just need to write some code yourself.

using something more complex than a cron job

Yes, a cron job is probably not the right way to go, for lots of reasons. If it were me, I'd look at writing a daemon which would schedule scrapes (and accept connections from web page scripts to enqueue additional scrapes), but I'd run the scrapes themselves as separate processes.
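
A very stripped-down sketch of that idea, again assuming the hypothetical scrape_requests table from the other answer. The "accept connections" part is approximated here by polling the database (a socket listener could replace the sleep loop), and scrape_one.py is an assumed worker script that fetches the page and updates its row when finished:

    import subprocess
    import time

    import MySQLdb

    POLL_SECONDS = 30

    def main():
        conn = MySQLdb.connect(host="localhost", user="app", passwd="secret", db="scraper")
        while True:
            cur = conn.cursor()
            cur.execute("SELECT id, url FROM scrape_requests WHERE status = 'new'")
            for job_id, url in cur.fetchall():
                # Claim the job so it is not picked up again on the next pass.
                cur.execute(
                    "UPDATE scrape_requests SET status = 'running' WHERE id = %s",
                    (job_id,),
                )
                conn.commit()
                # Run the scrape as a separate process so a slow or hung site
                # never blocks the daemon (or a web server thread).
                subprocess.Popen(["python", "scrape_one.py", str(job_id), url])
            time.sleep(POLL_SECONDS)

    if __name__ == "__main__":
        main()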

Is MVC a good architecture for this? (I'm new to MVC, architectures etc.)

No. Don't start by asking whether a pattern fits the application. Patterns are a useful teaching tool, but they describe what code is, not what it will be.

(Your application might include some MVC patterns - but it should also include lots of other ones).

C.

symcbean