I have a personal web site that crawls and collects MP3s from my favorite music blogs for later listening...

The way it works is that a cron job runs a .php script once every minute that crawls the next blog in the DB. The results are put into the DB, and then a second .php script crawls the collected links.

The scripts only crawl two levels down into each site: the main page (www.url.com) and the links on that page (www.url.com/post1, www.url.com/post2).

My problem is that as my collection of blogs grows, each one is only scanned once every 20 to 30 minutes, and when I add a new blog to the script there is a backlog in scanning the links, as only one is processed every minute.

Due to how PHP works, it seems I cannot just let the scripts process more than a limited number of links per run, because of script execution times, memory limits, timeouts, etc.

Also I cannot run multiple instances of the same script as they will overwrite each other in the DB.

What is the best way I could speed this process up?

Is there a way I can have multiple scripts writing to the DB, but written so they do not overwrite each other and instead queue their results?

Is there some way to create threading in PHP so that a script can process links at its own pace?

Any ideas?

Thanks.

A: 

CLI scripts are not limited by the max execution time. Memory limits are not normally a problem unless you have large sets of data in memory at any one time. Timeouts should be handled gracefully by your application.
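For example, if the fetches are done with cURL (an assumption; adapt to however your scripts download pages), graceful handling can be as simple as a hard deadline per request and moving on when it trips:

$url = 'http://www.url.com/post1';             // the next link pulled from your DB
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);  // give up connecting after 10 seconds
curl_setopt($ch, CURLOPT_TIMEOUT, 30);         // give up on the whole transfer after 30 seconds
$html = curl_exec($ch);

if ($html === false) {
    // a timeout or other failure: log it and leave the link queued for the next run
    error_log('Fetch failed for ' . $url . ': ' . curl_error($ch));
} else {
    // hand $html to your existing parsing/DB code here
}
curl_close($ch);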

It should be possible to change your code so that you can run several instances at once - you would have to post the script for anyone to advise further though. As Peter says, you probably need to look at the design. Providing the code in a pastebin will help us to help you :)

David Caunt
I guess the main problem is that if I run two or more instances of, say, scan.php every minute, then each will read the same value from the DB at the same time (the next link to crawl), so I am not getting any additional speed, just errors or wrong results instead of a queue.
ian
@ian: Maybe I am thinking about this wrong: could you have one database table per site, and run a separate script to process the links in each table?
Adam Bernier
@Adam: Hrmmm. I suppose I could... The only reason for the links to be all together is so the script knows the correct one to process next... That would work well, because even a popular music blog is not likely to update more than once a day, so it's not as if there is a bulk of links to go through. I can't automate the creation of cron jobs with my server, though, so that might be an issue...
ian
+1  A: 

This surely isn't the answer to your question, but if you're willing to learn Python I recommend you look at Scrapy, an open source web crawler/scraper framework which should fill your needs. Again, it's not PHP but Python. It is, however, very distributable, etc... I use it myself.

Hannson
Yes, I have always been pretty sure PHP is not the best language to write this in, but it's what I know. Will take a look at Scrapy.
ian
@ian: good to see you're open to using other tools. There are many ways to handle asynchronous processing in Python.
Adam Bernier
+1  A: 

Due to how PHP works, it seems I cannot just let the scripts process more than a limited number of links per run, because of script execution times, memory limits, timeouts, etc.

The memory limit is only a problem if your code leaks memory. You should fix that rather than raising the memory limit. Script execution time is a security measure, which you can simply disable for your CLI scripts.
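In a CLI script that might look like this (a sketch; next_link(), fetch() and process() are placeholders for whatever your script already does):

// crawl.php, run from the command line
set_time_limit(0);               // remove the execution time limit (the CLI default is already 0)

while ($url = next_link()) {     // placeholder: pull one pending link at a time from the DB
    $html = fetch($url);         // placeholder: download the page
    process($html);              // placeholder: parse it and store the MP3 links
    unset($html);                // drop the page before the next iteration so memory stays flat
}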

Also I cannot run multiple instances of the same script as they will overwrite each other in the DB.

You can construct your application in such a way that instances don't overwrite each other. A typical way to do it would be to partition per site; e.g. start a separate script for each site you want to crawl.
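For example (a sketch; the links table layout, the crawl() helper, and passing the site id as a command-line argument are all assumptions):

// scan.php -- each instance is started with a different site id, e.g. "php scan.php 3",
// so two instances never touch the same rows
$siteId = (int)($argv[1] ?? 0);
if ($siteId <= 0) {
    exit("usage: php scan.php <site-id>\n");
}

$db = new PDO('mysql:host=localhost;dbname=crawler', 'user', 'pass'); // placeholder credentials
$stmt = $db->prepare('SELECT id, url FROM links WHERE site_id = ? AND scanned = 0');
$stmt->execute([$siteId]);

foreach ($stmt as $link) {
    crawl($link['url']);                                              // placeholder crawl function
    $db->prepare('UPDATE links SET scanned = 1 WHERE id = ?')->execute([$link['id']]);
}

One cron entry per site id then gives you one worker per site, without the workers stepping on each other.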

troelskn
+1  A: 

The idea for running parallel scanners, sketched here as PHP with PDO ($db is assumed to be a PDO connection, and scan_targets a table with id, url, being_scanned and scanned_at columns):

function start_a_scan(PDO $db) {
    // Start a MySQL transaction (row locking needs InnoDB)
    $db->beginTransaction();

    // Get the first entry that has timed out and is not being scanned by someone,
    // and acquire an exclusive lock on the affected row (hence FOR UPDATE)
    $row = $db->query(
        "SELECT * FROM scan_targets
         WHERE being_scanned = FALSE
           AND scanned_at < NOW() - INTERVAL 60 SECOND
         ORDER BY scanned_at ASC
         LIMIT 1 FOR UPDATE"
    )->fetch(PDO::FETCH_ASSOC);

    if ($row === false) {          // nothing due for a scan right now
        $db->rollBack();
        return;
    }

    // Let everyone know we're scanning this one, so they'll keep out
    $db->prepare("UPDATE scan_targets SET being_scanned = TRUE WHERE id = ?")
       ->execute([$row['id']]);

    // Commit the transaction, releasing the lock
    $db->commit();

    // Scan
    scan_target($row['url']);

    // Update the entry's state to allow it to be scanned again in the future
    $db->prepare("UPDATE scan_targets SET being_scanned = FALSE, scanned_at = NOW() WHERE id = ?")
       ->execute([$row['id']]);
}

You'd probably also need a 'cleaner' that periodically checks whether there are any aborted scans hanging around, and resets their state so they can be scanned again.
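One way to build that cleaner, assuming you add a started_at column that is stamped in the same UPDATE that sets being_scanned to true (both the column and the 10-minute threshold are assumptions, not part of the sketch above):

// cleaner.php -- run from cron every few minutes to release scans that never finished.
// Assumes a started_at column, set whenever being_scanned is flipped to TRUE.
$db = new PDO('mysql:host=localhost;dbname=crawler', 'user', 'pass'); // placeholder credentials
$db->exec(
    "UPDATE scan_targets
     SET being_scanned = FALSE
     WHERE being_scanned = TRUE
       AND started_at < NOW() - INTERVAL 10 MINUTE"
);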

And then you can have several scan processes running in parallel! Yey!

cheers!

EDIT: I forgot that you need to make the first SELECT with FOR UPDATE. Read more here

0scar
Hey, thanks. There's some stuff in there I don't know, but I will read up on it.
ian
NP, read up, and then just pop back in here on SO and fill in the blanks :)
0scar
+1  A: 

USE CURL MULTI!

curl_multi will let you process the pages in parallel.

http://us3.php.net/curl

Most of the time you are waiting on the websites; doing the DB insertions and HTML parsing is orders of magnitude faster.

You create a list of the blogs you want to scrape, send them out to curl_multi, wait, and then serially process the results of all the calls. You can then do a second pass on the next level down.
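A minimal sketch of that first pass (the URL list would really come from your DB, and parse_for_mp3_links() is a placeholder for your existing parsing step):

$blogUrls = ['http://blog-one.example/', 'http://blog-two.example/'];  // would come from your DB

$mh = curl_multi_init();
$handles = [];
foreach ($blogUrls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of echoing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);           // don't hang on a dead blog
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}

// run all transfers at once, sleeping until there is activity
do {
    $status = curl_multi_exec($mh, $running);
    if ($running) {
        curl_multi_select($mh);
    }
} while ($running && $status === CURLM_OK);

// now process the results serially -- parsing is fast compared to the downloads
foreach ($handles as $url => $ch) {
    $html = curl_multi_getcontent($ch);
    // parse_for_mp3_links($html);                   // placeholder for your parsing/DB step
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);

The same pattern then works for the second pass over the post URLs collected from the front pages.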

http://www.developertutorials.com/blog/php/parallel-web-scraping-in-php-curl-multi-functions-375/

Byron Whitlock
Thanks I will look into that.
ian