I have a personal web site that crawls and collects MP3s from my favorite music blogs for later listening...

The way it works is that a cron job runs a .php script once every minute that crawls the next blog in the DB. The results are put into the DB, and then a second .php script crawls the collected links.

The scripts only crawl two levels down into each site: the main page (www.url.com) and the links on that page (www.url.com/post1, www.url.com/post2).

My problem is that as my collection of blogs grows, each one is only scanned once every 20 to 30 minutes, and when I add a new blog to the script there is a backlog in scanning the links, as only one is processed every minute.

Due to how PHP works, it seems I cannot just let the scripts process more than a limited number of links per run, because of script execution times, memory limits, timeouts, etc.

Also I cannot run multiple instances of the same script as they will overwrite each other in the DB.

What is the best way I could speed this process up?

Is there a way I can have multiple scripts writing to the DB, but written so they do not overwrite each other and instead queue their results?

Is there some way to create threading in PHP so that a script can process links at its own pace?

Any ideas?

Thanks.

A: 

CLI scripts are not limited by the max execution time. Memory limits are not normally a problem unless you have large sets of data in memory at any one time. Timeouts should be handled gracefully by your application.
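For example, if the fetches are done with cURL (an assumption; adapt to however your scripts download pages), graceful handling can be as simple as a hard deadline per request and moving on when it trips:

$url = 'http://www.url.com/post1';             // the next link pulled from your DB
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);  // give up connecting after 10 seconds
curl_setopt($ch, CURLOPT_TIMEOUT, 30);         // give up on the whole transfer after 30 seconds
$html = curl_exec($ch);

if ($html === false) {
    // a timeout or other failure: log it and leave the link queued for the next run
    error_log('Fetch failed for ' . $url . ': ' . curl_error($ch));
} else {
    // hand $html to your existing parsing/DB code here
}
curl_close($ch);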

It should be possible to change your code so that you can run several instances at once - you would have to post the script for anyone to advise further though. As Peter says, you probably need to look at the design. Providing the code in a pastebin will help us to help you :)

David Caunt
I guess the main problem is that if I run two or more instances of, say, scan.php every minute, then each will read the same value from the DB at the same time (the next link to crawl), so I am not getting any additional speed, just errors or wrong results instead of a queue.
ian
@ian: Maybe I am thinking about this wrong: could you have one database table per site, and run a separate script to process the links in each table?
Adam Bernier
@Adam: Hrmmm. I suppose I could... The only reason for the links to be all together is so the script knows the correct one to process next... That would work well, because even a popular music blog is not likely to update more than once a day, so it's not as if there is a bulk of links to go through. I can't automate the creation of cron jobs with my server, though, so that might be an issue...
ian
+1  A: 

This surely isn't the answer to your question, but if you're willing to learn Python I recommend you look at Scrapy, an open source web crawler/scraper framework which should fill your needs. Again, it's not PHP but Python. It is, however, very distributable, etc... I use it myself.

Hannson
Yes, I have always been pretty sure PHP is not the best language to write this in, but it's what I know. Will take a look at Scrapy.
ian
@ian: good to see you're open to using other tools. There are many ways to handle asynchronous processing in Python.
Adam Bernier
+1  A: 

Due to how PHP works, it seems I cannot just let the scripts process more than a limited number of links per run, because of script execution times, memory limits, timeouts, etc.

The memory limit is only a problem if your code leaks memory. You should fix that rather than raising the memory limit. Script execution time is a security measure, which you can simply disable for your CLI scripts.
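In a CLI script that might look like this (a sketch; next_link(), fetch() and process() are placeholders for whatever your script already does):

// crawl.php, run from the command line
set_time_limit(0);               // remove the execution time limit (the CLI default is already 0)

while ($url = next_link()) {     // placeholder: pull one pending link at a time from the DB
    $html = fetch($url);         // placeholder: download the page
    process($html);              // placeholder: parse it and store the MP3 links
    unset($html);                // drop the page before the next iteration so memory stays flat
}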

Also I cannot run multiple instances of the same script as they will overwrite each other in the DB.

You can construct your application in such a way that instances don't overwrite each other. A typical way to do it would be to partition per site; e.g. start a separate script for each site you want to crawl.
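For example (a sketch; the links table layout, the crawl() helper, and passing the site id as a command-line argument are all assumptions):

// scan.php -- each instance is started with a different site id, e.g. "php scan.php 3",
// so two instances never touch the same rows
$siteId = (int)($argv[1] ?? 0);
if ($siteId <= 0) {
    exit("usage: php scan.php <site-id>\n");
}

$db = new PDO('mysql:host=localhost;dbname=crawler', 'user', 'pass'); // placeholder credentials
$stmt = $db->prepare('SELECT id, url FROM links WHERE site_id = ? AND scanned = 0');
$stmt->execute([$siteId]);

foreach ($stmt as $link) {
    crawl($link['url']);                                              // placeholder crawl function
    $db->prepare('UPDATE links SET scanned = 1 WHERE id = ?')->execute([$link['id']]);
}

One cron entry per site id then gives you one worker per site, without the workers stepping on each other.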

troelskn
+1  A: 

The idea for running parallel scanners, sketched here as PHP with PDO ($db is assumed to be a PDO connection, and scan_targets a table with id, url, being_scanned and scanned_at columns):

function start_a_scan(PDO $db) {
    // Start a MySQL transaction (row locking needs InnoDB)
    $db->beginTransaction();

    // Get the first entry that has timed out and is not being scanned by someone,
    // and acquire an exclusive lock on the affected row (hence FOR UPDATE)
    $row = $db->query(
        "SELECT * FROM scan_targets
         WHERE being_scanned = FALSE
           AND scanned_at < NOW() - INTERVAL 60 SECOND
         ORDER BY scanned_at ASC
         LIMIT 1 FOR UPDATE"
    )->fetch(PDO::FETCH_ASSOC);

    if ($row === false) {          // nothing due for a scan right now
        $db->rollBack();
        return;
    }

    // Let everyone know we're scanning this one, so they'll keep out
    $db->prepare("UPDATE scan_targets SET being_scanned = TRUE WHERE id = ?")
       ->execute([$row['id']]);

    // Commit the transaction, releasing the lock
    $db->commit();

    // Scan
    scan_target($row['url']);

    // Update the entry's state to allow it to be scanned again in the future
    $db->prepare("UPDATE scan_targets SET being_scanned = FALSE, scanned_at = NOW() WHERE id = ?")
       ->execute([$row['id']]);
}

You'd probably also need a 'cleaner' that periodically checks whether there are any aborted scans hanging around, and resets their state so they can be scanned again.
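One way to build that cleaner, assuming you add a started_at column that is stamped in the same UPDATE that sets being_scanned to true (both the column and the 10-minute threshold are assumptions, not part of the sketch above):

// cleaner.php -- run from cron every few minutes to release scans that never finished.
// Assumes a started_at column, set whenever being_scanned is flipped to TRUE.
$db = new PDO('mysql:host=localhost;dbname=crawler', 'user', 'pass'); // placeholder credentials
$db->exec(
    "UPDATE scan_targets
     SET being_scanned = FALSE
     WHERE being_scanned = TRUE
       AND started_at < NOW() - INTERVAL 10 MINUTE"
);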

And then you can have several scan processes running in parallel! Yey!

cheers!

EDIT: I forgot that you need to make the first SELECT with FOR UPDATE. Read more here

0scar
Hey, thanks. There's some stuff in there I don't know, but I will read up on it.
ian
NP, read up, and then just pop back in here on SO and fill in the blanks :)
0scar
+1  A: 

USE CURL MULTI!

curl_multi will let you process the pages in parallel.

http://us3.php.net/curl

Most of the time you are waiting on the websites; doing the DB insertions and HTML parsing is orders of magnitude faster.

You create a list of the blogs you want to scrape, send them out to curl_multi, wait, and then serially process the results of all the calls. You can then do a second pass on the next level down.
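A minimal sketch of that first pass (the URL list would really come from your DB, and parse_for_mp3_links() is a placeholder for your existing parsing step):

$blogUrls = ['http://blog-one.example/', 'http://blog-two.example/'];  // would come from your DB

$mh = curl_multi_init();
$handles = [];
foreach ($blogUrls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of echoing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);           // don't hang on a dead blog
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}

// run all transfers at once, sleeping until there is activity
do {
    $status = curl_multi_exec($mh, $running);
    if ($running) {
        curl_multi_select($mh);
    }
} while ($running && $status === CURLM_OK);

// now process the results serially -- parsing is fast compared to the downloads
foreach ($handles as $url => $ch) {
    $html = curl_multi_getcontent($ch);
    // parse_for_mp3_links($html);                   // placeholder for your parsing/DB step
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);

The same pattern then works for the second pass over the post URLs collected from the front pages.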

http://www.developertutorials.com/blog/php/parallel-web-scraping-in-php-curl-multi-functions-375/

Byron Whitlock
Thanks I will look into that.
ian