I am developing an application that is, to put it simply, a niche-based search engine. Within the application I have included a function crawl() which crawls a website and then uses the _collectData() function to store the relevant data from each page in the "products" table, as described in the function. The visited pages are recorded in a database.
The crawler works pretty well, just as described, except for two problems: timeouts and memory. I've managed to fix the timeout error, but the memory problem remains. I know that simply raising memory_limit doesn't actually fix anything.
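For what it's worth, the growth is easy to watch with PHP's built-in memory functions. The helper below is just a debugging sketch (it's not part of my crawler); calling it at the top of _crawl() makes it easy to see how much each page adds:

function _logMemory($url) {
    // memory_get_usage(true) / memory_get_peak_usage(true) report allocated bytes
    $current = memory_get_usage(true) / 1048576;      // bytes -> MB
    $peak    = memory_get_peak_usage(true) / 1048576;
    error_log(sprintf('%.1f MB (peak %.1f MB) - %s', $current, $peak, $url));
}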
The function is run by visiting "EXAMPLE.COM/products/crawl".
Is a memory leak inevitable with a PHP web crawler, or is there something I'm doing wrong or failing to do?
Thanks in advance. (Code below.)
function crawl() {
    $this->_crawl('http://www.example.com/', 'http://www.example.com');
}

/**
 * Finds every link in $start, collects data from it,
 * and recursively crawls it.
 *
 * @param $start  the web page where the crawler starts
 * @param $domain the domain in which to stay
 */
function _crawl($start, $domain) {
    $dom = new DOMDocument();
    @$dom->loadHTMLFile($start);
    $xpath = new DOMXPath($dom);
    $hrefs = $xpath->evaluate("/html/body//a"); // get all <a> elements

    for ($i = 0; $i < $hrefs->length; $i++) {
        $href = $hrefs->item($i);
        $url  = $href->getAttribute('href'); // get the href value

        if (!(strpos($url, 'http') !== false)) { // resolve relative links
            $url = $domain . '/' . $url;
        }

        // follow only links that stay on the domain and are not already in the database
        if ($this->Page->find('count', array('conditions' => array('Page.url' => $url))) < 1
            && (strpos($url, $domain) !== false)) {
            $this->Page->create();
            $this->Page->set('url', $url);
            $this->Page->set('indexed', date('Y-m-d H:i:s'));
            $this->Page->save(); // record this URL as visited

            $this->_collectData($url); // collect this page's data
            $this->_crawl($url, $domain); // recurse into this page
        }
    }
}
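One idea I've considered (a sketch only, untested, using the same CakePHP Page model and _collectData() helper) is replacing the recursion with an explicit queue, so that only one DOMDocument is alive at a time instead of one per recursion level:

function _crawlIterative($start, $domain) {
    $queue = array($start); // URLs waiting to be crawled

    while (!empty($queue)) {
        $page = array_shift($queue);

        $dom = new DOMDocument();
        @$dom->loadHTMLFile($page);
        $xpath = new DOMXPath($dom);
        $hrefs = $xpath->evaluate("/html/body//a");

        for ($i = 0; $i < $hrefs->length; $i++) {
            $url = $hrefs->item($i)->getAttribute('href');
            if (strpos($url, 'http') === false) { // resolve relative links
                $url = $domain . '/' . $url;
            }
            if (strpos($url, $domain) !== false
                && $this->Page->find('count', array('conditions' => array('Page.url' => $url))) < 1) {
                $this->Page->create();
                $this->Page->set('url', $url);
                $this->Page->set('indexed', date('Y-m-d H:i:s'));
                $this->Page->save();
                $this->_collectData($url);
                $queue[] = $url; // visit later instead of recursing now
            }
        }

        unset($dom, $xpath, $hrefs); // let this page's DOM be freed before loading the next
    }
}

My reasoning: the recursive version keeps every ancestor page's $dom in scope until its whole subtree is finished, so the number of live DOM trees grows with crawl depth, while the queue version should hold at most one. Would that actually explain the memory behaviour I'm seeing, or is the leak somewhere else?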