I've been doing some scraping with PHP and getting some strange results on a particular domain. For example, when I download this page:
http://pitchfork.com/reviews/tracks/
It works fine. However, if I try to download this page:
http://pitchfork.com/reviews/tracks/1/
It returns an incomplete page, even though both URLs show exactly the same content in a browser. All subsequent pages (tracks/2/, tracks/3/, etc.) also return incomplete data.
It seems to be a problem with the way the URLs are formed during pagination. Most other sections on the site exhibit the same behaviour (the landing page works, but not subsequent pages). One exception is this section:
http://pitchfork.com/forkcast/
There, forkcast/2/ etc. work fine. This may be because it is only one directory deep, whereas most other sections are multiple directories deep.
I seem to have a grasp on WHAT is causing the problem, but not WHY or HOW it can be fixed.
Any ideas?
I have tried using file_get_contents() and cURL and both give the same result.
Interestingly, on every page that does not work, the incomplete response is roughly 16,000 characters long. Is this a clue?
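For anyone who wants to reproduce this, here is a minimal sketch of how the two responses could be compared. It is not my exact script: the helper names parse_header_line() and fetch_report(), the User-Agent string and the timeout are all assumptions made for illustration. It records the status code and response headers next to strlen() of the body, so a body cut short can be told apart from a genuinely different response.

```php
<?php
// Hypothetical helper: split one raw header line into [lowercased name, value].
// Returns null for lines without a colon (status line, blank separator line).
function parse_header_line(string $line): ?array
{
    $parts = explode(':', $line, 2);
    if (count($parts) !== 2) {
        return null;
    }
    return [strtolower(trim($parts[0])), trim($parts[1])];
}

// Hypothetical helper: fetch $url with cURL and report what actually came back.
function fetch_report(string $url): array
{
    $headers = [];
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,   // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,   // follow redirects (headers of all hops are merged)
        CURLOPT_TIMEOUT        => 30,     // assumed value, in seconds
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; test-scraper)', // assumed value
        CURLOPT_HEADERFUNCTION => function ($ch, $line) use (&$headers) {
            $pair = parse_header_line($line);
            if ($pair !== null) {
                $headers[$pair[0]] = $pair[1];
            }
            return strlen($line);         // cURL expects the number of bytes handled
        },
    ]);
    $body = curl_exec($ch);

    return [
        'status'  => curl_getinfo($ch, CURLINFO_HTTP_CODE),
        'error'   => curl_error($ch),
        'headers' => $headers,
        'length'  => is_string($body) ? strlen($body) : 0,
    ];
}
```

Calling print_r(fetch_report('http://pitchfork.com/reviews/tracks/')) and the same for tracks/1/ puts the two reports side by side. If the status, Content-Length or Transfer-Encoding differ between them, that would at least narrow down where the page is being cut off.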
I have created a test page where you can see the difference:
http://fingerfy.com/test.php?url=http://pitchfork.com/reviews/tracks/
http://fingerfy.com/test.php?url=http://pitchfork.com/reviews/tracks/1/
It prints the strlen() and the content of the downloaded page (and rewrites relative URLs as absolute ones, so that the CSS loads correctly).
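The URL rewriting is not part of the problem, but for completeness, a simplified sketch of that step is below. It is not the exact code behind the test page: absolutize_urls() is a hypothetical helper, and it only handles root-relative href and src attributes (values starting with a single "/"), leaving everything else untouched.

```php
<?php
// Hypothetical helper: prefix root-relative href/src values with the site origin.
// Protocol-relative URLs ("//host/...") and already-absolute URLs are left alone.
function absolutize_urls(string $html, string $origin): string
{
    $origin = rtrim($origin, '/');
    $result = preg_replace_callback(
        '~\b(href|src)=(["\'])/(?!/)~i',
        function (array $m) use ($origin) {
            // $m[1] is the attribute name, $m[2] the opening quote character
            return $m[1] . '=' . $m[2] . $origin . '/';
        },
        $html
    );
    // preg_replace_callback returns null on a regex error; fall back to the input
    return $result ?? $html;
}
```

With $html holding the downloaded page, the test output is then just echo strlen($html); followed by echo absolutize_urls($html, 'http://pitchfork.com');. The length is taken before rewriting, so the rewriting does not affect the numbers reported above.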
Any hints would be great!
UPDATE: Mowser, which optimizes pages for mobile, has no trouble with these pages (http://mowser.com/web/pitchfork.com/reviews/tracks/2/), so there must be a way to do this without it failing.