I've been doing some scraping with PHP and getting some strange results on a particular domain. For example, when I download this page:

http://pitchfork.com/reviews/tracks/

It works fine. However, if I try to download this page:

http://pitchfork.com/reviews/tracks/1/

It returns an incomplete page, even though the content is exactly the same. All subsequent pages (tracks/2/, tracks/3/, etc.) also return incomplete data.

It seems to be a problem with the way the URLs are formed during pagination. Most other sections on the site exhibit the same behaviour (the landing page works, but not subsequent pages). One exception is this section:

http://pitchfork.com/forkcast/

Here, forkcast/2/ and so on work fine. This may be because it is only one directory deep, whereas most other sections are several directories deep.

I seem to have a grasp on WHAT is causing the problem, but not WHY or HOW it can be fixed.

Any ideas?

I have tried using file_get_contents() and cURL and both give the same result.
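
Roughly, the two attempts look like this (a trimmed-down sketch of what I'm running; both report the same truncated length for the paginated URLs):

    <?php
    $url = 'http://pitchfork.com/reviews/tracks/1/';

    // Attempt 1: plain file_get_contents()
    $html1 = file_get_contents($url);
    echo "file_get_contents: " . strlen($html1) . " bytes\n";

    // Attempt 2: cURL with default options
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects, if any
    $html2 = curl_exec($ch);
    curl_close($ch);
    echo "cURL: " . strlen($html2) . " bytes\n";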

Interestingly, on all the pages that do not work, the incomplete page is roughly 16,000 chars long. Is this a clue?

I have created a test page where you can see the difference:

http://fingerfy.com/test.php?url=http://pitchfork.com/reviews/tracks/

http://fingerfy.com/test.php?url=http://pitchfork.com/reviews/tracks/1/

It prints the strlen() and content of the downloaded page (plus it rewrites relative URLs to absolute ones, so that the CSS loads correctly).
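
The test page itself is nothing special; it's roughly equivalent to the sketch below (the real version also does the relative-to-absolute URL rewriting, which I've left out here):

    <?php
    // test.php?url=... -- fetch the given URL and show how long the result is
    $url  = $_GET['url'];
    $html = file_get_contents($url);

    echo 'strlen: ' . strlen($html) . '<br>';
    echo $html; // the real page rewrites relative URLs before echoing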

Any hints would be great!

UPDATE: Mowser, which optimizes pages for mobile, has no trouble with these pages (http://mowser.com/web/pitchfork.com/reviews/tracks/2/), so there must be a way to do this without it failing.

A: 

It looks like Pitchfork is running a CMS with "human" URLs. That would mean that /reviews/tracks brings up a "homepage" with multiple postings listed, while /reviews/tracks/1 brings up only "review #1". It's possible they've configured the CMS to output only a fixed-length excerpt, or that a misconfigured output filter is cutting the individual post pages off early.

I've tried fetching /tracks/1 through /tracks/6 using wget, and they all have different content which terminates at 16,097 bytes exactly, usually in the middle of a tag. So, it's not likely this is anything you can fix on your end, as it's the site itself sending bad data.
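
If you want to reproduce the check without wget, a quick sketch along these lines shows the same cutoff:

    <?php
    // Report where each paginated page stops
    for ($i = 1; $i <= 6; $i++) {
        $html = file_get_contents("http://pitchfork.com/reviews/tracks/$i/");
        echo "/reviews/tracks/$i/: " . strlen($html) . " bytes, ending with '"
            . substr($html, -40) . "'\n";
    }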

Marc B
Yes, definitely something to do with the 'human' URLs. However, there are a few strange things going on: a) the reviews homepage and page 1 should be returning the exact same data (as they do in the browser), yet one works and the other doesn't; b) unlike all the other sections, the forkcast section (http://pitchfork.com/forkcast/) works perfectly fine, including subsequent pages; c) the incomplete pages are always 16 KB. This suggests to me that it is flushing the output at 16 KB rather than outputting the whole page. I still have hope, because the pages are fetched fine when viewed in a browser.
Peter Watts
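
Given that the pages come through complete in a browser, one thing worth trying (just a guess, not a confirmed fix) is making the request look more like a browser: send a User-Agent header and let cURL request and decode gzip, for example:

    <?php
    $ch = curl_init('http://pitchfork.com/reviews/tracks/1/');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    // Pretend to be an ordinary browser
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36');
    // '' makes cURL advertise every encoding it supports and decode the response itself
    curl_setopt($ch, CURLOPT_ENCODING, '');
    $html = curl_exec($ch);
    curl_close($ch);
    echo strlen($html) . " bytes\n";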