We are building a site that lets users collect and store their favorite products from all over the Internet in one spot. We have an algorithm that finds the correct product image by reading the page's source code. 80% of the sites work correctly, but 2 large companies are redirecting us from the product page to their homepage.

For example, this product http://www.gap.com/browse/product.do?pid=741123&kwid=1&sem=false&sdReferer=http://www.gap.com/products/graphic-ts-toddler-boy-clothing-C35792.jsp# picks up the header for the gap.com main page and not for the product at hand.

How do we get around this redirect and allow our algorithm to collect the correct image by reading the correct source code?

A: 

I'd imagine you need to change your scraper's user agent string to something that looks like a normal browser (you're probably sending a string like curl or wget by default).

There's a good chance, though, that if you're sending enough traffic their way they'll eventually notice and shut you down in a harder-to-circumvent manner.
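
As a rough sketch of the user agent change, assuming the scraper is written in Python and uses the standard library's urllib (the question doesn't say what it actually uses), the request could be built like this. The BROWSER_UA value and the function names are made up for illustration; any current browser's user agent string would serve.

```python
import urllib.request

# Hypothetical browser-like User-Agent string (an assumption, not a required value).
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)

def build_page_request(url, user_agent=BROWSER_UA):
    """Build a GET request that sends a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

def fetch_page(url):
    """Fetch the page; return the final URL (after any redirects) and the body."""
    req = build_page_request(url)
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.geturl(), resp.read()
```

Comparing the final URL returned by fetch_page with the URL you requested is a cheap way to tell whether the site still bounced you to its homepage.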

ceejayoz
+2  A: 

First, you might ask a lawyer to study the terms of service of your target web sites, and make sure that you won't run into legal problems.

On the technical side, set the Referer [sic] header when requesting the image. The referrer for an image should be the page in which it is embedded. The server may check that to ensure that the image is being requested to satisfy a page render by a browser, rather than by an image-harvesting screen scraper.


After a bit of testing with the image in question, it doesn't look like the Referer header is required. Perhaps it is simply rejecting an unfamiliar user-agent, or is keying off some other oddity in the request, like a missing Accept header, etc.
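
If you want to rule those headers out one at a time, something like the following might help. It is only a sketch, assuming Python's standard urllib; build_image_request and the Accept value are illustrative choices, not anything the target site is known to require.

```python
import urllib.request

def build_image_request(image_url, page_url, user_agent):
    """Build an image request that resembles one a browser makes while rendering page_url."""
    headers = {
        "User-Agent": user_agent,
        # The referrer for an embedded image is the page that embeds it.
        "Referer": page_url,
        # Illustrative Accept value; browsers send something along these lines for images.
        "Accept": "image/*,*/*;q=0.8",
    }
    return urllib.request.Request(image_url, headers=headers)
```

Pass the result to urllib.request.urlopen, then drop one header at a time to see which one the server actually cares about.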

erickson