My script works with every other link I have tried, and I get the same response with cURL (this version is much shorter, so I prefer it):

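<?php
    $url = $_GET['url'];
    // The second argument (1) makes get_headers() return an associative array.
    $header = get_headers($url, 1);
    print_r($header);
    function get_url($u, $h) {
        // Check the status code in the status line only, e.g. "HTTP/1.1 200 OK".
        if (preg_match('#^HTTP/\S+\s+200\b#', $h[0])) {
            echo file_get_contents($u);
        }
        elseif (preg_match('#^HTTP/\S+\s+301\b#', $h[0])) {
            // Ask for associative headers again so 'Location' exists on the next hop.
            $nh = get_headers($h['Location'], 1);
            get_url($h['Location'], $nh);
        }
    }
    get_url($url, $header);
?>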

But for: http://www.anthropologie.com/anthro/catalog/productdetail.jsp?subCategoryId=HOME-TABLETOP-UTENSILS&id=78110&catId=HOME-TABLETOP&pushId=HOME-TABLETOP&popId=HOME&sortProperties=&navCount=355&navAction=top&fromCategoryPage=true&selectedProductSize=&selectedProductSize1=&color=sil&colorName=SILVER&isProduct=true&isBigImage=&templateType=

And: http://www.urbanoutfitters.com/urban/catalog/productdetail.jsp?itemdescription=true&itemCount=80&startValue=1&selectedProductColor=&sortby=&id=14135412&parentid=A_FURN_BATH&sortProperties=+subCategoryPosition,&navCount=56&navAction=poppushpush&color=&pushId=A_FURN_BATH&popId=A_DECORATE&prepushId=&selectedProductSize=

(and all Anthropologie product links) it does not work. I'm assuming other sites I have not yet found behave the same way. Here is my header response:

Array
(
    [0] => HTTP/1.1 200 OK
    [Server] => Apache
    [X-Powered-By] => Servlet 2.4; JBoss-4.2.0.GA_CP05 (build: SVNTag=JBPAPP_4_2_0_GA_CP05 date=200810231548)/JBossWeb-2.0
    [X-ATG-Version] => version=RENTLUFEQyxBVEdQbGF0Zm9ybS85LjFwMSxBREMgWyBEUFNMaWNlbnNlLzAgIF0=
    [Content-Type] => text/html;charset=ISO-8859-1
    [Date] => Sat, 24 Jul 2010 23:47:47 GMT
    [Content-Length] => 21669
    [Connection] => keep-alive
    [Set-Cookie] => Array
        (
            [0] => JSESSIONID=65CA111ADBF267A3B405C69A325576F8.app46-node2; Path=/
            [1] => visitCount=1; Expires=Fri, 29-May-2026 00:41:07 GMT; Path=/
            [2] => UOCCII:=; Expires=Mon, 23-Aug-2010 23:47:47 GMT; Path=/
            [3] => LastVisited=2010-07-24; Expires=Fri, 29-May-2026 00:41:07 GMT; Path=/
        )

)

I'm guessing maybe it has to do with the cookies? Any ideas?
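One way to test the cookie guess is to send the cookies from the first response back on the second request. The following is only a sketch: `cookie_header_from()` and `fetch_with_cookies()` are hypothetical helper names, and it assumes the headers come from `get_headers($url, 1)` as in the script above.

```php
<?php
// Hypothetical helper: build a "Cookie" request header value from the
// array returned by get_headers($url, 1).
function cookie_header_from(array $headers): string
{
    // Header names may differ in case between servers; normalize the keys.
    $headers = array_change_key_case($headers, CASE_LOWER);
    if (!isset($headers['set-cookie'])) {
        return '';
    }
    $pairs = [];
    // One Set-Cookie header arrives as a string, several as an array.
    foreach ((array) $headers['set-cookie'] as $line) {
        // Keep only "name=value"; drop attributes such as Path and Expires.
        $pairs[] = trim(explode(';', $line, 2)[0]);
    }
    return implode('; ', $pairs);
}

// Hypothetical helper: fetch $url and send the given cookies with the request.
function fetch_with_cookies(string $url, string $cookieHeader)
{
    $http = [];
    if ($cookieHeader !== '') {
        $http['header'] = "Cookie: $cookieHeader\r\n";
    }
    $context = stream_context_create(['http' => $http]);
    return file_get_contents($url, false, $context);
}
```

Usage would be along the lines of `echo fetch_with_cookies($url, cookie_header_from($header));`. If the page still comes back wrong with the cookies attached, the cookies are probably not the cause.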

A: 

Install Fiddler and see what is actually being sent.

You can also try setting your user-agent to a real browser. Sometimes sites try to prevent scraping by checking this.
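For the `file_get_contents()` version, the user agent can be set through a stream context. This is a minimal sketch rather than a guaranteed fix: `browser_context_options()` is a hypothetical helper name, and the user-agent string is a placeholder, not one the site is known to accept.

```php
<?php
// Hypothetical helper: stream-context options that send a browser-like
// User-Agent. 'user_agent' and 'header' are standard "http" context options.
function browser_context_options(string $userAgent): array
{
    return [
        'http' => [
            'method'     => 'GET',
            'user_agent' => $userAgent,
            'header'     => "Accept: text/html\r\n",
        ],
    ];
}

// Placeholder user-agent string; substitute the one your own browser sends.
$options = browser_context_options('Mozilla/5.0 (X11; Linux x86_64)');
$context = stream_context_create($options);

// Then pass the context as the third argument, for example:
// echo file_get_contents($url, false, $context);
```

Note that `get_headers()` only accepts a context argument from PHP 7.1 onward, so on older versions the header check and the fetch may go out with different user agents.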

Byron Whitlock
I did that in my cURL code: I set a browser user agent and a referrer. And I'm on Linux and Mac... no Windows :\
Oscar Godson
I am sure there is a Mac equivalent out there. Doing this kind of work without being able to see the raw data going back and forth is working blind.
Byron Whitlock
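Without a proxy tool, PHP's cURL extension can show at least the headers going in both directions. A rough sketch, assuming the cURL extension is installed; `dump_exchange()` is a hypothetical name:

```php
<?php
// Hypothetical helper: perform a GET with cURL and return the request
// headers that were sent plus the raw response (headers and body).
function dump_exchange(string $url): array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the response instead of printing it
    curl_setopt($ch, CURLOPT_HEADER, true);         // include response headers in the result
    curl_setopt($ch, CURLINFO_HEADER_OUT, true);    // record the request headers cURL sends
    $response = curl_exec($ch);
    $request  = curl_getinfo($ch, CURLINFO_HEADER_OUT);
    curl_close($ch);
    return ['request' => $request, 'response' => $response];
}

// Example use:
// print_r(dump_exchange($url));
```

Comparing the recorded request headers with what a browser sends for the same URL should show which header the site is reacting to.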