Parsing HTML / JS code to get info using PHP

www.asos.com/Asos/Little-Asos-Union-Jack-T-Shirt/Prod/pgeproduct.aspx?iid=1273626

Take a look at this page; it's a clothes shop for kids. This is one of their items, and I want to point out the size section. What we need to do is get all the sizes for this item and check whether each size is available. Right now the sizes for this item are:

3-4 years
4-5 years
5-6 years
7-8 years

How can you tell whether the sizes are available or not?

Now take a look at this page and check the sizes again:

www.asos.com/Ralph-Lauren/Ralph-Lauren-Long-Sleeve-Big-Horse-Stripe-Rugby-Top/Prod/pgeproduct.aspx?iid=1111751

This item has the following sizes:

12 months
18 months - Not Available
24 months

As you can see, the 18 months size is not available; this is indicated by the "Not Available" text next to the size.

What we need to do is go to the page of an item, get the sizes and check the availability of each size. How can I do this in PHP?

EDIT:

Added working code and a new problem to tackle.

Working code, but it needs more work:

<?php

function getProductVariations($url) {

  //Use CURL to get the raw HTML for the page
  $ch = curl_init();
  curl_setopt_array($ch,
    array(
      CURLOPT_RETURNTRANSFER=>true,
      CURLOPT_HEADER => false,
      CURLOPT_URL => $url
    )
  );
  $raw_html = curl_exec($ch);

  //If we get an invalid response back from the server, fail
  if ($raw_html===false) {
    //Grab the error and release the handle before throwing
    $error = curl_error($ch);
    curl_close($ch);
    throw new Exception($error);
  }

  curl_close($ch);

  //Find the variation JS declarations and extract them
  $raw_variations = preg_match_all("/arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct\[[0-9]+\].*Array\((.*)\);/",$raw_html,$raw_matches);

  //We are done with the Raw HTML now
  unset($raw_html);

  //Check that we got some results back
  if (is_array($raw_matches) && isset($raw_matches[1]) && sizeof($raw_matches[1])==$raw_variations && $raw_variations>0) {

    //This is where the matches will go
    $matches = array();

    //Go through the results of the bracketed expression and convert them to a PHP assoc array
    foreach($raw_matches[1] as $match) {

      //As they are declared in javascript we can use json_decode to process them nicely, they just need wrapping
      $proc=json_decode("[$match]");

      //Label the fields as best we can
      $proc2=array(
        "variation_id"=>$proc[0],
        "size_desc"=>$proc[1],
        "colour_desc"=>$proc[2],
        "available"=>(trim(strtolower($proc[3]))=="true"),
        "unknown_col1"=>$proc[4],
        "price"=>$proc[5],
        "unknown_col2"=>$proc[6],       /*Always seems to be zero*/
        "currency"=>$proc[7],
        "unknown_col3"=>$proc[8],
        "unknown_col4"=>$proc[9],       /*Negative price*/
        "unknown_col5"=>$proc[10],      /*Always seems to be zero*/
        "unknown_col6"=>$proc[11]       /*Always seems to be zero*/
      );

      //Push the processed variation onto the results array
      $matches[$proc[0]]=$proc2;

      //We are done with our proc2 array now (proc will be unset by the foreach loop)
      unset($proc2);
    }

    //Return the matches we have found
    return $matches;

  } else {
    throw new Exception("Unable to find any product variations");

  }
}


//EXAMPLE USAGE
try {
  $variations = getProductVariations("http://www.asos.com/Asos/Prod/pgeproduct.aspx?iid=803846");

  //Do something more useful here
  print_r($variations);


} catch(Exception $e) {
  echo "Error: " . $e->getMessage();
}

?>

The above code works, but there's a problem when the product needs you to select a colour first before the sizes are displayed.

Like this one:

http://www.asos.com/Little-Joules/Little-Joules-Stewart-Venus-Fly-Trap-T-Shirt/Prod/pgeproduct.aspx?iid=1171006

Any idea how to go about this?
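One possible direction, sketched under the assumption (suggested in the comments below) that the page still lists every colour/size combination in the same JavaScript array: group the rows returned by getProductVariations() by colour. The helper name groupSizesByColour and the sample rows are made up for illustration; only the field names come from the code above.

```php
<?php
// Hypothetical helper: turn rows shaped like the getProductVariations()
// output into a colour => (size => available) map.
function groupSizesByColour(array $variations): array
{
    $byColour = [];
    foreach ($variations as $variation) {
        $colour = $variation['colour_desc'];
        $size   = $variation['size_desc'];
        $byColour[$colour][$size] = (bool) $variation['available'];
    }
    return $byColour;
}

// Made-up sample rows, not real product data.
$sample = [
    101 => ['variation_id' => 101, 'size_desc' => '3-4 years', 'colour_desc' => 'Red',  'available' => true],
    102 => ['variation_id' => 102, 'size_desc' => '4-5 years', 'colour_desc' => 'Red',  'available' => false],
    103 => ['variation_id' => 103, 'size_desc' => '3-4 years', 'colour_desc' => 'Blue', 'available' => true],
];

$grouped = groupSizesByColour($sample);
```

With this shape, picking a colour first is just a lookup such as $grouped['Red'], which then gives the sizes and their availability for that colour.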

+1  A: 

The simplest way to fetch the content of a URL is to rely on fopen wrappers and just use file_get_contents with the URL. You can use the tidy extension to parse the HTML and extract content. http://php.net/tidy
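A minimal sketch of the fetching step, assuming allow_url_fopen is enabled so that file_get_contents can open URLs. The function name fetchPage and the timeout value are illustrative choices, not part of the answer.

```php
<?php
// Hypothetical helper: fetch a page body through the fopen wrappers.
function fetchPage(string $url, int $timeoutSeconds = 10): string
{
    // The 'http' options only apply when $url uses the http(s) wrapper.
    $context = stream_context_create([
        'http' => ['timeout' => $timeoutSeconds],
    ]);

    // Suppress the warning and report the failure as an exception instead.
    $html = @file_get_contents($url, false, $context);
    if ($html === false) {
        throw new RuntimeException("Unable to fetch $url");
    }
    return $html;
}
```

The returned string can then be handed to whichever parser you prefer.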

Raoul Duke
+1  A: 

You can download the file using fopen() or file_get_contents(), as Raoul Duke said, but if you have experience with the JavaScript DOM model, the DOM extension might be a bit easier to use than Tidy.

I know for a fact that the DOM extension is enabled by default in PHP, but I am a bit unsure whether Tidy is (the manual page only says it's "bundled", so I suspect that it might not be enabled).
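As a rough sketch of the DOM approach, the following reads the size dropdown with DOMDocument and DOMXPath. It assumes the option elements are actually present in the fetched markup; if the page fills the dropdown with JavaScript, the result is simply empty. The sample markup is modelled on the product page, and the function name getSizesFromSelect is made up for this example.

```php
<?php
// Sample markup modelled on the product page's size dropdown (assumed shape).
$html = <<<'HTML'
<select id="ctl00_ContentMainPage_ctlSeparateProduct_drpdwnSize">
<option value="-1">Select Size</option>
<option value="1164">12 months</option>
<option value="1165">18 months - Not Available</option>
<option value="1167">24 months</option>
</select>
HTML;

// Hypothetical helper: map option value => ['size' => ..., 'available' => ...].
function getSizesFromSelect(string $html, string $selectId): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parse warnings silently.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath  = new DOMXPath($doc);
    $suffix = ' - Not Available';
    $sizes  = [];

    // $selectId is assumed to contain no quote characters.
    foreach ($xpath->query("//select[@id='$selectId']/option") as $option) {
        $value = $option->getAttribute('value');
        if ($value === '-1') {
            continue; // skip the "Select Size" placeholder
        }
        $label     = trim($option->textContent);
        $available = substr($label, -strlen($suffix)) !== $suffix;
        $sizes[$value] = [
            'size'      => $available ? $label : substr($label, 0, -strlen($suffix)),
            'available' => $available,
        ];
    }
    return $sizes;
}

$sizes = getSizesFromSelect($html, 'ctl00_ContentMainPage_ctlSeparateProduct_drpdwnSize');
```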

Frxstrem
+3  A: 

SOLUTION:

    function curl($url){
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        $response = curl_exec($ch);
        // Close the handle before returning; code after a return never runs
        curl_close($ch);
        return $response;
    }

    $html = curl('http://www.asos.com/pgeproduct.aspx?iid=1111751');

    preg_match_all('/arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct\[(.*?)\] \= new Array\((.*?),\"(.*?)\",\"(.*?)\",\"(.*?)\"/is',$html,$bingo);

    // print_r() already prints the array; echoing its return value would append a stray "1"
    print_r($bingo);

Link: http://debconf11.com/stackoverflow.php

You are on your own now :)

EDIT2:

OK, we are close to a solution...

<script type="text/javascript">var arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct = new Array;
arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct[0] = new Array(1164,"12 months","SailingOrange","True","","59.00","0.00","£","","-59.00","0.00","0");
arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct[1] = new Array(1165,"18 months","SailingOrange","False","","59.00","0.00","£","","-59.00","0.00","0");
arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct[2] = new Array(1167,"24 months","SailingOrange","True","","59.00","0.00","£","","-59.00","0.00","0");
</script>

The sizes are not loaded via Ajax; instead, the array is stored in a JavaScript variable in the page source. You can parse this with PHP, and you can clearly see that the fourth field for 18 months is "False", which means it is not available.
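As an illustration, here is a small sketch that turns those declarations into a size => available map. It assumes the arguments of each new Array(...) call are JSON-compatible (as they are in the snippet above) and that the text is UTF-8, which json_decode requires. The function name parseSizeAvailability is made up for this example.

```php
<?php
// Script lines copied from the snippet above.
$script = <<<'JS'
arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct[0] = new Array(1164,"12 months","SailingOrange","True","","59.00","0.00","£","","-59.00","0.00","0");
arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct[1] = new Array(1165,"18 months","SailingOrange","False","","59.00","0.00","£","","-59.00","0.00","0");
arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct[2] = new Array(1167,"24 months","SailingOrange","True","","59.00","0.00","£","","-59.00","0.00","0");
JS;

// Hypothetical helper: map size description => availability flag.
function parseSizeAvailability(string $script): array
{
    $result = [];

    // Capture the argument list of every "arrSzeCol_...[n] = new Array(...);" line.
    preg_match_all('/arrSzeCol_\w+\[\d+\]\s*=\s*new Array\((.*?)\);/', $script, $matches);

    foreach ($matches[1] as $args) {
        // Wrapping the arguments in brackets makes them a JSON array.
        $fields = json_decode("[$args]");
        if (!is_array($fields) || count($fields) < 4) {
            continue; // skip anything that does not look like a variation row
        }
        // Field 1 is the size, field 3 the "True"/"False" availability string.
        $result[$fields[1]] = (strtolower($fields[3]) === 'true');
    }
    return $result;
}

$availability = parseSizeAvailability($script);
```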

EDIT:

These sizes are loaded via JavaScript, so you cannot parse them from the HTML because they are not there. I can extract only this...

<select name="drpdwnSize" id="ctl00_ContentMainPage_ctlSeparateProduct_drpdwnSize" onchange="drpdwnSizeChange(this, 'ctl00_ContentMainPage_ctlSeparateProduct', arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct);">
<option value="-1">Select Size</option>
</select>

You can inspect the page's JavaScript to check whether the sizes can be loaded based on the product id.


First you need: http://simplehtmldom.sourceforge.net/. Forget file_get_contents(); it is about 5 times slower than cURL.

You then parse this piece of code (the HTML element with id ctl00_ContentMainPage_ctlSeparateProduct_drpdwnSize):

        <select id="ctl00_ContentMainPage_ctlSeparateProduct_drpdwnSize" name="ctl00$ContentMainPage$ctlSeparateProduct$drpdwnSize" onchange="drpdwnSizeChange(this, 'ctl00_ContentMainPage_ctlSeparateProduct', arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct);">

        <option value="-1">Select Size</option><option value="1164">12 months</option><option value="1165">18 months - Not Available</option><option value="1167">24 months</option></select>

You can then use preg_match(), explode(), str_replace() and others to filter out the values you want. I can write it but I don't have time right now :)

Webarto
Suggested third party alternatives that actually use DOM instead of String Parsing: [phpQuery](http://code.google.com/p/phpquery/), [Zend_Dom](http://framework.zend.com/manual/en/zend.dom.html), [QueryPath](http://querypath.org/) and [FluentDom](http://www.fluentdom.org).
Gordon
I also found out that the size selection is populated by JavaScript. I'm more lost now on what to do. What does "sniff JS" mean?
GoDesigner
It means checking which server-side script the sizes are loaded from. I tried to find it, but the page is a mess. It has tons of JS, and I'm not sure all of it is needed. Please wait...
Webarto
Now you just have to get data from arrays.
Webarto
Hi Webarto! I was able to come up with code similar to yours using the cURL functions, but yours is much leaner. I've edited my original post and posted my own code. I've also added a new problem to tackle; maybe you could help and suggest a way to solve it? The code works, but there's a problem when the product needs you to select a colour first before the sizes are displayed, like this one: http://www.asos.com/Little-Joules/Little-Joules-Stewart-Venus-Fly-Trap-T-Shirt/Prod/pgeproduct.aspx?iid=1171006 Any idea how to go about this?
GoDesigner
http://www.debconf11.com/stackoverflow.php I just changed the product id; the principle is the same. You need to extract the data (size, color), then group it by color; that is pretty much it. I would write it for you but I really don't have time right now. Keep it simple :)
Webarto
Thank you, I've got it covered now! This would have taken me a lot of time to figure out without your help. God bless you Webarto.
GoDesigner
No problemo, glad to help, can you set this question as answered :P?
Webarto