views:

55

answers:

1

Good day,

I am retrieving information from various websites using cURL and various parsing techniques. I made the code so I can, if desired, add additional websites I scan information from.

The information retrieved is as follow : (Please note that the information may be inaccurate and may not reflect real price/name)

Array
(
    [website1.com] => Array
        (
            [0] => Array
                (
                    [0] => 60" BRAVIA LX900 Series 3D HDTV
                    [1] => website1.com
                    [2] => 5299.99
                )
            [1] => Array
                (
                    [0] => 52" BRAVIA LX900 Series 3D HDTV
                    [1] => website1.com
                    [2] => 4499.99
                )
            [2] => Array
                (
                    [0] => 46" BRAVIA LX900 Series 3D HDTV
                    [1] => website1.com
                    [2] => 3699.99
                )
            [3] => Array
                (
                    [0] => 40" BRAVIA LX900 Series 3D HDTV
                    [1] => website1.com
                    [2] => 2999.99
                )
        )
    [website2.com] => Array
        (
            [0] => Array
                (
                    [0] => Sony 3D 60" LX900 HDTV BRAVIA
                    [1] => website2.com
                    [2] => 5400.99
                )
            [1] => Array
                (
                    [0] => Sony 3D 52" LX900 HDTV BRAVIA
                    [1] => website2.com
                    [2] => 4699.99
                )
            [2] => Array
                (
                    [0] => Sony 3D 46" LX900 HDTV BRAVIA
                    [1] => website2.com
                    [2] => 3899.99
                )
        )
)

The desired output must be :

Array
(
    [0] => Array
        (
            [Name] => 60" BRAVIA LX900 Series 3D HDTV
            [website1.com] => 5299.99
            [website2.com] => 5400.99
        )
    [1] => Array
        (
            [Name] => 52" BRAVIA LX900 Series 3D HDTV
            [website1.com] => 4499.99
            [website2.com] => 4699.99
        )
    [2] => Array
        (
            [Name] => 46" BRAVIA LX900 Series 3D HDTV
            [website1.com] => 3699.99
            [website2.com] => 3899.99
        )
    [3] => Array
        (
            [Name] => 40" BRAVIA LX900 Series 3D HDTV
            [website1.com] => 2999.99
        )
)

Please note that the name may vary so using similar_text is required. Also, some information may not be shown in all websites. I am aware that only one television name must be choosed, then I will use the one from the most relevent source (website1.com)

Here is the codes I'm trying to make work.

<?php
    $_Retreived = array(
        "website1.com" => array(
            array('60" BRAVIA LX900 Series 3D HDTV', 'website1.com', 5299.99),
            array('52" BRAVIA LX900 Series 3D HDTV', 'website1.com', 4499.99),
            array('46" BRAVIA LX900 Series 3D HDTV', 'website1.com', 3699.99),
            array('40" BRAVIA LX900 Series 3D HDTV', 'website1.com', 2999.99)
        ),
        "website2.com" => array(
            array('Sony 3D 60" LX900 HDTV BRAVIA', 'website2.com', 5400.99),
            array('Sony 3D 52" LX900 HDTV BRAVIA', 'website2.com', 4699.99),
            array('Sony 3D 46" LX900 HDTV BRAVIA', 'website2.com', 3899.99),
        )
    );

    $_Prices = array();
    $_PricesTemp = array();
    $_Sites = array("website1.com", "website2.com");

    for($i = 0; $i < sizeOf($_Sites); $i++)
    {
        $_PricesTemp = array_merge($_PricesTemp, $_Retreived[ $_Sites[$i] ]);
    }

    /*
        print_r($_PricesTemp);

        Array
        (
            [0] => Array
                (
                    [0] => 60" BRAVIA LX900 Series 3D HDTV
                    [1] => website1.com
                    [2] => 5299.99
                )
            [1] => Array
                (
                    [0] => 52" BRAVIA LX900 Series 3D HDTV
                    [1] => website1.com
                    [2] => 4499.99
                )
            [2] => Array
                (
                    [0] => 46" BRAVIA LX900 Series 3D HDTV
                    [1] => website1.com
                    [2] => 3699.99
                )
            [3] => Array
                (
                    [0] => 40" BRAVIA LX900 Series 3D HDTV
                    [1] => website1.com
                    [2] => 2999.99
                )
            [4] => Array
                (
                    [0] => Sony 3D 60" LX900 HDTV BRAVIA
                    [1] => website2.com
                    [2] => 5400.99
                )
            [5] => Array
                (
                    [0] => Sony 3D 52" LX900 HDTV BRAVIA
                    [1] => website2.com
                    [2] => 4699.99
                )
            [6] => Array
                (
                    [0] => Sony 3D 46" LX900 HDTV BRAVIA
                    [1] => website2.com
                    [2] => 3899.99
                )
        )
    */

    foreach($_PricesTemp As $_KeyOne => $_EntryOne)
    {
        foreach(array_reverse($_PricesTemp, true) As $_KeyTwo => $_EntryTwo)
        {
            if ($_KeyOne != $_KeyTwo)
            {
                $_Percent = 0;

                similar_text(strtoupper($_EntryOne[0]), strtoupper($_EntryTwo[0]), $_Percent);

                if ($_Percent >= 90) //If names matches 90%+
                {
                    echo "Similar : <b>" . $_KeyOne . "</b> " . $_EntryOne[0] . " and <b>" . $_KeyTwo . "</b> " . $_EntryTwo[0] . " Percent : " . $_Percent . "<br />";

                    $_Prices[] = array();
                    $_Prices[ sizeOf($_Prices)-1 ]['Name'] = $_EntryOne[0]; //Use the product name of the most revelant website (website1.com)

                    foreach($_Sites As $_Site)
                    {
                        if (isset($_EntryOne[ 1 ]) && $_EntryOne[ 1 ] == $_Site) //Check if it contains price from website1.com
                        {
                            $_Prices[ sizeOf($_Prices)-1 ][ $_Site ] = $_EntryOne[ 2 ];
                        }
                        if (isset($_EntryTwo[ 1 ]) && $_EntryTwo[ 1 ] == $_Site) //Check if it contains price from website2.com
                        {
                            $_Prices[ sizeOf($_Prices)-1 ][ $_Site ] = $_EntryTwo[ 2 ];
                        }
                    }
                }
            }
        }
    }

    /*
        print_r($_Prices);

        Array
        (
            [0] => Array
                (
                    [Name] => 60" BRAVIA LX900 Series 3D HDTV
                    [website1.com] => 2999.99
                )
            [1] => Array
                (
                    [Name] => 60" BRAVIA LX900 Series 3D HDTV
                    [website1.com] => 3699.99
                )
            [2] => Array
                (
                    [Name] => 60" BRAVIA LX900 Series 3D HDTV
                    [website1.com] => 4499.99
                )
            [3] => Array
                (
                    [Name] => 52" BRAVIA LX900 Series 3D HDTV
                    [website1.com] => 2999.99
                )
            [4] => Array
                (
                    [Name] => 52" BRAVIA LX900 Series 3D HDTV
                    [website1.com] => 3699.99
                )
            [5] => Array
                (
                    [Name] => 52" BRAVIA LX900 Series 3D HDTV
                    [website1.com] => 5299.99
                )
            [6] => Array
                (
                    [Name] => 46" BRAVIA LX900 Series 3D HDTV
                    [website1.com] => 2999.99
                )
            [7] => Array
                (
                    [Name] => 46" BRAVIA LX900 Series 3D HDTV
                    [website1.com] => 4499.99
                )
            [8] => Array
                (
                    [Name] => 46" BRAVIA LX900 Series 3D HDTV
                    [website1.com] => 5299.99
                )
            [9] => Array
                (
                    [Name] => 40" BRAVIA LX900 Series 3D HDTV
                    [website1.com] => 3699.99
                )
            [10] => Array
                (
                    [Name] => 40" BRAVIA LX900 Series 3D HDTV
                    [website1.com] => 4499.99
                )
            [11] => Array
                (
                    [Name] => 40" BRAVIA LX900 Series 3D HDTV
                    [website1.com] => 5299.99
                )
            [12] => Array
                (
                    [Name] => Sony 3D 60" LX900 HDTV BRAVIA
                    [website2.com] => 3899.99
                )
            [13] => Array
                (
                    [Name] => Sony 3D 60" LX900 HDTV BRAVIA
                    [website2.com] => 4699.99
                )
            [14] => Array
                (
                    [Name] => Sony 3D 52" LX900 HDTV BRAVIA
                    [website2.com] => 3899.99
                )
            [15] => Array
                (
                    [Name] => Sony 3D 52" LX900 HDTV BRAVIA
                    [website2.com] => 5400.99
                )
            [16] => Array
                (
                    [Name] => Sony 3D 46" LX900 HDTV BRAVIA
                    [website2.com] => 4699.99
                )
            [17] => Array
                (
                    [Name] => Sony 3D 46" LX900 HDTV BRAVIA
                    [website2.com] => 5400.99
                )
        )
    */
?>

First of all, the code above isn't working. There must be a logical error somewhere that I can't put my finger on. Also, I don't believe the code will work in case I add a third website to the listing.

Any ideas guys? I have been on this since this morning.

A: 

A high level overview should be like this:

  • create the end result array $items
  • loop through all the found items in all the websites
  • for each one, check if it's similar enough to any of the existing item names in $items
  • if yes, then add the price to that key, if no then create a new one and add it there

Instead of similar_text() you should consider using levenshtein() which is similar in practice but quite faster.

Here's some (untested, on the spot) code:

$levThreshold = 3 ;

$_Prices = array() ;
foreach ($_Retreived as $website => $websiteItems) {
    $currName = $websiteItems[0] ;
    $currWebsite = $websiteItems[1] ;
    $currPrice = $websiteItems[2] ;

    $foundItemKey = false ;

    //check current price structure. Get $priceData by reference
    //so we can modify it in the loop and keep the changed instead 
    //of the loop copy.
    foreach ($_Prices as &$priceData) {

        if (isset($priceData[$website])) {
            //already done this
            continue ;
        }

        //check if this is the item name we are looping over
        $lev = levenshtein($priceData['Name'], $currName) ;

        if ($lev < $levThreshold) {
            //item exists, add price and break
            $priceData[$website] = $currPrice ;
            $foundItemKey = true ;
            break ;
        }

    }

    //if we haven't found the item key, create a new one
    if (!$foundItemKey) {
        $newItem = array() ;
        $newItem['Name'] = $currName ;
        $newItem[$website] = $currPrice ; 
        $_Prices[] = $newItem ;
    }

}

$levThreshold is the minimum number of characters that must be different between 2 strings for them to be considered different. You can tweak that accordingly.

Fanis
Hi Fanis, I know you answered without testing. But your code isn't working. No error but there must be a logical issue somewhere. I'm trying to fix it but with no success really. Thank you about the levenshtein I wasnt aware of it's existence.
Cybrix
Okay, I'll take a look tomorrow morning.
Fanis
Hi, after looking at your code, it seems that the second foreach isnt working.
Cybrix