I'm building a basic screen scraper for personal use and learning purposes, so please do not post comments like "You need to ask permission" etc.

The data I'm trying to access is structured as follows:

<tr>
    <td>
        <div class="wrapper">
            <div class="randomDiv">
                <div class="divContent">
                    <div class="event">asd</div>
                    <div class="date">asd</div>
                    <div class="venue">asd</div>
                    <div class="state">asd</div>
                </div>
            </div>
        </div>
    </td>
</tr>

I'm attempting to gather all this data (as there are about 20 rows on the given page).

Using the following code I have managed to gather the data I need:

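// Fetch the page; "linktoURL" is a placeholder for the real address
$remote = file_get_contents("linktoURL");

$doc = new DOMDocument();
$doc->preserveWhiteSpace = false;
// Real-world HTML is rarely valid, so suppress the parser warnings
@$doc->loadHTML($remote);
$xp = new DOMXPath($doc);

// Will eventually hold one associative array per table row
$rows = array();

// Print the text content of every element whose class contains "wrapper"
foreach ($xp->query("//*[contains(@class, 'wrapper')]") as $found) {
    echo "<pre>";
    print_r($found->nodeValue);
    echo "</pre>";
}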

Now my question is, how would I go about storing all this data into an associative array like below:

Array (
    [0] => Array
        (
            [Event] => Name
            [Date] => 12/12/12
            [Venue] => NameOfPlace
            [state] => state
        )

    [1] => Array
        (
            [Event] => Name
            [Date] => 12/12/12
            [Venue] => NameOfPlace
            [state] => state
        )

    [2] => Array
        (
            [Event] => Name
            [Date] => 12/12/12
            [Venue] => NameOfPlace
            [state] => state
        )

)

Right now, the only solution that comes to mind is to run a separate XPath query for each class name, //*[contains(@class, 'className')], inside the foreach loop.

Is there a more idiomatic way, via DOMDocument and XPath, to build an associative array from the above data?

edit:

I'm not limited to DOMDocument and XPath; if there are other solutions that might be easier, please post them.

A: 

You can import some functionality into DOMXPath by registering PHP functions (DOMXPath::registerPhpFunctions()), but AFAIK you're limited to returning scalars or nodesets, so a function cannot hand back the associative array directly.
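
As a rough illustration of what a registered function can do, here is a minimal sketch that uses one as a boolean filter inside a predicate. The helper has_class() and the sample markup are hypothetical and not part of the question; the helper matches whole class names, so "update" is not mistaken for "date" the way contains() would.

```php
<?php
// Hypothetical helper: true when $token is one of the space-separated class names.
function has_class($classAttr, $token)
{
    return in_array($token, preg_split('/\s+/', trim($classAttr)), true);
}

// Illustrative markup only; the second div would also match contains(@class, 'date').
$doc = new DOMDocument();
$doc->loadHTML('<div><div class="date">12/12/12</div><div class="update">skip me</div></div>');

$xp = new DOMXPath($doc);
$xp->registerNamespace('php', 'http://php.net/xpath'); // namespace required for php:* calls
$xp->registerPhpFunctions('has_class');                // expose only this one function

// php:functionString passes its arguments to the callback as strings.
$matches = array();
foreach ($xp->query("//div[php:functionString('has_class', @class, 'date')]") as $node) {
    $matches[] = $node->nodeValue;
}

print_r($matches);
```

The function only decides which nodes are selected; the array itself still has to be assembled in PHP afterwards.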

You could transform it with a simple stylesheet, using XSLTProcessor::transformToDoc(), possibly exporting the result to SimpleXML for easier access. The question is whether it is any faster than searching for every class manually.
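
A minimal sketch of that stylesheet route, assuming PHP's xsl extension is loaded and that each field div carries exactly one class name (it is reused as an element name, so a multi-class attribute would break it). The stylesheet, the rows/row element names and the sample markup are made up for illustration.

```php
<?php
// Illustrative markup mirroring the structure in the question.
$html = '<table><tr><td><div class="wrapper"><div class="randomDiv"><div class="divContent">'
      . '<div class="event">Name</div><div class="date">12/12/12</div>'
      . '<div class="venue">NameOfPlace</div><div class="state">state</div>'
      . '</div></div></div></td></tr></table>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

// Turn every divContent block into <row>, and every child div into an
// element named after its class attribute.
$xslSource = <<<'XSL'
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <rows>
      <xsl:for-each select="//div[contains(@class, 'divContent')]">
        <row>
          <xsl:for-each select="div[@class]">
            <xsl:element name="{@class}">
              <xsl:value-of select="normalize-space(.)"/>
            </xsl:element>
          </xsl:for-each>
        </row>
      </xsl:for-each>
    </rows>
  </xsl:template>
</xsl:stylesheet>
XSL;

$xsl = new DOMDocument();
$xsl->loadXML($xslSource);

$proc = new XSLTProcessor();
$proc->importStylesheet($xsl);
$result = $proc->transformToDoc($doc);

// Export to SimpleXML and copy each row's children into a plain array.
$rows = array();
foreach (simplexml_import_dom($result)->row as $row) {
    $fields = array();
    foreach ($row->children() as $name => $value) {
        $fields[$name] = (string) $value;
    }
    $rows[] = $fields;
}

print_r($rows);
```

Whether this beats plain XPath queries is exactly the open question above; it mainly moves the selection logic out of PHP and into the stylesheet.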

You can of course shorten your XPath usage by using //div[contains(@class, 'event') or contains(@class, 'date')] etc.
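
To get from there to the array in the question, one possible sketch is to query the divContent containers once and then run a relative query inside each of them, keying every field by its class name. This assumes the markup looks like the sample in the question and that each field div has a single class; the sample data and variable names are illustrative.

```php
<?php
// Illustrative markup mirroring the structure in the question, repeated for two rows.
$rowHtml = '<tr><td><div class="wrapper"><div class="randomDiv"><div class="divContent">'
         . '<div class="event">Name</div><div class="date">12/12/12</div>'
         . '<div class="venue">NameOfPlace</div><div class="state">state</div>'
         . '</div></div></div></td></tr>';
$html = '<table>' . str_repeat($rowHtml, 2) . '</table>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xp = new DOMXPath($doc);
$rows = array();

// One container per table row.
foreach ($xp->query("//div[contains(@class, 'divContent')]") as $container) {
    $row = array();
    // The second argument makes this query relative to the current container.
    foreach ($xp->query("./div[@class]", $container) as $field) {
        $row[$field->getAttribute('class')] = trim($field->nodeValue);
    }
    $rows[] = $row;
}

print_r($rows);
```

The keys come straight from the class attributes, so they are lowercase (event, date, venue, state); apply ucfirst() to the key if the capitalised form is wanted.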

Wrikken
Thanks for the information. Have not had much time to work on it, hopefully tonight will change that =)
Russell Dias