ansaurus

Question

PHP external page

Answer 1

+1 A:

SimpleHTMLDOM will make this very easy for you.

The first few lines would look something like this (untested):

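// Load the library (path assumed; adjust to where simple_html_dom.php lives)
include 'simple_html_dom.php';

// Create DOM from URL or file
$html = file_get_html('http://www.atpworldtour.com/Rankings/Singles.aspx');

// Find all rows of the ranking table except the header row
foreach ($html->find('table[id=bioTableAlt] tr[class!=bioTableHead]') as $element) {
    // $element is one table row; plaintext holds its text without markup
    echo $element->plaintext, PHP_EOL;
}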

(I'm not sure about the tr[class!=bioTableHead] selector; if it doesn't work, try a plain tr.)

Pekka 2010-08-09 13:30:18

Will try, actually I want only text and no images.

Happy 2010-08-09 13:32:06

Suggested third party alternatives that actually use DOM instead of String Parsing: [phpQuery](http://code.google.com/p/phpquery/), [Zend_Dom](http://framework.zend.com/manual/en/zend.dom.html), [QueryPath](http://querypath.org/) and [FluentDom](http://www.fluentdom.org).

Gordon 2010-08-09 13:32:39

@Gordon you totally have a point, as always. Haven't looked at phpQuery before, that one looks like it could become my new favourite :)

Pekka 2010-08-09 13:34:30

@Pekka, please tell me how to select a specific <td> with SimpleHTMLDOM, like td:nth-child(1) a {} in CSS.

Happy 2010-08-09 13:37:05

@Ignatz see http://simplehtmldom.sourceforge.net/manual.htm "How to traverse the DOM tree?"

Pekka 2010-08-09 13:39:53
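
For comparison, a selection like td:nth-child(1) a can also be sketched with PHP's native DOMXPath instead of SimpleHTMLDOM. This is only a minimal sketch: the helper name linksInCell() and the sample markup are made up for illustration, and the real page's structure may differ.

```php
<?php
// Hypothetical helper: text of every link inside the N-th cell of each row.
function linksInCell(string $html, int $n): array
{
    libxml_use_internal_errors(true);   // collect parse warnings silently
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $texts = [];
    // *[N][self::td] is the N-th element child of a row, kept only if it is
    // a <td>. That mirrors CSS td:nth-child(N); header cells (<th>) are skipped.
    foreach ($xpath->query("//tr/*[{$n}][self::td]//a") as $link) {
        $texts[] = trim($link->nodeValue);
    }
    return $texts;
}

// Made-up sample markup standing in for the ranking table.
$sample = '<table>'
        . '<tr><th>Player</th><th>Points</th></tr>'
        . '<tr><td><a href="/a">Player A</a></td><td>1000</td></tr>'
        . '<tr><td><a href="/b">Player B</a></td><td>900</td></tr>'
        . '</table>';

$names = linksInCell($sample, 1);
```

Calling linksInCell($sample, 2) returns an empty array here, because the second cell of each sample row contains no link.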

Answer 2

+1 A:

Below is how to do it with PHP's native DOM extension. It should get you halfway to where you want to go.

The page's HTML is far from valid, which makes loading it with DOM somewhat tricky. Normally you can use load() to load a page directly. Because the markup is so broken, I fetched the page into a string first and passed it to loadHTML() instead, since that method handles broken HTML better.

Also, there is only one table on that page: the ranking table. The scoreboards are loaded via Ajax once the page has loaded, so their HTML will not show up in the source when you fetch it with PHP. That means you can simply grab all TR elements and iterate over them.

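// Collect parse warnings internally instead of printing them
libxml_use_internal_errors(TRUE);
$dom = new DOMDocument;
$dom->loadHTML(
    file_get_contents('http://www.atpworldtour.com/Rankings/Singles.aspx'));
// Discard the warnings collected while parsing
libxml_clear_errors();

$rows = $dom->getElementsByTagName('tr');
foreach ($rows as $row) {
    foreach ($row->childNodes as $cell) {
        // childNodes also contains whitespace text nodes, hence the trim()
        echo trim($cell->nodeValue);
    }
}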

This outputs the contents of all table cells. It should be trivial to add them to an array and/or write them to a file.
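
As a rough sketch of that last step, the cells can be collected into one array per row and then written out as tab-separated lines. The helper name extractRows(), the sample markup, and the output file name are assumptions for illustration; with the real page you would pass in the string returned by file_get_contents().

```php
<?php
// Hypothetical helper: returns one array of trimmed cell texts per table row.
function extractRows(string $html): array
{
    libxml_use_internal_errors(true);   // collect parse warnings silently
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();

    $table = [];
    foreach ($dom->getElementsByTagName('tr') as $row) {
        $cells = [];
        foreach ($row->childNodes as $cell) {
            // Keep only element children (<td>/<th>); skip whitespace text nodes.
            if ($cell instanceof DOMElement) {
                $cells[] = trim($cell->nodeValue);
            }
        }
        if ($cells) {
            $table[] = $cells;
        }
    }
    return $table;
}

// Made-up sample markup standing in for the downloaded page.
$sample = '<table>'
        . '<tr><th>Rank</th><th>Player</th><th>Points</th></tr>'
        . '<tr><td>1</td><td><a href="/a">Player A</a></td><td>1000</td></tr>'
        . '<tr><td>2</td><td><a href="/b">Player B</a></td><td>900</td></tr>'
        . '</table>';

$rows = extractRows($sample);

// Write one tab-separated line per row; the file name is an assumption.
$file = sys_get_temp_dir() . '/rankings.tsv';
$lines = [];
foreach ($rows as $cells) {
    $lines[] = implode("\t", $cells);
}
file_put_contents($file, implode(PHP_EOL, $lines) . PHP_EOL);
```

Filtering on DOMElement keeps the whitespace between cells out of the array, so each row holds exactly one entry per cell.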

Gordon 2010-08-09 14:21:44

Thanks for your time.

Happy 2010-08-09 14:34:48
