I want to scrape some HTML with simplehtmldom in PHP. I have a bunch of <tr> tags containing <td> tags. The <tr> tags I want alternate between bgcolor=#ffffff and bgcolor=#cccccc. There are some <tr> tags that have other bgcolors.

I want to get all the code in each <tr> tag that has either bgcolor=#ffffff or bgcolor=#cccccc. I can't just use $html->find('tr') as there are other <tr> tags that I don't want to find.

Any help would be appreciated.

A: 

You could load the DOM into a SimpleXML object and then use XPath, like so:

$xml = simplexml_import_dom($simple_html_dom);

$goodies = $xml->xpath('//*[@bgcolor = "#ffffff"] | //*[@bgcolor = "#cccccc"]');

You might even be able to put that OR syntax within the same set of brackets, but I'd need to double-check.
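For illustration, here is a minimal sketch of the XPath route using only PHP's built-in DOM and SimpleXML extensions. The sample markup and the variable names are assumptions made up for the example, not taken from the page being scraped. It also uses the `or` form inside a single predicate, narrowed to <tr> elements.

```php
<?php
// Sample markup standing in for the page being scraped (an assumption).
$markup = <<<HTML
<table>
  <tr bgcolor="#000000"><td>header</td></tr>
  <tr bgcolor="#ffffff"><td>first</td></tr>
  <tr bgcolor="#cccccc"><td>second</td></tr>
  <tr bgcolor="#eeeeee"><td>skip me</td></tr>
</table>
HTML;

$dom = new DOMDocument();
// Real-world HTML is rarely valid, so keep parser warnings out of the output.
libxml_use_internal_errors(true);
$dom->loadHTML($markup);
libxml_clear_errors();

// simplexml_import_dom() expects a DOM node, so the HTML goes through DOMDocument first.
$xml = simplexml_import_dom($dom);

// One predicate with "or" selects rows with either background colour.
$goodies = $xml->xpath('//tr[@bgcolor = "#ffffff" or @bgcolor = "#cccccc"]');

foreach ($goodies as $row) {
    // asXML() returns the markup of the row, including its <td> children.
    echo $row->asXML(), "\n";
}
```

Each entry in $goodies is a SimpleXMLElement, so the cells can also be read directly, for example (string) $row->td for the first cell.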


Update:

Sorry, I thought you were talking about the DOM extension. I just looked up simplehtmldom, and it appears that its find feature is loosely based on XPath. Why not just do:

$goodies = $html->find('[bgcolor=#ffffff], [bgcolor=#cccccc]');

Anthony
I don't understand. What is $simple_html_dom? When do I call the find method and what do I pass in?
$simple_html_dom would be the variable you had your simplehtmldom object set to, so whatever you were calling the find method on originally. But now that I'm looking at the extension, I'm unsure if it uses the DOM extension as the foundation and thus if my first answer would apply. And you wouldn't call the find method in my original answer; the xpath method does the finding. It passes all the results of the XPath query to the $goodies variable, which you could then traverse and import each result back as XML or HTML (which I didn't mention, sorry). But I think ...
Anthony
my second, more informed suggestion should do the trick, unless I'm misunderstanding how simplehtmldom works or what you are looking to do with it.
Anthony
I am still not getting what I want. Is it possible to get the data between two tags, i.e. <tr> </tr>? How would I do that?
Do you mean it's not returning the descendants of the <tr> nodes? Try a quick experiment. Create an HTML file called test.html, and inside of it put <ul id="mainlist"><li>stuff</li><li><ul id="sub_list"><li>sub-stuff</li><li>more-sub-stuff</li></ul></li></ul> and do: $html->find('ul[id=mainlist] li'); It should catch the first li with no problem, but if it's not set up to include children in the find results, then it won't show you the contents of the second li, which is another ul. If that's the case, I'll tell you what else I find (still reading up on it).
Anthony
Here is my code: <?php include_once 'simple_html_dom.php'; $url = "test.html"; $html = file_get_html($url); foreach ($html->find('ul[id=mainlist] li') as $li) { echo $li->plaintext . "<br /> \n"; } ?> Here is what I get: stuff / sub-stuffmore-sub-stuff / sub-stuff / more-sub-stuff
So it is returning the children in my example. Real quick, is all of the data you want inside <td> tags? Could you just use $html->find('[bgcolor=#ffffff] td')?
Anthony
And while I think this is a neat add-on that you are using, you may want to consider looking into the DOM extension in PHP, or at how to use DOM with SimpleXML. You would get the results you wanted even if the syntax wasn't as clean. And do you want the data in the <tr> tags, or the HTML?
Anthony
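To make the "data or HTML" distinction concrete, here is a rough sketch using the built-in DOM extension rather than simplehtmldom. The sample rows and the $results structure are assumptions for illustration only. For each matching <tr> it collects both the plain text and the markup between the opening and closing tags.

```php
<?php
// Hypothetical rows; only the #ffffff and #cccccc ones should be picked up.
$markup = '<table>'
    . '<tr bgcolor="#ffffff"><td>name</td><td><b>Ann</b></td></tr>'
    . '<tr bgcolor="#ff0000"><td>ignored</td></tr>'
    . '<tr bgcolor="#cccccc"><td>name</td><td><b>Bob</b></td></tr>'
    . '</table>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // tolerate sloppy HTML without printing warnings
$dom->loadHTML($markup);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$rows = $xpath->query('//tr[@bgcolor = "#ffffff" or @bgcolor = "#cccccc"]');

$results = [];
foreach ($rows as $tr) {
    // Serialise each child node to get the HTML between <tr> and </tr>.
    $inner = '';
    foreach ($tr->childNodes as $child) {
        $inner .= $dom->saveHTML($child);
    }
    $results[] = [
        'text' => trim($tr->textContent), // data only, tags stripped
        'html' => $inner,                 // markup inside the row
    ];
}

print_r($results);
```

In simplehtmldom itself, the plaintext and innertext properties of an element play roughly the same two roles.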