I'm parsing an XHTML document using PHP's SimpleXML. I need to query a series of ul's in the document for a node containing a specific value, then find that node's parent's direct previous sibling... code will help explain!

Given the following dummy xhtml:

<html>
<head></head>
<body>
...

<ul class="attr-list"> 
    <li>Active Life (active)</li> 
    <ul> 
        <li>Amateur Sports Teams (amateursportsteams)</li> 
        <li>Amusement Parks (amusementparks)</li> 
        <li>Fitness & Instruction (fitness)</li> 
        <ul> 
            <li>Dance Studios (dancestudio)</li> 
            <li>Gyms (gyms)</li> 
            <li>Martial Arts (martialarts)</li> 
            <li>Pilates (pilates)</li> 
            <li>Swimming Lessons/Schools (swimminglessons)</li>  
        </ul> 
        <li>Go Karts (gokarts)</li> 
        <li>Mini Golf (mini_golf)</li> 
        <li>Parks (parks)</li> 
        <ul> 
            <li>Dog Parks (dog_parks)</li> 
            <li>Skate Parks (skate_parks)</li> 
        </ul> 
        <li>Playgrounds (playgrounds)</li> 
        <li>Rafting/Kayaking (rafting)</li> 
        <li>Tennis (tennis)</li> 
        <li>Zoos (zoos)</li> 
    </ul> 
    <li>Arts & Entertainment (arts)</li> 
    <ul> 
        <li>Arcades (arcades)</li> 
        <li>Art Galleries (galleries)</li> 
        <li>Wineries (wineries)</li> 
    </ul> 
    <li>Automotive (auto)</li> 
    <ul> 
        <li>Auto Detailing (auto_detailing)</li> 
        <li>Auto Glass Services (autoglass)</li> 
        <li>Auto Parts & Supplies (autopartssupplies)</li> 
    </ul>
    <li>Nightlife (nightlife)</li>
    <ul>
        <li>Bars (bars)</li>
        <ul>
            <li>Dive Bars (divebars)</li>
        </ul>
    </ul>
</ul>

...
</body>
</html>

I need to be able to query the ul.attr-list for a child element, and discover its "root" category. I cannot change the xhtml to be formed differently.

So, if I have "galleries" as a category, I need to know that it is in the "arts" "root" category. Or, if I have "dog_parks", I need to know that it is in the "active" category. The following code gets the job done, but only on the assumption that there are at most two nested levels:

function get_root_category($shortCategoryName){

    $url = "http://www.yelp.com/developers/documentation/category_list";
    $result = file_get_contents($url);

    $dom = new DOMDocument();
    // Parser options only apply when set before loading.
    $dom->preserveWhiteSpace = false;
    // The page is not valid HTML, so suppress the parser warnings.
    @$dom->loadHTML($result);

    $sxml = simplexml_import_dom($dom);

    // One level up: the <li> siblings preceding the <ul> that holds the match.
    $lvl1 = $sxml->xpath("//li[contains(., '".$shortCategoryName."')]/parent::ul/preceding-sibling::li");
    // Two levels up: the same step repeated once more.
    $lvl2 = $sxml->xpath("//li[contains(., '".$shortCategoryName."')]/parent::ul/preceding-sibling::li/parent::ul/preceding-sibling::li");

    // Prefer the higher level when it exists.
    if($lvl2){
        return array_pop($lvl2);
    } else {
        return array_pop($lvl1);
    }
}

There has to be a better way to write that XPath, so that only one query needs to be made, and is relatively bulletproof to multiple, nested levels. Any ideas?

EDIT: Thanks to those who pointed out that this HTML is not valid. However, the structure of the page is set, and I cannot edit it; I can only use it as a resource, and have to make do with what it is.

+1  A: 

How about:

/html/body/ul/ul[count(descendant::li[contains(.,'dog_parks')]) > 0]/preceding-sibling::li

This should work with deeply nested lists. It always gets the upper-most category.
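A minimal sketch of how this expression could be run from PHP, assuming a trimmed, well-formed copy of the sample list loaded with simplexml_load_string (the real page is structured differently, so the path would need adjusting there). The function name root_category is hypothetical. Two things are additions to the expression above: the [1] step, so that only the nearest preceding li is returned, and the parentheses around the short name, so that "parks" does not also match "dog_parks".

```php
<?php
// Trimmed copy of the sample list from the question (well-formed, so it loads as XML).
$xml = <<<XML
<html><body>
<ul class="attr-list">
    <li>Active Life (active)</li>
    <ul>
        <li>Parks (parks)</li>
        <ul>
            <li>Dog Parks (dog_parks)</li>
        </ul>
    </ul>
    <li>Arts and Entertainment (arts)</li>
    <ul>
        <li>Art Galleries (galleries)</li>
    </ul>
</ul>
</body></html>
XML;

$sxml = simplexml_load_string($xml);

// Hypothetical helper: returns the top-level category label, or null if nothing matches.
// Assumes $short contains no single quote (it is pasted into the XPath as-is).
function root_category(SimpleXMLElement $sxml, string $short): ?string
{
    // Pick the top-level <ul> that contains the match at any depth,
    // then step back to the <li> immediately before it.
    $xpath = "/html/body/ul/ul[count(descendant::li[contains(., '($short)')]) > 0]"
           . "/preceding-sibling::li[1]";
    $hits = $sxml->xpath($xpath);
    return $hits ? trim((string) $hits[0]) : null;
}

echo root_category($sxml, 'dog_parks'), "\n"; // Active Life (active)
echo root_category($sxml, 'galleries'), "\n"; // Arts and Entertainment (arts)
```

Because the predicate sits on the top-level ul, the nesting depth of the matching li does not matter.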

By the way: I don't think nesting ul's like this is valid.

Wikser
The HTML is indeed syntactically invalid. The nested `<ul>` needs to be inside the `<li>`. In other words, for each `<ul>` block, move the preceding `</li>` to directly after the `</ul>`.
BalusC
Yes, you are absolutely correct, but I cannot modify the HTML, since I'm pulling this from a pre-existing page.
Andrew
Thanks for the response, but when I try that xpath, it returns false. The actual html is at: http://www.yelp.com/developers/documentation/category_list. Does having the real html help?
Andrew
In the given page, the portion of interest lies deeper. The following should work: `//div[@id='container']/div/ul/li/ul[count(descendant::li[contains(.,'dog_par')]) > 0]/parent::li/preceding-sibling::li[1]`
Wikser
Thanks for the updated response, but I went with Tomalak's.
Andrew
I spoke too soon... but I did try your updated response, in the form of: $sxml->xpath("//div[@id='container']/div/ul/li/ul[count(descendant::li[contains(.,'".$shortCategoryName."')]) > 0]/parent::li/preceding-sibling::li[1]"), and it returns false. Any ideas from there?
Andrew
+1  A: 

I need to query a series of ul's in the document for a node containing a specific value, then find that node's parent's direct previous sibling...

That would be (here $v is the value you look for):

$p = "/html/body//ul[li[contains(text(), '$v')]]/preceding-sibling::li[1]";
  • Make sure that you check that $v does not contain single quotes, since this would break the XPath expression.
  • When you want to look for whole words only, use:
    [contains(concat(' ', text(), ' '), concat(' ', '$v', ' '))].
  • When you want to look case-insensitively, use (I abbreviated the full alphabet with …):
    [contains(translate(text(), 'ABC…XYZ', 'abc…xyz'), '{strtolower($v)}')].
  • Note that predicates can be nested.
  • Note that the use of text() ensures only direct child text nodes are taken into account. When you use . instead, the whole "subtree" of the <li> is converted to string and you might get more results than you actually want.
  • Note that I restricted the // operator (a shortcut for the descendant axis) to a certain part of the tree - if you can restrict it further, by all means do so.
    Letting your XPath start with // makes it much slower than it needs to be, since all nodes of the entire document are checked, even those that cannot under any circumstances produce a match.
  • As others have already noted, the HTML is invalid.
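The quoting and case-insensitivity notes above could be combined roughly as follows. This is only a sketch: xpath_literal and parent_category are hypothetical helper names, the inline sample is a trimmed copy of the question's list, and only ASCII letters are folded to lower case. Since XPath 1.0 has no escape character inside string literals, the helper switches quote styles or falls back to concat().

```php
<?php
// Hypothetical helper: turn an arbitrary PHP string into a valid XPath 1.0 string literal.
function xpath_literal(string $value): string
{
    if (strpos($value, "'") === false) {
        return "'" . $value . "'";
    }
    if (strpos($value, '"') === false) {
        return '"' . $value . '"';
    }
    // Both quote types present: concat('a', "'", 'b', ...)
    return "concat('" . implode("', \"'\", '", explode("'", $value)) . "')";
}

// Hypothetical helper: label of the <li> directly before the <ul> holding the match.
function parent_category(SimpleXMLElement $sxml, string $v): ?string
{
    $upper  = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';
    $lower  = 'abcdefghijklmnopqrstuvwxyz';
    $needle = xpath_literal(strtolower($v));
    $p = "/html/body//ul[li[contains(translate(text(), '$upper', '$lower'), $needle)]]"
       . "/preceding-sibling::li[1]";
    $hits = $sxml->xpath($p);
    return $hits ? trim((string) $hits[0]) : null;
}

// Trimmed, well-formed copy of the sample list.
$sxml = simplexml_load_string(
    '<html><body><ul class="attr-list">'
    . '<li>Active Life (active)</li><ul>'
    . '<li>Parks (parks)</li><ul><li>Skate Parks (skate_parks)</li></ul>'
    . '</ul></ul></body></html>'
);

echo parent_category($sxml, 'SKATE PARKS'), "\n"; // Parks (parks)
```

Like the expression above, this returns the direct parent category only, not the top-most one.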
Tomalak
Thanks, that worked perfectly, and the descriptions were especially helpful!
Andrew
I spoke too soon. This works perfectly for what you quoted me as, but doesn't work perfectly on the actual data. It doesn't return the top level category when nested two deep. For example, "Dive Bars (divebars)" returns "Bars (bars)" instead of "Nightlife (nightlife)".
Andrew
@Andrew: Your "dummy html" neither contains "Dive Bars" nor "Bars" nor "Nightlife". My expression selects "Parks (parks)" when `$v` is "Skate Parks", which exactly fits your written requirement. So - where is the error?
Tomalak
Terribly sorry for the delay. Just to close out this question: the dive bar example I gave was not included in the dummy html, and I'm very sorry for that. I've updated the html to reflect it. You definitely don't need to update your answer, as this project has been put on hold for now anyway. Thanks!
Andrew