I am parsing an HTML document with XPath and I want to keep all the inner HTML tags.

The HTML in question is an unordered list with many list elements.

<ul id="adPoint1"><li>Business</li><li>Contract</li></ul>

I am parsing the document using the following PHP code:

$dom = new DOMDocument();
@$dom->loadHTML($output);
$this->xpath = new DOMXPath($dom);
$testDom = $this->xpath->evaluate("//ul[@id='adPoint1']");
$test = $testDom->item(0)->nodeValue;
echo htmlentities($test);

For some reason the output always has the HTML tags stripped from it. I assume this is because XPath was not intended to be used this way, but is there any way around it?

I would really like to continue using XPATH as I already use it for parsing other areas of the page (single a href elements) without a problem.

EDIT: I know that there is a better way to get this data by iterating through the child elements of the UL. There is a more complicated part of the page that I also want to parse (a block of JavaScript), but I am trying to provide an easier-to-understand example.

The actual block of code that I want is:

<script language="javascript">document.write(rot_decode('<u7>Pbagnpg Qrgnvyf</u7><qy vq="pbagnpgQrgnvyf"><qg>Cu:</qg><qq>(58) 0078 8455</qq></qy>'));</script>

The problem here is that the output omits all the closing tags but keeps the opening tags. I'm guessing this is because XPath is trying to parse the inner elements rather than just treating the script body as a string.

If I try and select the script element with

$testDom = $this->xpath->evaluate("//div[@id='businessDetails']/script");
$test = $testDom->item(0)->nodeValue;
echo htmlentities($test);

my output is the following, which is missing all the closing tags:

document.write(rot_decode('<u7>Pbagnpg Qrgnvyf<qy vq="pbagnpgQrgnvyf"><qg>Cu:<qq>(58) 0078 8455'));
A: 

Yes, you are right: DOM parses the child elements (because they are elements, not strings), and the correct way to get data from child elements is to iterate through all of them. Implementing that would not be complicated, though.
You may want to try a different XPath expression as well. Instead of

//ul[@id='adPoint1']

try

//ul[@id='adPoint1']/li

which would select elements with actual string values.
If you give the expected result as well (for both the ul and the script), maybe you will get more answers.
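
If the goal is to keep the tags while staying with DOM and XPath, one option is to serialize each child node with DOMDocument::saveHTML($node) instead of reading nodeValue, which returns only the text content. The following is a minimal sketch under that assumption; innerHtml() is a hypothetical helper name, not part of the DOM extension, and passing a node to saveHTML() needs PHP 5.3.6 or later.

```php
<?php
// Sketch: build the "inner HTML" of a node by serializing its children.
// innerHtml() is a hypothetical helper, not a built-in DOM method.
function innerHtml(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // saveHTML($node) serializes a single node, tags included.
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

$output = '<ul id="adPoint1"><li>Business</li><li>Contract</li></ul>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parse warnings instead of using @
$dom->loadHTML($output);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$ul = $xpath->query("//ul[@id='adPoint1']")->item(0);

$inner = innerHtml($ul);
echo htmlentities($inner), "\n";
```

For the list example this should print the li elements with their tags intact. It only serializes what the parser kept, so it will not restore anything that was already dropped while the document was being loaded, which appears to be what happens in the script case.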

phunehehe
phunehehe, yes, you are correct, but I am looking for a solution that will maintain the tags within an element. I am really trying to get a string which contains the JavaScript code in its entirety.
m3mbran3
A: 

I decided XPath wasn't suited to what I wanted and am now using PHP Simple HTML DOM Parser, which is much better suited to the task.

It preserves the inner HTML formatting just fine.

foreach($this->simpleDom->find('script[language=javascript]') as $script) {
  echo htmlentities($script->innertext());
}
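
For anyone who would rather not add a library, a rough alternative for the script case is to pull the script bodies out of the raw markup before any DOM parsing, so the string literal is never interpreted as HTML. This is only a sketch and not what the code above does: extractScriptBodies() is a hypothetical helper, the sample markup is rebuilt from the question, and the pattern assumes the script body contains no literal </script>.

```php
<?php
// Sketch: read script bodies straight from the raw HTML string.
// extractScriptBodies() is a hypothetical helper name.
function extractScriptBodies(string $html): array
{
    // Non-greedy match from each opening <script ...> to the first </script>.
    preg_match_all('~<script\b[^>]*>(.*?)</script>~is', $html, $matches);
    return $matches[1];
}

// Sample input assembled from the markup quoted in the question.
$output = '<div id="businessDetails"><script language="javascript">'
    . 'document.write(rot_decode(\'<u7>Pbagnpg Qrgnvyf</u7>'
    . '<qy vq="pbagnpgQrgnvyf"><qg>Cu:</qg><qq>(58) 0078 8455</qq></qy>\'));'
    . '</script></div>';

$scripts = extractScriptBodies($output);
echo htmlentities($scripts[0]), "\n";
```

A regular expression is fragile for HTML in general, so this is best kept to narrow cases like this one where the target is a single well-delimited script block.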
m3mbran3