I would like to do a simple but non-trivial manipulation of DOM elements with PHP, but I am lost.

Assume a page like Wikipedia, where you have paragraphs and titles (<p>, <h2>). They are siblings. I would like to extract both kinds of element in the order they appear in the document.

I have tried getElementsByTagName(), but it returns each tag in its own list, so there is no way to keep the titles and paragraphs in their original order. I have tried DOMXPath->query(), but I found it really confusing.

Just parsing something like:

<html>
  <head></head>
  <body>
    <h2>Title1</h2>
    <p>Paragraph1</p>
    <p>Paragraph2</p>
    <h2>Title2</h2>
    <p>Paragraph3</p>
  </body>
</html>

into:

Title1
Paragraph1
Paragraph2
Title2
Paragraph3

There are also a few bits of HTML in between that I do not need.

Thank you. I hope the question does not look like homework.

+1  A: 

I think DOMXPath->query() is the right approach. This XPath expression will return all nodes that are either an <h2> or a <p> on the same level (since you said they were siblings).

/html/body/*[name() = 'p' or name() = 'h2']

The nodes will be returned as a node list in the right order (document order). You can then construct a foreach loop over the result.
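
As a minimal sketch of that loop, assuming the markup from the question is already available as a string (the variable names $html, $doc, $xpath and $lines are only illustrative):

```php
<?php
// Sample markup taken from the question.
$html = <<<'HTML'
<html>
  <head></head>
  <body>
    <h2>Title1</h2>
    <p>Paragraph1</p>
    <p>Paragraph2</p>
    <h2>Title2</h2>
    <p>Paragraph3</p>
  </body>
</html>
HTML;

// Real-world pages are rarely valid markup, so collect libxml
// warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Select every <p> and <h2> that is a direct child of <body>.
$xpath = new DOMXPath($doc);
$nodes = $xpath->query("/html/body/*[name() = 'p' or name() = 'h2']");

// The node list is in document order, so a plain foreach is enough.
$lines = [];
foreach ($nodes as $node) {
    $lines[] = trim($node->textContent);
}

echo implode("\n", $lines), "\n";
```

If the elements are nested inside wrapper elements rather than sitting directly under <body>, a descendant query such as //*[self::p or self::h2] is one possible variation; whether it fits depends on the structure of the real page.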

Tomalak
Exactly what I wanted. I had to include a few divs, but it worked perfectly. Thank you very much.
Sortea2
Glad to help.
Tomalak
+1  A: 

I have used Simple HTML DOM by S.C. Chen a few times.

It is a great class for accessing DOM elements.

Example:

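// Load the library (assumes simple_html_dom.php is on the include path)
include 'simple_html_dom.php';

// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');

// Find all images and print their source URLs
foreach ($html->find('img') as $element) {
    echo $element->src . '<br>';
}

// Find all links and print their targets
foreach ($html->find('a') as $element) {
    echo $element->href . '<br>';
}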

Check it out here: simplehtmldom

It may help with future projects.

Lee
A: 

Try having a look at this library and corresponding project:

Simple HTML DOM

This allows you to open an online web page or an HTML page from the filesystem and access its items via class names, tag names and IDs. If you are familiar with jQuery and its syntax, you will need little time to get used to this library.

Salman A