Hello! I'm using simplehtmldom to parse HTML, and I'm stuck on parsing plain text that sits outside any tag (but between two different tags):

<div class="text_small">
 <b>Аdress:</b> 7 Hange Road<br>    
 <b>Phone:</b> 415641587484<br>    
 <b>Contact:</b> Alex<br>    
 <b>Meeting Time:</b> 12:00-13:00<br>
</div>

Is it possible to get the values of Adress, Phone, Contact, and Meeting Time? I wonder whether there is a way to pass CSS selectors to the nextSibling/previousSibling functions...

foreach($html->find('div.text_small') as $div_descr)
{
    foreach($div_descr->find('b') as $b)
    {
        if ($b->innertext=="Аdress:") { //someaction
        }
        if ($b->innertext=="Phone:") { //someaction
        }
        if ($b->innertext=="Contact:") { //someaction
        }
        if ($b->innertext=="Meeting Time:") { //someaction
        }
    }
}

What should I use in place of "someaction"?

Upd. Right, I don't have access to edit the target page. If I did, would this even be worth asking? :)

A: 

If you can wrap the values that are not inside a tag in a span tag, maybe you can handle it then.

A <span> does nothing to the values until you give it some style.

nik
Unfortunately, I can't do that, because I don't have access to edit the target page :(
moogeek
Whoops!! Didn't know that.
nik
+1  A: 

There might be a much simpler solution (maybe using something other than simple_html_dom).

I haven't found a suitable selector, and nextSibling() only returns the next sibling element. (Which is a bit strange: simple_html_dom_node stores two arrays, $children and $nodes. Text nodes are in $nodes but not in $children, and next_sibling() operates on $children.)
But since $nodes is a public property of simple_html_dom_node, you can write an iterator yourself.

<?php
require_once 'simplehtmldom/simple_html_dom.php';
$html = str_get_html('<html><head><title>...</title></head><body>
  <div class="text_small">
    <b>Adress:</b> 9 Hange Road<br>    
    <b>Phone:</b> 999641587484<br>    
    <b>Contact:</b> Alex<br>    
    <b>Meeting Time:</b> 12:00-13:00<br>
  </div>
  <div class="text_small">
    <b>Adress:</b> 8 Hange Road<br>    
    <b>Phone:</b> 888641587484<br>    
    <b>Contact:</b> Bob<br>    
    <b>Meeting Time:</b> 13:00-14:00<br>
  </div>
</body></html>');

foreach($html->find('div.text_small') as $div) {
  $result = parseEntry($div);
  foreach($result as $r) {
    echo "'$r[name]' - '$r[text]'\n";
  }
  echo "========\n"; 
}

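function parseEntry(simple_html_dom_node $div) {
  $result = array();
  $current = null; // the entry being collected, or null between entries
  // Walk $nodes rather than $children, because only $nodes contains the text nodes.
  foreach($div->nodes as $node) {
    if ( HDOM_TYPE_ELEMENT===$node->nodetype) {
      // Any element ends the entry in progress.
      if ( !is_null($current) ) {
        $result[] = $current;
        $current = null;
      }
      // A <b> element starts a new entry; its text is the field name.
      if ('b'===$node->tag) {
        $current = array('name'=>$node->text(), 'text'=>'');
      }
    }
    else if (HDOM_TYPE_TEXT===$node->nodetype && !is_null($current)) {
      // Text that follows a <b> is appended to that entry's value.
      $current['text'] .= $node->text();
    }
  }
  // Flush the last entry if no element followed it.
  if ( !is_null($current) ) {
    $result[] = $current;
  }
  return $result;
}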

prints

'Adress:' - ' 9 Hange Road'
'Phone:' - ' 999641587484'
'Contact:' - ' Alex'
'Meeting Time:' - ' 12:00-13:00'
========
'Adress:' - ' 8 Hange Road'
'Phone:' - ' 888641587484'
'Contact:' - ' Bob'
'Meeting Time:' - ' 13:00-14:00'
========

Until someone else finds a simpler solution, you might want to use this as a starting point.
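
For comparison, one candidate for a route that avoids simple_html_dom is PHP's bundled DOM extension, where nextSibling does return text nodes. The following is only a minimal sketch, assuming the DOM extension is enabled and the markup looks like the sample in the question; the variable names ($doc, $xpath, $fields) and the exact class match in the XPath expression are illustrative assumptions, not something taken from the target page.

```php
<?php
// Sample markup modelled on the question (an assumption about the real page).
$html = '<div class="text_small">
 <b>Adress:</b> 7 Hange Road<br>
 <b>Phone:</b> 415641587484<br>
 <b>Contact:</b> Alex<br>
 <b>Meeting Time:</b> 12:00-13:00<br>
</div>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$fields = array();
// Matches only an exact class attribute; adjust if the div carries more classes.
foreach ($xpath->query('//div[@class="text_small"]/b') as $b) {
    // In the DOM extension, nextSibling includes text nodes.
    $next = $b->nextSibling;
    $value = '';
    if ($next !== null && $next->nodeType === XML_TEXT_NODE) {
        $value = trim($next->nodeValue);
    }
    // Use the label without its trailing colon as the key.
    $fields[rtrim(trim($b->textContent), ':')] = $value;
}

print_r($fields);
```

Only the text node directly after each <b> is read here, so a value split across several nodes would need the loop from parseEntry() instead.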

VolkerK
Thanks! It works! But if I have several "div.text_small" containers to parse, I always get the values from the last one! :(
moogeek
Modified for multiple divs.
VolkerK
Thanks! Can you also advise how I can collect all of these results into one specific array or JSON data like: {"1":"{"Adress":"9 Hange Road","Phone":"999641587484","Contact":"Alex","Meeting Time":"12:00-13:00"}","2":"{"Adress":"8 Hange Road","Phone":"888641587484","Contact":"Bob","Meeting Time":"13:00-14:00"}....so on.."} ?
moogeek
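
One way to collect the results, sketched under assumptions: each parseEntry() result is flattened into a name => value map, the maps are stored under keys starting at 1, and the whole array is passed to json_encode(). The helper name entryToMap() and the variables $entries and $all are hypothetical, and the sample data below simply imitates the shape that parseEntry() returns so the snippet runs without the library.

```php
<?php
// Hypothetical helper: turn one parseEntry() result into a name => value map.
function entryToMap(array $entry) {
    $map = array();
    foreach ($entry as $field) {
        // "Adress:" becomes "Adress"; values lose their surrounding whitespace.
        $name = rtrim(trim($field['name']), ':');
        $map[$name] = trim($field['text']);
    }
    return $map;
}

// Stand-in data in the shape parseEntry() returns, one inner array per div.
$entries = array(
    array(
        array('name' => 'Adress:', 'text' => ' 9 Hange Road'),
        array('name' => 'Phone:', 'text' => ' 999641587484'),
        array('name' => 'Contact:', 'text' => ' Alex'),
        array('name' => 'Meeting Time:', 'text' => ' 12:00-13:00'),
    ),
    array(
        array('name' => 'Adress:', 'text' => ' 8 Hange Road'),
        array('name' => 'Phone:', 'text' => ' 888641587484'),
        array('name' => 'Contact:', 'text' => ' Bob'),
        array('name' => 'Meeting Time:', 'text' => ' 13:00-14:00'),
    ),
);

$all = array();
$n = 1; // keys start at 1, so json_encode() emits an object rather than a list
foreach ($entries as $entry) {
    $all[$n++] = entryToMap($entry);
}

echo json_encode($all), "\n";
```

With the real page, the loop body would presumably become $all[$n++] = entryToMap(parseEntry($div)); inside foreach($html->find('div.text_small') as $div).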
A: 

Any other ideas?

moogeek