Please see the code:

$result = "<b>Associated Names</b>&nbsp;&nbsp;[<a href='http://www.examples.com/authors.html?act=change&amp;id=6141&amp;item=associated'><u>Edit</u></a>]</td>
        </tr>
        <tr>
          <td class='text' align='left'>G&#12539;R<br />G-R<br />         </td>";

preg_match_all("/<b>Associated Names.{10,100}<td class='text' align='left'>((.*<br \/>)*).*<\/td>/sU", $result, $assoc);
var_dump($assoc);
-----------------------------------------------------------
RESULT 
array
  0 => 
    array
      0 => string '<b>Associated Names</b></td>
        </tr>
        <tr>
          <td class='text' align='left'>G&#12539;R<br />G-R<br />         </td>' (length=135)
  1 => 
    array
      0 => string '' (length=0)
  2 => 
    array
      0 => string '' (length=0)

I want it to return:

array(
    1 => 
     array
      0 => string 'G&#12539;R',
    2 => 
     array
      0 => string 'G-R'
)

It is a matter of the parentheses ((.)). I want to fix it; please help me.

+3  A: 

Please don't try to parse HTML with regular expressions; it invokes the wrath of Zalgo.

Try using the DOM and xpath to target the specific elements and attributes you are attempting to extract.

(I'd provide an xpath example, but it's still on my to-learn list... :) )

Charles
Thanks for the advice.
meotimdihia
Unfortunately, sometimes it is the only way, because not every page is well formatted. Many times, Zend Dom Query has failed to create the DOM correctly and I've got wrong results. Not a fault of the framework, of course, but parsing can get messy. I use both approaches, ad hoc.
john
@john, have you tried to run the page through [tidy](http://us2.php.net/manual/en/book.tidy.php) first?
Charles
@Charles, yes, for a specific page that was quite troublesome (a scraping project), I used an external tidy service before creating the DOM, without success. For the same page, I also tried using a ready-made class to tidy it. It always gave back half the page. I decided not to dig deeper. :)
john