Hello, I would like to extract the strings from this table row:

<tr class="rowodd" onclick="window.location.href='/portal/offers/show/entityId/32114';">
  <td>01.10.2009</td>
   <td>AN09551</td>
     <td>[2009132] Ich bin Un.&nbsp;<a href="/portal/clients/show/entityId/762350"><myimsrc="/img/bullet_go.pngs" alt="" title="Kundenakte aufrufen"></a></td>
   <td class="number" title="7.500,00&nbsp;€">7.500,00&nbsp;</td>
    <td>Entwurf</td>
     </tr>

I also tried this:

#<tr>.*?<t.*?>(.*?)</t.*?>.*?<t.*?>(.*?)</t.*?>.*?<t.*?>(.*?)</t.*?>.*?</tr>#s

Can anyone help?

+3  A: 

As numerous people will/have pointed out, you're much better off using an HTML/XML parser for the above (like this one). HTML isn't regular and there are numerous edge cases to code around if you use a regular expression.

Given that you just want to extract the text, perhaps XPath will help. An expression such as:

/tr/td/text()

may do the trick.
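
A minimal sketch of how that XPath idea might look in PHP with DOMDocument and DOMXPath. The row below is a simplified copy of the one in the question, and the surrounding `<table>` wrapper is an assumption added so the parser sees a valid table. Because loadHTML() wraps the markup in html and body elements, the sketch uses `//tr[...]/td/text()` rather than the absolute `/tr/td/text()`.

```php
<?php
// Simplified copy of the row from the question; the <table> wrapper is assumed.
$html = '<table><tr class="rowodd">'
      . '<td>01.10.2009</td>'
      . '<td>AN09551</td>'
      . '<td>[2009132] Ich bin Un. <a href="/portal/clients/show/entityId/762350"></a></td>'
      . '<td class="number" title="7.500,00">7.500,00</td>'
      . '<td>Entwurf</td>'
      . '</tr></table>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Select the text nodes that are direct children of each cell in the row.
$texts = [];
foreach ($xpath->query('//tr[@class="rowodd"]/td/text()') as $node) {
    $texts[] = trim($node->nodeValue);
}

print_r($texts);
```

Restricting the query to `tr[@class="rowodd"]` is only one way to pick the row; any attribute that identifies it on the real page would work the same way.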

Brian Agnew
A: 

Isn't strip_tags an option?

It strips all tags and leaves only the text between them. Note that the attributes are stripped along with the tags.

In your case this would result in:

  01.10.2009
   AN09551
     [2009132] Ich bin Un. 
   7.500,00 € 
    Entwurf
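
A rough sketch of the strip_tags approach, assuming the row is available as a string (the `$row` variable and the slightly simplified markup are illustrative). strip_tags() leaves entities such as `&nbsp;` untouched, so the sketch decodes them and turns the non-breaking space into a normal one before trimming.

```php
<?php
// Simplified copy of the row from the question, held in a hypothetical $row variable.
$row = <<<'HTML'
<tr class="rowodd">
  <td>01.10.2009</td>
  <td>AN09551</td>
  <td>[2009132] Ich bin Un.&nbsp;<a href="/portal/clients/show/entityId/762350"></a></td>
  <td class="number" title="7.500,00">7.500,00&nbsp;</td>
  <td>Entwurf</td>
</tr>
HTML;

// Remove every tag (and its attributes), keeping only the text in between.
$text = strip_tags($row);

// Decode entities, then replace the non-breaking space with a plain space.
$text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
$text = str_replace("\u{00A0}", ' ', $text);

// One cell per line: trim each line and drop the empty ones.
$cells = array_values(array_filter(
    array_map('trim', explode("\n", $text)),
    function ($line) { return $line !== ''; }
));

print_r($cells);
```

This only works as long as each cell sits on its own line; once the tags are gone there is nothing else left to tell the cells apart.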
knittl
Could be, I have to test it.
streetparade
A: 

Otherwise with a regexp you could use this (with multi-line option):

(?:\<td[^\>]*?\>([^\<]*?)\</td\>)+

But as pointed out by @Brian Agnew, this is nowhere near as good as an XML/HTML parser...
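
To see what the pattern actually captures, here is a small sketch with preg_match_all on a simplified copy of the row (the `$row` variable and the `#` delimiters are my choice, and the redundant backslashes are dropped). It also shows one of the edge cases: the third cell contains an `<a>` tag, and since `[^<]` cannot cross it, that cell is skipped.

```php
<?php
// Simplified copy of the row from the question, held in a hypothetical $row variable.
$row = <<<'HTML'
<tr class="rowodd">
  <td>01.10.2009</td>
  <td>AN09551</td>
  <td>[2009132] Ich bin Un. <a href="/portal/clients/show/entityId/762350"></a></td>
  <td class="number" title="7.500,00">7.500,00</td>
  <td>Entwurf</td>
</tr>
HTML;

// Same pattern as above, without the unneeded escapes.
preg_match_all('#(?:<td[^>]*?>([^<]*?)</td>)+#s', $row, $matches);

// $matches[1] holds the captured cell contents; the cell with the nested
// <a> tag is missing because [^<]*? stops at the first "<".
print_r($matches[1]);
```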

Locksfree
Worked like nothing else: #(?:\<td[^\>]*?\>([^\<]*?)\</td\>)+#siU Thanks
streetparade
+1  A: 

Try:

// http://simplehtmldom.sourceforge.net/
include('simple_html_dom.php');
$str = '<tr class="rowodd" onclick="window.location.href=\'/portal/offers/show/entityId/32114\';">
  <td>
    01.10.2009
  </td>
  <td>
    AN09551
  </td>
  <td>
    [2009132] Ich bin Un. <a href="/portal/clients/show/entityId/762350">
    <myimsrc="/img/bullet_go.pngs" alt="" title="Kundenakte aufrufen"></a>
  </td>
  <td class="number" title="7.500,00">
    7.500,00
  </td>
  <td>
    Entwurf
  </td>
</tr>';
$html = str_get_html($str);
foreach($html->find('td') as $element) {
  echo trim($element->innertext) . "\n";
}

Output:

01.10.2009
AN09551
[2009132] Ich bin Un. <a href="/portal/clients/show/entityId/762350">
    <myimsrc="/img/bullet_go.pngs" alt="" title="Kundenakte aufrufen"></a>
7.500,00
Entwurf
Bart Kiers
Call to undefined function str_get_html(). Is it simple_html_parser?
streetparade
But it's an HTML page, so there may be a lot more td's. So finding all td's isn't a good idea.
streetparade
Yes, str_get_html() is defined in simple_html_parser
Bart Kiers
You can get certain (or just one) tr's based on a given attribute and get the td's from it. Read the documentation, it's pretty straightforward.
Bart Kiers
+1  A: 

Don't use so many unspecific non-greedy expressions like .*?. Though they do what you want, they cause a lot of backtracking and make the whole expression inefficient, especially when you use so many of them.

Try to be as explicit as possible:

#<tr\b(?:[^"'>]*|"[^"]*"|'[^']*')*>\s*
    <td\b(?:[^"'>]*|"[^"]*"|'[^']*')*>((?:[^<]|(?!</td\s*>)<)*)</td\s*>\s*
    <td\b(?:[^"'>]*|"[^"]*"|'[^']*')*>((?:[^<]|(?!</td\s*>)<)*)</td\s*>\s*
    <td\b(?:[^"'>]*|"[^"]*"|'[^']*')*>((?:[^<]|(?!</td\s*>)<)*)</td\s*>\s*
    <td\b(?:[^"'>]*|"[^"]*"|'[^']*')*>((?:[^<]|(?!</td\s*>)<)*)</td\s*>\s*
    <td\b(?:[^"'>]*|"[^"]*"|'[^']*')*>((?:[^<]|(?!</td\s*>)<)*)</td\s*>\s*
</tr\s*>#sx

But as you see, this is a mess.

You would be better off using an HTML parser such as DOMDocument. Then you can query the elements with XPath as Brian Agnew suggested. That's far more reliable and convenient than regular expressions.
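
A sketch of how that could look when the page holds more than one row, assuming a simplified table. The second row and the trimmed-down cells are made-up sample data, not from the question. Each row is queried on its own, so the cells stay grouped per row instead of ending up in one flat list of td's.

```php
<?php
// Hypothetical sample: the second row is invented for illustration.
$html = '<table>'
      . '<tr class="rowodd"><td>01.10.2009</td><td>AN09551</td><td>Entwurf</td></tr>'
      . '<tr class="roweven"><td>02.10.2009</td><td>AN09552</td><td>Entwurf</td></tr>'
      . '</table>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

$rows = [];
foreach ($xpath->query('//tr') as $tr) {
    $cells = [];
    // The second argument makes the query relative to the current row.
    foreach ($xpath->query('td', $tr) as $td) {
        $cells[] = trim($td->textContent);
    }
    $rows[] = $cells;
}

print_r($rows);
```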

Gumbo
Thanks, it worked: $pattern = '#<tr\b(?:[^\"\'>]*|\"[^\"]*\"|\'[^\']*\')*>\s* <td\b(?:[^\"\'>]*|\"[^\"]*\"|\'[^\']*\')*>((?:[^<]|(?!</td\s*>)<)*)</td\s*>\s* <td\b(?:[^"\'>]*|"[^"]*"|\'[^\']*\')*>((?:[^<]|(?!</td\s*>)<)*)</td\s*>\s* <td\b(?:[^"\'>]*|"[^"]*"|\'[^\']*\')*>((?:[^<]|(?!</td\s*>)<)*)</td\s*>\s* <td\b(?:[^"\'>]*|"[^"]*"|\'[^\']*\')*>((?:[^<]|(?!</td\s*>)<)*)</td\s*>\s* <td\b(?:[^"\'>]*|"[^"]*"|\'[^\']*\')*>((?:[^<]|(?!</td\s*>)<)*)</td\s*>\s* </tr\s*>#sx';
streetparade
A: 

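In the PHP world, there's preg_match_all, which makes this much easier than doing it in JS.

// $str holds the HTML of the table row
$ptn = '/<\s*td[^>]*>([^<>]*)</';
preg_match_all($ptn, $str, $matches);
// $matches[1] contains the text of each td up to the first nested tag
print_r($matches);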

Test the result in Preg Tester

unigg