Hi

I want to parse content from

<td>content</td>
and
<td *?*>content</td>
and 
<td *specific td class*>content</td>

How can I do this in PHP with a regex and preg_match?

+3  A: 

If you have an HTML document, you really shouldn't use regular expressions to parse it: HTML is just not "regular" enough for that.

A far better solution would be to load your HTML document using a DOM parser. For instance, DOMDocument::loadHTML and XPath queries often do a really great job!
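
As a rough illustration of the DOM approach, here is a minimal sketch. The sample markup, the class name "price", and the variable names are assumptions made up for the example; they are not from the question.

```php
<?php
// Hypothetical sample markup; the class name "price" is an assumption.
$html = '<table><tr>'
      . '<td>plain</td>'
      . '<td align="left">with attribute</td>'
      . '<td class="price">9.99</td>'
      . '</tr></table>';

// Collect libxml warnings internally instead of printing them,
// since real-world HTML is rarely perfectly valid.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Text of every <td>, whatever attributes it carries.
$allCells = [];
foreach ($xpath->query('//td') as $td) {
    $allCells[] = trim($td->textContent);
}

// Text of <td> elements whose class attribute is exactly "price".
$priceCells = [];
foreach ($xpath->query('//td[@class="price"]') as $td) {
    $priceCells[] = trim($td->textContent);
}

print_r($allCells);
print_r($priceCells);
```

Note that `@class="price"` only matches when the attribute is exactly that string; a cell carrying several classes would need a different XPath expression.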

Pascal MARTIN
seconded... regex is the hard way.
prodigitalson
A: 

Don't use regex... via this path lies madness!

Hogan
I think this argument is a little tired. If you want to just quickly extract a piece of HTML, regex is more than adequate.
yu_sha
um... No. Regex will never work right and will always fail. It is a bad choice. Bad, bad, bad, bad, bad choice. You will always have bugs. Why implement something guaranteed to have bugs?
Hogan
If you want to scrape a page without respect to its structure, regexes are fine. If you want to *parse* a page of HTML, that is categorically impossible using only regular expressions.
Paul Nathan
Hmm... so if I understand, you are saying regex is OK if you are searching a page for a group of characters ("scraping"), in this case the content inside td tags somewhere on the page? Of course, there is no guarantee this is the content you want; you will just get some content inside some td tags... thus it has bugs.
Hogan
If you are searching the text for a simple expression, regexes are fine. If you are trying to analyze it structurally, regexes are going to fail and be a problem due to fundamental limits of regular-expression theory (see: automata theory/Chomsky hierarchy).
Paul Nathan
Almost everyone on SO has jumped aboard the "regex is evil when used with HTML" bandwagon without any real understanding of the issues. There are cases where it's acceptable and desirable to use regex to extract data; in most cases, when people say "parse", that isn't really what they mean.
Paul Creasey
I believe this might be the case. However, every time I've tried to use regex, or seen it used, to "find stuff" in HTML, it has caused problems. I'm not saying there is no use case free of problems; I've just not seen one yet.
Hogan
A: 

<td>content</td>: <td>([^<]*)</td>

<td *specific td class*>content</td>: <td[^>]*class=\"specific_class\"[^>]*>([^<]*)<
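
For simple, flat markup, patterns along these lines could be applied with preg_match_all roughly as follows. This is only a sketch: the input string and the variable names are invented for illustration, and "specific_class" stands in for the real class name.

```php
<?php
// Hypothetical input; "specific_class" stands in for the real class name.
$html = '<td>first</td><td align="left">second</td>'
      . '<td class="specific_class">third</td>';

// Any <td>, with or without attributes; capture the text up to the next tag.
preg_match_all('~<td\b[^>]*>([^<]*)</td>~i', $html, $m);
$anyCell = $m[1];

// Only <td> tags whose attributes include class="specific_class".
preg_match_all('~<td\b[^>]*\bclass="specific_class"[^>]*>([^<]*)</td>~i', $html, $m);
$classCell = $m[1];

print_r($anyCell);
print_r($classCell);
```

Because the capture group is `[^<]*`, a cell that contains any nested tag (a link, a `<b>`, another table) will not match at all, so this only holds up when the cells contain plain text.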

yu_sha
+4  A: 

I think this sums it up pretty well.

In short, don't use regular expressions to parse HTML. Instead, look at the DOM classes, especially DOMDocument::loadHTML.

Emil Vikström
A: 

@OP, here's one way

$str = <<<A
<td>content</td>
<td *?*>content</td>
<td *specific td class*>content</td>
<td *?*> multiline
content </td>
A;

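// Split on the closing tag so each piece holds at most one cell.
$cells = explode("</td>", $str);
foreach ($cells as $cell) {
    // Strip everything up to and including the opening <td ...> tag.
    $cell = preg_replace("/.*<td.*>/", "", $cell);
    print $cell . "\n";
}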

output

$ php test.php
content

content

content

 multiline
content
ghostdog74