Match an HTML tag using a Perl-compatible regex in PHP.

I want the tag to match only if it contains class="details" somewhere in the opening tag.

I want to match <table border="0" class="details"> but not <table border="0">.

I wrote this to match it:

'#<table(.+?)class="details"(.+?)>#is'

The <table(.+?) part creates a problem: it matches the first table tag it finds and only stops when it reaches class="details", no matter how far down the markup that occurs.

I think this logic would fix my problem:

"Match <table but only if it contains class="details" before the next >"

How can I write this?

A: 

You will probably need a positive lookahead of some form. Here is a very crude one that clearly has its limitations:

<table(?=[^>]*class="details")[^>]*>
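As a rough illustration of the difference, the sketch below runs the pattern from the question and the lookahead pattern against the same input. The sample markup and the variable names are made up for this example; only the two patterns come from the thread.

```php
<?php
// Hypothetical sample markup: a plain table followed by the table we want.
$html = '<table border="0"><tr><td>x</td></tr></table>'
      . '<table border="0" class="details"><tr><td>y</td></tr></table>';

// Pattern from the question: with the s modifier, (.+?) may run past ">"
// and into later tags until it finds class="details".
$original = '#<table(.+?)class="details"(.+?)>#is';

// Lookahead pattern: [^>]* cannot cross the ">" that closes the opening tag.
$bounded = '#<table(?=[^>]*class="details")[^>]*>#i';

preg_match($original, $html, $m1);
preg_match($bounded, $html, $m2);

$originalMatch = $m1[0]; // starts at the FIRST <table and spans both tables
$boundedMatch  = $m2[0]; // only the opening tag that carries class="details"

echo $originalMatch, "\n";
echo $boundedMatch, "\n";
```

The bounded pattern still assumes that no attribute value contains a literal > and that the class attribute is written exactly as class="details".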
Scuzzy
AFAIK lookahead/lookbehind do not support patterns with a variable-size match, so this won't work.
Gopi
This works in PHP for me: <?php echo preg_match('/<table(?=[^>]*class="details")[^>]*>/','<table border="0" class="details">'); ?>
Scuzzy
-1: HTML and regexes are like caesium and water... You are waiting for a disaster if you mix the two. Please see [this answer](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454)
Andrew Moore
Thanks Scuzzy, I was struggling with positive lookaheads and this works well for my specific purpose, even if it is bad for the task. For anyone else reading this and trying to consume html, read Andrew Moore's warning and don't use Regex. It is not the proper solution.
JMC
Regexes are not the best solution for parsing HTML. But as long as you acknowledge the problem, there's no harm done in using them. There might be issues, like not matching certain data, but so? I don't understand you downvoting a perfectly valid solution if it doesn't use your choice of utilities.
jmz
+1  A: 

HTML cannot be parsed reliably using regular expressions. There are a few simple cases that have a solution, but they are exceptions. I think your case is unsolvable using regex, but I am not sure.

You should work with it using XML tools, such as an XML parser and XPath, to search and test your conditions. It is very simple to write an expression that matches your case. I don't know how to build an XML tree and execute an XPath query in PHP, but the XPath expression is:

//table[@class='details']
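For reference, a minimal sketch of running that expression from PHP, assuming the bundled DOM extension (DOMDocument and DOMXPath) is available. The sample markup and variable names are invented for the example.

```php
<?php
// Hypothetical sample markup; real input would be the page being processed.
$html = '<html><body>'
      . '<table border="0"><tr><td>x</td></tr></table>'
      . '<table border="0" class="details"><tr><td>y</td></tr></table>'
      . '</body></html>';

// Collect libxml warnings internally instead of printing them,
// since real-world HTML is often not well formed.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Run the XPath expression over the parsed tree.
$xpath  = new DOMXPath($doc);
$tables = $xpath->query("//table[@class='details']");

$count      = $tables->length;
$firstClass = $count > 0 ? $tables->item(0)->getAttribute('class') : null;

echo $count, "\n";
```

Note that @class='details' compares the whole attribute value, so a tag such as class="details wide" would not match. A test like contains(concat(' ', normalize-space(@class), ' '), ' details ') is one common way to handle multiple classes.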
Gaim
+1 for the correct XPath
Gordon
A: 

While regular expressions can be good for a large variety of tasks, I find they usually fall short when parsing an HTML DOM. The problem with HTML is that the structure of a document is so variable that it is hard to extract a tag accurately (and by accurately I mean a 100% success rate with no false positives).

What I recommend you do is use a DOM parser such as phpQuery and use it as such:

function get_first_image($html){
    $dom = phpQuery::newDocument($html);

    $first_img = $dom->find('img:first');

    if($first_img !== null) {
        return $first_img->attr('src');
    }

    return null;
}

Some may think this is overkill, but in the end, it will be easier to maintain and also allows for more extensibility. For example, using the DOM parser, I can also get the alt attribute.

A regular expression could be devised to achieve the same goal, but it would be limited in such a way that it would force the alt attribute to come after the src or the opposite, and overcoming this limitation would add more complexity to the regular expression.

Also, consider the following. To properly match an <img> tag using regular expressions and to get only the src attribute (captured in group 2), you need the following regular expression:

<\s*?img\s+[^>]*?\s*src\s*=\s*(["'])((\\?+.)*?)\1[^>]*?>

And then again, the above can fail if:

  • The attribute or tag name is in capital and the i modifier is not used.
  • Quotes are not used around the src attribute.
  • An attribute other than src uses the > character somewhere in its value.
  • Some other reason I have not foreseen.
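The first three failure cases can be checked with a short script. This is only a sketch: the sample tags and variable names are made up, and the pattern is the one quoted above, wrapped in ~ delimiters with no modifiers.

```php
<?php
// The <img> pattern from above, in a nowdoc so quotes and backslashes stay literal.
$pattern = <<<'RE'
~<\s*?img\s+[^>]*?\s*src\s*=\s*(["'])((\\?+.)*?)\1[^>]*?>~
RE;

// Hypothetical sample tags, one per failure case plus a well-behaved one.
$samples = [
    'quoted'     => '<img src="a.png">',              // the case the pattern expects
    'uppercase'  => '<IMG SRC="a.png">',              // capitals, no i modifier
    'unquoted'   => '<img src=a.png>',                // no quotes around the value
    'gt_in_attr' => '<img alt="a > b" src="a.png">',  // ">" inside another attribute
];

$results = [];
foreach ($samples as $label => $tag) {
    // preg_match returns 1 on a match and 0 when nothing matches.
    $results[$label] = preg_match($pattern, $tag);
    echo $label, ': ', $results[$label], "\n";
}
```

Only the quoted sample matches. The other three are ordinary markup that a browser or a DOM parser would handle without trouble.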

So again, simply don't use regular expressions to parse a DOM document.

Simple example on how to solve your problem with phpQuery:

$dom = phpQuery::newDocument($html);
$matching_tags = $dom->find('table.details'); // only <table> elements with class "details"
Andrew Moore
The problem with HTML is not that it is "variable" but that it is SGML, which is the parent of XML, and these languages are not parseable using a Turing machine.
Gaim
@Gaim: chances are, if someone is trying to parse html using regex, they don't know anything about Turing's machines...
Andrew Moore
@Andrew Of course, simple cases are solvable, but you can't be sure that your solution always works. XML tools give certainty.
Gaim
As a novice, I think your answer makes sense. Convert the source HTML to XML and parse the XML using a parser such as XPath? That way it evaluates the same regardless of the conditions?
JMC
@acidjazz You don't have to convert anything. You have a document, so build an XML tree over it. XPath is not a parser; it is a query language. The parsers for XML are DOM and SAX. After you build the tree, you can execute the query over it and it returns all tags that match your rule.
Gaim
@Gaim Thanks for clearing up my confusion. Your solution seems the best in theory for solving the problem. Andrew Moore's is good as a PHP-specific answer, since the response provides PHP examples and likely gets the job done, though possibly not 100% of the time, as you stated. Thanks to both of you for your responses.
JMC
@Gaim: phpQuery is simply a class library over XPath.
Andrew Moore