ansaurus

Question

Regex: Match html tag only if it contains a specific class id

Answer 1

A:

You will probably need a positive lookahead of some form. Here is a very crude one that clearly has its limitations:

<table(?=[^>]*class="details")[^>]*>
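To make the limitations concrete, here is a minimal PHP sketch that runs the pattern above against a few tags. The helper name `matches_details_table` and the sample tags are made up for illustration; they are not from the question.

```php
<?php
// Hypothetical helper: true if $html contains a <table ...> tag whose
// attribute text includes the literal string class="details".
function matches_details_table($html) {
    $pattern = '/<table(?=[^>]*class="details")[^>]*>/';
    return preg_match($pattern, $html) === 1;
}

// The intended case: matches.
var_dump(matches_details_table('<table border="0" class="details">'));  // true

// A different class: correctly rejected.
var_dump(matches_details_table('<table class="summary">'));             // false

// Missed: single quotes, or a second class name in the same attribute.
var_dump(matches_details_table("<table class='details'>"));             // false
var_dump(matches_details_table('<table class="details wide">'));        // false

// False positive: the lookahead also finds the text inside data-class.
var_dump(matches_details_table('<table data-class="details">'));        // true
```

The last three calls are the kind of input where the crude pattern gives the wrong answer, which is why the comments below recommend a parser for anything beyond markup you control.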

Scuzzy 2010-07-30 05:23:09

AFAIK lookahead/lookbehind do not support patterns with variable-size matches, so this won't work.

Gopi 2010-07-30 05:24:30

This works in PHP for me: <?php echo preg_match('/<table(?=[^>]*class="details")[^>]*>/','<table border="0" class="details">'); ?>

Scuzzy 2010-07-30 05:30:29

-1: HTML and regexes are like caesium and water... You are waiting for a disaster if you mix the two. Please see [this answer](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454)

Andrew Moore 2010-07-30 05:34:36

Thanks Scuzzy, I was struggling with positive lookaheads and this works well for my specific purpose, even if it is bad for the task. For anyone else reading this and trying to consume html, read Andrew Moore's warning and don't use Regex. It is not the proper solution.

JMC 2010-07-30 06:15:58

Regexes are not the best solution for parsing HTML. But as long as you acknowledge the problem, there's no harm done in using them. There might be issues, like not matching certain data, but so? I don't understand you downvoting a perfectly valid solution if it doesn't use your choice of utilities.

jmz 2010-07-30 10:19:24

Answer 2

+1 A:

HTML cannot be parsed reliably using regular expressions. There are a few simple cases that have a solution, but they are exceptions. I think your case is unsolvable using regex, but I am not sure.

You should work with it using XML tools: a parser to build the tree, and XPath for searching and testing your conditions. It is very simple to write an expression that matches your case. I don't know how to build an XML tree and execute an XPath query in PHP, but the XPath expression is

//table[@class='details']
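As a rough sketch of how this expression could be run in PHP, the following uses the bundled DOM extension (`DOMDocument` and `DOMXPath`). The sample markup is made up for illustration. Note that `@class='details'` compares the whole attribute value, so the second query shows a common idiom for matching `details` as one class name among several.

```php
<?php
// Sample markup, made up for illustration.
$html = '<html><body>'
      . '<table class="details"><tr><td>one</td></tr></table>'
      . '<table class="summary"><tr><td>two</td></tr></table>'
      . '<table class="wide details"><tr><td>three</td></tr></table>'
      . '</body></html>';

$doc = new DOMDocument();
// Real-world HTML is often malformed; collect parser warnings
// internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Exact match on the whole attribute value.
$exact = $xpath->query("//table[@class='details']");

// Match "details" as one whitespace-separated class name among several.
$token = $xpath->query(
    "//table[contains(concat(' ', normalize-space(@class), ' '), ' details ')]"
);

echo $exact->length, "\n"; // 1
echo $token->length, "\n"; // 2
```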

Gaim 2010-07-30 05:25:39

+1 for the correct XPath

Gordon 2010-07-30 06:34:43

Answer 3

A:

While regular expressions can be good for a large variety of tasks, I find they usually fall short when parsing an HTML DOM. The problem with HTML is that the structure of your document is so variable that it is hard to accurately (and by accurately I mean a 100% success rate with no false positives) extract a tag.

What I recommend you do is use a DOM parser such as phpQuery and use it as such:

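function get_first_image($html){
    $dom = phpQuery::newDocument($html);

    // find() returns a (possibly empty) phpQuery object, never null,
    // so check how many elements were matched instead.
    $first_img = $dom->find('img:first');

    if($first_img->size() > 0) {
        return $first_img->attr('src');
    }

    return null;
}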

Some may think this is overkill, but in the end, it will be easier to maintain and also allows for more extensibility. For example, using the DOM parser, I can also get the alt attribute.

A regular expression could be devised to achieve the same goal, but it would be limited in such a way that it would force the alt attribute to come after the src or the opposite, and overcoming this limitation would add more complexity to the regular expression.

Also, consider the following. To properly match an <img> tag using regular expressions and to get only the src attribute (captured in group 2), you need the following regular expression:

<\s*?img\s+[^>]*?\s*src\s*=\s*(["'])((\\?+.)*?)\1[^>]*?>

And then again, the above can fail if:

The attribute or tag name is in uppercase and the i modifier is not used.
Quotes are not used around the src attribute.
An attribute other than src uses the > character somewhere in its value.
Some other reason I have not foreseen.
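The first three failure cases can be reproduced with a short PHP sketch. The sample tags and array labels are made up for illustration; the pattern is the one given above, placed in a nowdoc so that none of its backslashes need PHP escaping.

```php
<?php
// The pattern from above, verbatim, wrapped in / delimiters.
$pattern = <<<'REGEX'
/<\s*?img\s+[^>]*?\s*src\s*=\s*(["'])((\\?+.)*?)\1[^>]*?>/
REGEX;

// Sample tags, made up for illustration.
$samples = array(
    'quoted'      => '<img src="a.png" alt="A">',
    'uppercase'   => '<IMG SRC="a.png">',
    'unquoted'    => '<img src=a.png>',
    'gt in value' => '<img alt="a > b" src="a.png">',
);

// Store the captured src (group 2) for each tag, or null when nothing matches.
$results = array();
foreach ($samples as $label => $tag) {
    $results[$label] = preg_match($pattern, $tag, $m) === 1 ? $m[2] : null;
}

var_dump($results);
```

Only the `quoted` sample yields `a.png`; the other three come back as null even though a browser would display all four images.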

So again, simply don't use regular expressions to parse a DOM document.

A simple example of how to solve your problem with phpQuery:

$dom = phpQuery::newDocument($html);
$matching_tags = $dom->find('.details');

Andrew Moore 2010-07-30 05:32:01

The problem with HTML is not that it is "variable" but that it is SGML, which is the parent of XML, and these languages are not parseable using a Turing machine

Gaim 2010-07-30 05:41:13

@Gaim: chances are, if someone is trying to parse html using regex, they don't know anything about Turing's machines...

Andrew Moore 2010-07-30 05:43:05

@Andrew Of course, simple cases are solvable, but you can't be sure that your solution always works. XML tools give you certainty.

Gaim 2010-07-30 05:48:25

As a novice, I think your answer makes sense. Convert the source HTML to XML and parse the XML using a parser such as XPath? That way it evaluates the same regardless of the conditions?

JMC 2010-07-30 06:32:13

@acidjazz You don't have to convert anything. You have a document, so build an XML tree over it. XPath is not a parser, it is a query language. Parsers for XML are DOM and SAX. After you build the tree, you can execute the query over it, and it returns all tags that match your rule

Gaim 2010-07-30 06:37:31

@Gaim Thanks for clearing up my confusion. Your solution seems the best in theory for solving the problem. Andrew Moore's is good for PHP specifically, since the response provides PHP examples and likely gets the job done. Possibly not 100% of the time, as you stated. Thanks for both of your responses.

JMC 2010-07-30 07:01:17

@Gaim: phpQuery is simply a class library over XPath.

Andrew Moore 2010-07-31 19:38:14
