Hello, I'm trying to parse a block from an HTML page, so I tried to match the block with preg_match in PHP:

if( preg_match('<\/div>(.*?)<div class="adsdiv">', $data, $t)) 

but it doesn't work. The HTML I'm matching against looks like this:

</div>

blablabla

blablabla

blablabla

<div class="adsdiv">

I want to grab only the blablabla lines. Any help?

A: 

You need to delimit your regex; use /<\/div>(.*?)<div class="adsdiv">/ instead.

Although it doesn't solve the OP's problem, this *is* a valid point. The regex in the question lacks proper delimiters, so preg_match() will emit a warning and return false if you try to use it.
Alan Moore
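To make the delimiter point concrete, here is a minimal sketch. The `$data` string is an assumed stand-in for the real page, and `@` is used only to hide the warning in this demo. Because the question's pattern starts with `<`, PHP appears to treat `<` and `>` as bracket-style delimiters, so everything after the first `>` is read as modifiers and the pattern fails to compile.

```php
<?php
// Assumed sample input; the real page will differ.
$data = "</div>\nblablabla\n<div class=\"adsdiv\">";

// Pattern exactly as written in the question (no explicit delimiters).
// Compilation fails, so preg_match() returns false instead of 0 or 1.
$bad = @preg_match('<\/div>(.*?)<div class="adsdiv">', $data, $t);
var_dump($bad);   // bool(false)

// Same pattern wrapped in ~ delimiters, with s so that . matches newlines.
$good = preg_match('~</div>(.*?)<div class="adsdiv">~s', $data, $t);
var_dump($good);  // int(1)
```

Checking the return value with `=== false` separates "the pattern is broken" from "the pattern is fine but nothing matched".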
A: 

From the PHP Manual:

s (PCRE_DOTALL) - If this modifier is set, a dot metacharacter in the pattern matches all characters, including newlines. Without it, newlines are excluded. This modifier is equivalent to Perl's /s modifier. A negative class such as [^a] always matches a newline character, independent of the setting of this modifier.

So, the following should work:

if (preg_match('~<\/div>(.*?)<div class="adsdiv">~s', $data, $t))

The ~ are there to delimit the regular expression.
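A small sketch of the difference the s modifier makes, assuming the input looks exactly like the snippet in the question (the `$data` string below is made up for illustration):

```php
<?php
// Assumed sample input mirroring the question's markup.
$data = "</div>\nblablabla\nblablabla\nblablabla\n<div class=\"adsdiv\">";

// Without s, the dot cannot cross the newline after </div>, so there is no match.
$withoutS = preg_match('~</div>(.*?)<div class="adsdiv">~', $data, $m1);

// With s, the dot also matches newlines and the whole block is captured.
$withS = preg_match('~</div>(.*?)<div class="adsdiv">~s', $data, $m2);

var_dump($withoutS);      // int(0)
var_dump($withS);         // int(1)
echo trim($m2[1]), "\n";  // prints the three blablabla lines
```

The capture includes the surrounding newlines, so `trim()` is applied before printing.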

Alix Axel
Thank you very much, Alix, it works fine.
normand
+1  A: 

Apart from what has been said above, also add the /s modifier so . will match newlines. (edit: as Alan kindly pointed out, [^<]+ will match newlines anyway)

I always use /U as well, since in these cases you normally want minimal matching by default (it should be faster as well), and /i, since people write <div>, <DIV>, or even <Div>...

if (preg_match('/<\/div>([^<]+)<div class="adsdiv">/Usi', $data, $match))
{
    echo "Found: ".$match[1]."<br>";
} else {
    echo "Not found<br>";
}

Edit: made it a little more explicit!

mvds
Thanks for the reply, mvds, but it returns an empty result, meaning it doesn't work.
normand
Ok I added a little code which shows how to get the matched portion out of it. This should work (although, it requires that the input is *exactly* what you are showing; i.e. not some formatted html by firefox-like "view source"!)
mvds
`[^<]` will match newlines whether you use the `/s` modifier or not.
Alan Moore
thanks, updated the answer.
mvds
And I recommend NOT getting in the habit of using the `/U` modifier. It's better to get *out of* the habit of using `.*`. Reluctant quantifiers speed up matching by avoiding excessive backtracking, but you already took care of that by using `[^<]+` instead of `.*`. If anything, the `/U` is slowing you down, because character-for-character, reluctant quantifiers are slower than greedy ones.
Alan Moore
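To illustrate the trade-off discussed in these comments, here is a hedged sketch comparing the negated class `[^<]+` with the reluctant `.*?`. Both sample strings are assumptions made up for the demo, not the OP's real page.

```php
<?php
// Assumed sample inputs: one with plain text, one with a tag inside the block.
$plain   = "</div>\nblablabla\nblablabla\n<div class=\"adsdiv\">";
$withTag = "</div>\nbla<br>bla\n<div class=\"adsdiv\">";

$negated   = '~</div>([^<]+)<div class="adsdiv">~i';  // no s or U needed
$reluctant = '~</div>(.*?)<div class="adsdiv">~si';   // needs s to cross newlines

// On plain text both patterns capture the same block.
preg_match($negated, $plain, $a);
preg_match($reluctant, $plain, $b);
var_dump($a[1] === $b[1]);  // bool(true)

// The negated class cannot step over the "<" of <br>, so it finds nothing,
// while the reluctant dot still matches.
$negatedHit   = preg_match($negated, $withTag, $c);
$reluctantHit = preg_match($reluctant, $withTag, $d);
var_dump($negatedHit);    // int(0)
var_dump($reluctantHit);  // int(1)
```

So `[^<]+` is a reasonable choice only when the block is known to contain no markup of its own.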
+1  A: 

Regex ain't the right tool for this. Here is how to do it with DOM:

$html = <<< HTML
<div class="parent">
    <div>
        <p>previous div</p>
    </div>
    blablabla
    blablabla
    blablabla
    <div class="adsdiv">
        <p>other content</p>
    </div>
</div>
HTML;

Content in an HTML document consists of text nodes; tags are element nodes. The text node containing blablabla must have a parent node. To fetch its value, we will assume you want all the text nodes that are children of the parent of the div whose class attribute is adsdiv:

$dom = new DOMDocument;
$dom->loadHTML($html);
$xPath = new DOMXPath($dom);
$nodes = $xPath->query('//div[@class="adsdiv"]');
foreach($nodes as $node) {
    foreach($node->parentNode->childNodes as $child) {
        if($child instanceof DOMText) {
            echo $child->nodeValue;
        }
    }
}

Yes, it's not a funky one-liner, but it's also much less of a headache and gives you solid control over the HTML document. Harnessing the query power of XPath, we could have shortened the above to:

$nodes = $xPath->query('//div[@class="adsdiv"]/../text()');
foreach($nodes as $node) {
    echo $node->nodeValue;
}
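The text nodes returned this way still carry the indentation and line breaks of the source markup. As a rough sketch (the `$lines` array and the trimming step are my additions, and the sample markup is the same assumed structure as above), the words can be collected into a clean list like this:

```php
<?php
// Assumed sample markup, same structure as the example above.
$html = <<<HTML
<div class="parent">
    <div>
        <p>previous div</p>
    </div>
    blablabla
    blablabla
    blablabla
    <div class="adsdiv">
        <p>other content</p>
    </div>
</div>
HTML;

$dom = new DOMDocument;
$dom->loadHTML($html);
$xPath = new DOMXPath($dom);

// Collect the non-empty lines from the text nodes next to the adsdiv element.
$lines = [];
foreach ($xPath->query('//div[@class="adsdiv"]/../text()') as $node) {
    foreach (preg_split('/\R/', $node->nodeValue) as $line) {
        $line = trim($line);
        if ($line !== '') {
            $lines[] = $line;
        }
    }
}

print_r($lines);  // three entries, each "blablabla"
```

This requires PHP's DOM extension. Note that `@class="adsdiv"` only matches when the class attribute is exactly that string.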

I kept it deliberately verbose to illustrate how to use DOM though.

Gordon