I'm trying to retrieve link text from an HTML file. Each of the links has a specific class applied to it, but the URLs are different.

I have the following:

...
<a class="fetch-me" href="products/1">Find ME!!!</a>
...
<a class="fetch-me" href="products/2">Me too!</a>
...

I am using the following PHP code, but always getting more than I want:

preg_match_all('(<a class="fetch-me" href=".*">(.*)</a>)siU', $string, $matching_data);
A: 

What about something like:

/<a[^>]*>([^<]*)<\/a>/siU
Mark E
A: 

If you must use a regex, use .*? instead of .*. *? is the non-greedy version of *; that is, rather than matching as much as possible, it matches as little as possible.
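To make the difference concrete, here is a minimal sketch using the two links from the question. The `~` delimiters and the variable names are my own choices. I've also left out the U modifier on purpose: U inverts greediness, so with U set, `.*?` would become the greedy form.

```php
<?php
// Sample input built from the two links in the question.
$html = '<a class="fetch-me" href="products/1">Find ME!!!</a> ... '
      . '<a class="fetch-me" href="products/2">Me too!</a>';

// Greedy: .* grabs as much as it can, so a single match spans both links.
preg_match_all('~<a class="fetch-me" href=".*">(.*)</a>~s', $html, $greedy);

// Non-greedy: .*? stops at the first point where the rest of the pattern fits.
preg_match_all('~<a class="fetch-me" href=".*?">(.*?)</a>~s', $html, $lazy);

print_r($greedy[1]); // one capture only
print_r($lazy[1]);   // one capture per link
```

The greedy pattern finds one match that runs from the first `<a` to the last `</a>`, while the non-greedy one finds each link separately.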

(By the way, don't try matching HTML or XML with regular expressions; that way lies madness. Instead, use an HTML or XML parser. If you don't have an HTML parser, run the input through HTML Tidy and use an XML parser. See meder's answer for how to do this in PHP.)

Brian Campbell
I would say regex is ok for such a small and specific task (where nothing can really go wrong). But I'm probably going to get killed for saying this.
Joel L
Clearly, something can go wrong, as he's having trouble getting his regex to work; it's consuming too much input. And even if he fixes that, there will be tags with extra whitespace somewhere that he didn't account for, or attributes in a different order, or any number of other problems. By the time you fix your regex to account for all of those variations, it's far, far easier to just run your input through a real parser, and select your element using the XPath expression `a[@class="fetch-me"]` or CSS query `a.fetch-me` (depending on which your HTML or XML parser library supports).
Brian Campbell
HTML and XML parsing is a solved problem. The libraries have been written. Why reinvent the wheel badly? Just use the libraries that already exist! http://docs.php.net/manual/en/class.domxpath.php
Brian Campbell
A: 

One way:

$str= <<<A
blah blah
blah
...
<a class="fetch-me" href="products/1">Find ME!!!</a>
<a class="fetch-me" href="products/2">Me too!</a>
blah
blah
<a class="fetch-me"
          href="products/1">Find me, i am at next line!!!</a> blah blah
A;
$s = explode("</a>",$str);
foreach ($s as $k ){
    if (strpos($k,"href" ) !==FALSE ){
        print "--> ". preg_replace("/^.*href=\".*\">|\">.*/sm","",$k)."\n";
    }
}

Output:

$ php test.php
--> Find ME!!!
--> Me too!
--> Find me, i am at next line!!!

Ideally, you should use an actual parser, like everybody else said.

ghostdog74
A: 
<?php

$str = '
<a class="fetch-me" href="products/1">Find ME!!!</a>
...
<a class="fetch-me" href="products/2">Me too!</a>
';

$doc = new DOMDocument();
$doc->loadHTML($str);
$xp = new DOMXpath($doc);
$query = $xp->evaluate('//a[@class="fetch-me"]');

// Print the text of every matching link.
if ($query->length > 0) {
    foreach ($query as $anchor) {
        echo $anchor->nodeValue . '<br>';
    }
}

You can also use the XPath contains() function in combination with @class if the links may carry multiple class values. A higher-level wrapper around DOM is another option.
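A rough sketch of the contains() variant, in case it helps. The sample markup with extra classes is made up for illustration, and padding the attribute with spaces is just one common way to match a whole class token rather than a substring:

```php
<?php
// Hypothetical markup: the first link has a second class, and the third has a
// class that only contains "fetch-me" as a substring.
$html = '
<a class="fetch-me featured" href="products/1">Find ME!!!</a>
<a class="fetch-me" href="products/2">Me too!</a>
<a class="do-not-fetch-me" href="products/3">Not me</a>
';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);

// Pad the class list with spaces so only the whole token "fetch-me" matches.
$nodes = $xp->query(
    '//a[contains(concat(" ", normalize-space(@class), " "), " fetch-me ")]'
);

$texts = [];
foreach ($nodes as $anchor) {
    $texts[] = trim($anchor->nodeValue);
}
print_r($texts);
```

A plain `contains(@class, "fetch-me")` would also pick up the third link, which is why the sketch pads both sides with spaces.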

meder
This is the answer. Ignore my answer (other than the part about not using regexes), and use this. I don't know PHP, so I can't write up an example of how to use their HTML parser and XPath libraries off the top of my head, but in any language, the answer is to use the HTML or XML parser that already exists in your language.
Brian Campbell
A: 

I've tried all of these answers and everyone's probably right. I am going to refactor to use HTML Tidy and a real parser.

Thanks for the suggestions.

Craig Gardner