Hello,

I've read a few questions on here re parsing HTML with regex, and I understand that this is, on the whole, a terrible idea.

Having said this, I have a very specific problem that I think regex might be the answer to. I've been fumbling around trying to work out the answer, but I'm new (as of today) to regex, and I was hoping some kind-hearted person might be able to help me out.

I have an array of strings that always follow the format

STUFF HERE<a href="somewhere" title="something" target="_blank">name of thing</a>STUFF HERE

What I'm hoping to achieve is to be left with just the 'somewhere' and the 'name of thing', so that I can output just <a href="somewhere">name of thing</a>.

The array of strings comes from an RSS feed of links on my Facebook profile, if you happen to be interested.

Many, many thanks for any help.

Jack

+2  A: 

I don't know PHP, but you can use the following (extremely brittle) regex:

<a href="(.+?)" title=".+?" target="_blank">(.+?)</a>

This will capture the URL and the text of the link.

If you want to be somewhat more flexible, you could allow any attributes, like this:

<a .*?href="(.+?)".*?>(.+?)</a>
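
In PHP, the flexible pattern could be applied roughly as follows. This is only a sketch: the `~` delimiters, the `i` flag, and the sample `$str` are assumptions for illustration, not part of the pattern above.

```php
<?php
// Sample input in the format from the question (an assumption for illustration).
$str = 'STUFF HERE<a href="somewhere" title="something" target="_blank">name of thing</a>STUFF HERE';

// The flexible pattern, wrapped in ~ delimiters so the / in </a> needs no
// escaping; the i flag makes the match case-insensitive.
$pattern = '~<a .*?href="(.+?)".*?>(.+?)</a>~i';

$href = null;
$text = null;
if (preg_match($pattern, $str, $matches)) {
    $href = $matches[1]; // first capture group: the URL
    $text = $matches[2]; // second capture group: the link text
    echo $href, "\n", $text, "\n";
}
```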
SLaks
Only two things: you don't need to escape `"` and I would run that in case insensitive mode.
Alix Axel
It also fails (twice!) for links where there is only an `href` attribute; check my answer.
Alix Axel
@Alix: Fixed; thanks.
SLaks
+7  A: 

I understand completely where you're coming from on the pragmatism scale.

However, PHP does have a very nice, straightforward HTML parser, and it seems simple enough to get working that I'd hesitate not to recommend it.
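
The parser this answer points to is not named in the text, so the sketch below swaps in PHP's bundled DOM extension (`DOMDocument`) instead. The sample `$str` and the `$links` array are assumptions for illustration.

```php
<?php
// Sample input in the format from the question (an assumption for illustration).
$str = 'STUFF HERE<a href="somewhere" title="something" target="_blank">name of thing</a>STUFF HERE';

$doc = new DOMDocument();
// Collect parser warnings internally; the input is a fragment, not a full page.
libxml_use_internal_errors(true);
$doc->loadHTML($str);
libxml_clear_errors();

// Gather the href and text of every <a> element found.
$links = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    $links[] = [
        'href' => $a->getAttribute('href'),
        'text' => $a->textContent,
    ];
}

foreach ($links as $link) {
    // The DOM returns decoded values, so escape them before writing HTML.
    printf(
        '<a href="%s">%s</a>' . "\n",
        htmlspecialchars($link['href']),
        htmlspecialchars($link['text'])
    );
}
```

Unlike a regex, this does not depend on the attribute order or on which other attributes are present.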

Brian Agnew
Ooh that's neat. I'll try that and let you know how it goes. Many thanks.
Jack Shepherd
Didn't go with it in the end, but it's a very interesting parser that I'm sure I'll use again in the future. Thanks.
Jack Shepherd
Why not use the native parser? http://www.php.net/manual/en/book.simplexml.php
troelskn
+1  A: 
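$str = 'STUFF HERE<a href="somewhere" title="something" target="_blank">name of thing</a>STUFF HERE';
// Group 1 captures the href value; group 2 captures the link text.
// [^>]* skips the remaining attributes without running past the opening tag.
$success = preg_match('/href=\"([^\"]+)\"[^>]*>([^<]+)</i', $str, $matches);
if ($success) {
    echo $matches[1], "\n"; // the href
    echo $matches[2], "\n"; // the link text
} else {
    echo "Parsing failed.";
}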

The parenthesized groups isolate portions of the match for the $matches array. If the pattern matches the string at all, then $matches[1] will contain your href and $matches[2] will contain your link text.

Inside the parentheses, I'm defining the meat of the segments you're interested in with negated character classes. The first one is [^\"]+, which is one or more of any character except a double quote. The second is [^<]+, which is one or more of any character except a less-than sign. This ensures that, if the markup is consistently in the format you provided, you have well-defined boundaries on either side of the portions you're interested in.
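
To go from the two captures to the `<a href="somewhere">name of thing</a>` output the question asks for, something like the following might work. It is a sketch only: the helper name `simplify_link` is hypothetical, and the pattern is a variant built from the same negated character classes.

```php
<?php
// Hypothetical helper: returns a stripped-down link, or null if no link is found.
function simplify_link(string $str): ?string
{
    // [^"]+ captures the href value, [^>]* skips the remaining attributes,
    // and [^<]+ captures the link text up to the next tag.
    if (!preg_match('/href="([^"]+)"[^>]*>([^<]+)</i', $str, $matches)) {
        return null;
    }
    // Both captures come straight from HTML, so they are not re-escaped here.
    return '<a href="' . $matches[1] . '">' . $matches[2] . '</a>';
}

echo simplify_link('STUFF HERE<a href="somewhere" title="something" target="_blank">name of thing</a>STUFF HERE'), "\n";
```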

Eric Kolb
Just trying this now, I get: "preg_match() [function.preg-match]: Unknown modifier '*'". Is this my mistake?
Jack Shepherd
Nope, my bad. My PHP syntax is a little rusty since it's not my primary language these days. I've been using a language where regex patterns don't need the / / delimiters. I've edited my post; hopefully that's more accommodating.
Eric Kolb
Perfect! Thank you.
Jack Shepherd
A: 

SLaks's regex may have some problems with links that have no attributes other than href; here is my take:

~<a.+?href="(.+?)".*?>(.+?)</a>~i
Alix Axel
isn't `.+?` equivalent to `.*` ?
Javier
@Javier: Not at all!
Alix Axel
@Javier: No, it isn't. `.*?` and `.+?` are lazy and will match as little as possible; `.*` and `.+` are greedy and will match as much as possible.
SLaks
ah, i see. in other regexp dialects `*` means '0 or more, greedy', `+` means '1 or more, greedy', and `?` means '0 or 1, greedy'; making `.+?` equivalent to `.*`, if at all valid.
Javier
@Javier: I highly doubt that, `.+?` reads: **match any character 1 or more times, as few times as possible** while `.*` reads: **match any character 0 or more times (as many times as possible)**.
Alix Axel
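
The difference discussed in these comments can be checked with a short PHP sketch. The sample strings are assumptions for illustration.

```php
<?php
// Two links in one string make the greedy/lazy difference visible.
$str = '<a href="one">first</a> and <a href="two">second</a>';

// Lazy: .+? stops at the first closing quote it can.
preg_match('~href="(.+?)"~', $str, $lazy);

// Greedy: .+ runs on to the last closing quote in the string.
preg_match('~href="(.+)"~', $str, $greedy);

echo $lazy[1], "\n";   // one
echo $greedy[1], "\n"; // one">first</a> and <a href="two

// .* accepts an empty value, while .+? still requires at least one character.
$empty = 'href=""';
$starMatches = preg_match('~href="(.*)"~', $empty);  // 1 (match)
$plusMatches = preg_match('~href="(.+?)"~', $empty); // 0 (no match)
```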
A: 

I've tested with my own Facebook feed and could load it with SimpleXML. Well, partly. The RSS feed cannot be loaded directly, but if you fetch the feed with MagPie first, you can then load the description element with SimpleXML like this:

$xml = simplexml_load_string($description); // load description
$link = $xml->xpath('//a');                 // find all links inside
$href = (string) $link[0]['href'];          // get URL
$text = (string) $link[0];                  // and link text

As long as Facebook does not break the HTML inside the description, it is safe to use SimpleXML. If they break it, SimpleXML will complain.
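
One way to handle the case where SimpleXML complains is to collect libxml's errors instead of letting them surface as warnings. A minimal sketch, assuming the description is a single well-formed element; the sample `$description` is hypothetical, and a real feed's markup may differ.

```php
<?php
// Hypothetical description fragment with a single root element.
$description = '<div>STUFF HERE<a href="somewhere" title="something" target="_blank">name of thing</a>STUFF HERE</div>';

// Collect libxml errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);
$xml = simplexml_load_string($description);

$href = null;
$text = null;
if ($xml === false) {
    // Malformed markup: report what libxml objected to.
    foreach (libxml_get_errors() as $error) {
        echo 'XML error: ', trim($error->message), "\n";
    }
    libxml_clear_errors();
} else {
    $link = $xml->xpath('//a');
    if (!empty($link)) {
        $href = (string) $link[0]['href']; // URL
        $text = (string) $link[0];         // link text
    }
}
```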

Gordon
That would have been a much quicker way around, damn. In the end I settled on something totally different: `function get($a, $b, $c) { /* gets a string between 2 strings */ $y = explode($b, $a); $x = explode($c, $y[1]); return $x[0]; }`, then just ran that on the description, separating the links and the titles from the rest of the FB stuff. Ugly, but it works. I think I was barking up the wrong tree entirely with regex; I should think more about my question next time...
Jack Shepherd
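
The `get()` helper from the comment above can be written out as a guarded variant along these lines. This is a sketch: the name `get_between`, the `null` return for missing delimiters, and the sample `$str` are assumptions added for illustration.

```php
<?php
// Returns the part of $haystack between the first $start and the next $end,
// or null if either delimiter is missing.
function get_between(string $haystack, string $start, string $end): ?string
{
    $afterStart = explode($start, $haystack, 2);
    if (count($afterStart) < 2) {
        return null; // $start not found
    }
    $beforeEnd = explode($end, $afterStart[1], 2);
    if (count($beforeEnd) < 2) {
        return null; // $end not found after $start
    }
    return $beforeEnd[0];
}

$str  = 'STUFF HERE<a href="somewhere" title="something" target="_blank">name of thing</a>STUFF HERE';
$href = get_between($str, 'href="', '"');
$text = get_between($str, '">', '</a>');
echo '<a href="', $href, '">', $text, '</a>', "\n";
```

Like the regex versions, this relies on the markup keeping exactly the format shown in the question.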