ansaurus

Question

Not Another Parse-HTML-With-Regex Question

Answer 1

+2 A:

I don't know PHP, but you can use the following (extremely brittle) regex:

<a href="(.+?)" title=".+?" target="_blank">(.+?)</a>

This will capture the URL and the text of the link.

If you want to be somewhat more flexible, you could allow any attributes, like this:

<a .*?href="(.+?)".*?>(.+?)</a>

SLaks 2010-01-21 17:57:02
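
For anyone wanting to try the more flexible pattern above from PHP, here is a minimal sketch. The sample markup and variable names are made up for illustration, and the `~` delimiters and `i` flag are choices of this sketch rather than part of the answer. It is still as brittle as the answer warns.

```php
<?php
// Sample input is invented for illustration; real feed markup may differ.
$html = 'See <a href="http://example.com/a" title="A" target="_blank">first</a> and '
      . '<a href="http://example.com/b">second</a>.';

// The flexible pattern from the answer, wrapped in ~ delimiters, case-insensitive.
$pattern = '~<a .*?href="(.+?)".*?>(.+?)</a>~i';

// PREG_SET_ORDER groups the captures per link: [full match, url, text].
preg_match_all($pattern, $html, $links, PREG_SET_ORDER);

foreach ($links as $link) {
    echo $link[1], ' => ', $link[2], PHP_EOL;
}
```

Using `~` as the delimiter avoids having to escape the `/` in `</a>`.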

Only two things: you don't need to escape `"` and I would run that in case insensitive mode.

Alix Axel 2010-01-21 18:07:29

Also fails (twice!) for URLs where there is only an `href` attribute; check my answer.

Alix Axel 2010-01-21 18:11:29

@Alix: Fixed; thanks.

SLaks 2010-01-21 18:15:02

Answer 2

+7 A:

I understand completely where you're coming from on the pragmatism scale.

However, PHP does have a very nice, straightforward HTML parser, and it seems simple enough to get working that I'd hesitate not to recommend it.

Brian Agnew 2010-01-21 17:58:59
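
The parser this answer had in mind isn't named in the text, so the following is not necessarily that one. As a rough sketch of the parser approach in general, PHP's bundled DOM extension can pull out the same two values. The sample string is assumed, not taken from a real feed.

```php
<?php
// Hypothetical sample markup; the real feed description may differ.
$html = 'STUFF HERE<a href="somewhere" title="something" target="_blank">name of thing</a>STUFF HERE';

// Fragments and sloppy markup trigger libxml warnings; collect them instead of printing.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Collect [href, text] pairs for every <a> element in the document.
$found = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    $found[] = [$a->getAttribute('href'), $a->textContent];
}

foreach ($found as [$href, $text]) {
    echo $href, ' => ', $text, PHP_EOL;
}
```

Unlike the regex versions, this does not care about attribute order or which other attributes are present.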

Ooh that's neat. I'll try that and let you know how it goes. Many thanks.

Jack Shepherd 2010-01-21 18:07:44

Didn't go with it in the end, but it's a very interesting parser that I'm sure I'll use in the future. Thanks.

Jack Shepherd 2010-01-21 18:28:15

Why not use the native parser? http://www.php.net/manual/en/book.simplexml.php

troelskn 2010-01-21 21:02:01

Answer 3

+1 A:

$str = 'STUFF HERE<a href="somewhere" title="something" target="_blank">name of thing</a>STUFF HERE';
$success = preg_match('/.*href=\"([^\"]+)\".*>([^<]+)<.*/i', $str, $matches);
if ($success) {
    echo $matches[1];
    echo $matches[2];
} else {
    echo "Parsing failed.";
}

The parenthesized groups capture portions of the match into the $matches array. If the pattern matches the string at all, $matches[1] contains your href and $matches[2] contains your link text.

Inside the parentheses, I'm defining the segments you're interested in with negated character classes. The first is [^\"]+, which is one or more of any character except a double quote. The second is [^<]+, which is one or more of any character except a less-than sign. This ensures that, if the markup is consistently in the format you provided, you have well-defined boundaries on either side of the portions you're interested in.

Eric Kolb 2010-01-21 18:05:26
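
To illustrate why the "any character except a double quote" class gives a well-defined boundary, here is a small sketch comparing it with a plain greedy `.+`. The sample string and variable names are assumptions for the demo.

```php
<?php
// Made-up sample with two quoted attributes in the same tag.
$str = '<a href="somewhere" title="something">name of thing</a>';

// Greedy .+ runs on to the LAST double quote that still lets the pattern match.
preg_match('/href="(.+)"/', $str, $greedy);

// [^"]+ cannot cross a double quote, so it stops at the end of the href value.
preg_match('/href="([^"]+)"/', $str, $bounded);

echo $greedy[1], PHP_EOL;   // somewhere" title="something
echo $bounded[1], PHP_EOL;  // somewhere
```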

Just trying this now, I get: preg_match() [function.preg-match]: Unknown modifier '*'. Is this my mistake?

Jack Shepherd 2010-01-21 18:16:12

Nope, my bad. My PHP syntax is a little rusty since it's not my primary language these days. I've been using a language where regex patterns don't need the / / delimiters. I've edited my post. Hopefully that's more accommodating.

Eric Kolb 2010-01-21 18:23:26

Perfect! Thank you.

Jack Shepherd 2010-01-21 18:29:04

Answer 4

A:

SLaks's regex may have some problems with links that have no attributes other than href; here is my take:

~<a.+?href="(.+?)".*?>(.+?)</a>~i

Alix Axel 2010-01-21 18:09:40

isn't `.+?` equivalent to `.*` ?

Javier 2010-01-21 18:19:53

@Javier: Not at all!

Alix Axel 2010-01-21 18:21:53

@Javier: No, it isn't. `.*?` and `.+?` are lazy and will match as little as possible; `.*` and `.+` are greedy and will match as much as possible.

SLaks 2010-01-21 18:23:14

ah, i see. in other regexp dialects `*` means '0 or more, greedy', `+` means '1 or more, greedy', and `?` means '0 or 1, greedy'; making `.+?` equivalent to `.*`, if at all valid.

Javier 2010-01-21 18:34:53

@Javier: I highly doubt that, `.+?` reads: **match any character 1 or more times, as few times as possible** while `.*` reads: **match any character 0 or more times (as many times as possible)**.

Alix Axel 2010-01-21 18:39:31
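
The lazy versus greedy distinction discussed in these comments is easy to check in PHP's PCRE dialect. A minimal sketch, with a made-up input string:

```php
<?php
$str = '<b>one</b><b>two</b>';

// Greedy: .+ grabs as much as it can, so it runs to the last </b>.
preg_match('~<b>(.+)</b>~', $str, $greedy);

// Lazy: .+? grabs as little as it can, so it stops at the first </b>.
preg_match('~<b>(.+?)</b>~', $str, $lazy);

// .+? still needs at least one character, so unlike .* it cannot match an empty element.
$plusLazy   = preg_match('~<b>(.+?)</b>~', '<b></b>');
$starGreedy = preg_match('~<b>(.*)</b>~', '<b></b>');

echo $greedy[1], PHP_EOL;         // one</b><b>two
echo $lazy[1], PHP_EOL;           // one
var_dump($plusLazy, $starGreedy); // int(0) then int(1)
```

So `.+?` and `.*` differ both in how much they consume and in whether they accept an empty match.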

Answer 5

A:

I've tested with my own Facebook feed and could load it with SimpleXML. Well, partly. The RSS feed cannot be loaded directly, but if you fetch the Feed with MagPie first, you can then load the description element with SimpleXml like this:

$xml = simplexml_load_string($description); // load description
$link = $xml->xpath('//a');                 // find all links inside
$href = (string) $link[0]['href'];          // get URL
$text = (string) $link[0];                  // and link text

As long as Facebook does not break the HTML inside the description, it is safe to use SimpleXml. If they break it, SimpleXml will complain.

Gordon 2010-01-21 18:44:50

That would have been a much quicker way around, damn. In the end I settled on something totally different:

function get($a, $b, $c) {
    // Gets a string between 2 strings
    $y = explode($b, $a);
    $x = explode($c, $y[1]);
    return $x[0];
}

Then I just ran that on the description, separating the links and the titles from the rest of the FB stuff. Ugly, but it works. I think I was barking up the wrong tree with regex; I should think more about my question next time...

Jack Shepherd 2010-01-21 19:09:09
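
The explode-based helper in the comment above reads `$y[1]` without checking it, so it hits an undefined index when the first delimiter is missing. As a sketch of the same "string between two strings" idea with that case handled, here is a variant built on `strpos` and `substr` instead of `explode`. The name `between` and the choice to return null are assumptions of this sketch, not something from the thread.

```php
<?php
// Hypothetical helper: returns the text between $start and $end,
// or null when either delimiter is not found.
function between(string $haystack, string $start, string $end): ?string
{
    $from = strpos($haystack, $start);
    if ($from === false) {
        return null;
    }
    $from += strlen($start);           // move past the opening delimiter

    $to = strpos($haystack, $end, $from);
    if ($to === false) {
        return null;
    }
    return substr($haystack, $from, $to - $from);
}

$str = 'STUFF HERE<a href="somewhere" title="something" target="_blank">name of thing</a>STUFF HERE';

echo between($str, 'href="', '"'), PHP_EOL;  // somewhere
echo between($str, '">', '</a>'), PHP_EOL;   // name of thing
var_dump(between($str, 'src="', '"'));       // NULL
```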
