ansaurus

Question

PHP RegEx (or Alt Method) for Anchor tags

Answer 1

+3 A:

PHP has a strip_tags() function.

Alternatively you can use filter_var() with FILTER_SANITIZE_STRING.

Whatever you do, don't parse HTML/XML with regular expressions. It's really error-prone and flaky. PHP ships with at least three different parsers as standard (SimpleXML, DOMDocument and XMLReader spring to mind).
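A minimal sketch of the strip_tags() route, assuming the goal is just to drop the anchor tags and keep their text. The sample markup in $html is made up for illustration. Note that FILTER_SANITIZE_STRING has been deprecated since PHP 8.1, so the sketch sticks to strip_tags().

```php
<?php
// Hypothetical sample input, not taken from the question.
$html = '<p><a href="http://example.com/">Mary</a> had a little <a href="#">lamb</a></p>';

// strip_tags() removes every tag and keeps the text between them.
$plain = strip_tags($html);
echo $plain, "\n"; // Mary had a little lamb

// The optional second argument lists tags to keep, here the anchors.
$anchorsOnly = strip_tags($html, '<a>');
echo $anchorsOnly, "\n";
```

The second call is handy for the opposite case, where the links should survive and everything else should go.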

cletus 2009-09-09 13:47:41

I'm using DOMDocument as well but how do I strip the tags?

Phill Pafford 2009-09-09 13:58:09

strip_tags() or filter_var() if you just want to remove tags. The others if you want to parse them in some way.

cletus 2009-09-09 13:59:34

Just what I wanted, thnx

Phill Pafford 2009-09-09 19:28:49

Answer 2

A:

Use SimpleXML and XPath to retrieve the desired nodes.
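Something along these lines might work, assuming the input is well-formed XML. The $xml sample and the $links array are made up for illustration.

```php
<?php
// Hypothetical well-formed fragment containing two anchors.
$xml = '<fragment><a href="http://example.com/mary">Mary</a> had a little '
     . '<a href="http://example.com/lamb">lamb</a></fragment>';

$root = simplexml_load_string($xml);
if ($root === false) {
    exit("Could not parse the XML\n");
}

// xpath() returns an array of SimpleXMLElement objects, one per <a> element.
$links = [];
foreach ($root->xpath('//a') as $a) {
    $links[] = [
        'href' => (string) $a['href'], // attribute access via array syntax
        'text' => (string) $a,         // text content of the element
    ];
}

print_r($links);
```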

w35l3y 2009-09-09 13:50:05

Answer 3

A:

NawaMan 2009-09-09 13:54:30

I'm using DOMDocument as well but how do I strip the tags?

Phill Pafford 2009-09-09 13:58:49

Please see my edit.

NawaMan 2009-09-09 14:19:26

Answer 4

A:

If you don't have some kind of request<->class mapping, you can extract the information with the DOM extension. The textContent property contains all the text of the context node and its descendants (for an element node, nodeValue returns the same string).

$sr = '<?xml version="1.0"?>
<SOAP:Envelope xmlns:SOAP="urn:schemas-xmlsoap-org:soap.v1">
  <SOAP:Body>
    <foo:bar xmlns:foo="urn:yaddayadda">
       <fragment>
         <a href="....">Mary</a> had a
         little <a href="....">lamb</a>
       </fragment>
    </foo:bar>
  </SOAP:Body>
</SOAP:Envelope>';

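$doc = new DOMDocument;
$doc->loadXML($sr);

// <fragment> has no prefix and no default namespace, so a plain name test matches it.
$xpath = new DOMXPath($doc);
$ns = $xpath->query('//fragment');
if ( 0 < $ns->length ) {
  // For an element, nodeValue is the concatenated text of all its descendants.
  echo $ns->item(0)->nodeValue;
}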

prints

Mary had a
little lamb
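The extracted text keeps the line breaks and indentation of the source XML. If a single clean line is wanted, a sketch like the following could collapse the whitespace. The shortened $xml sample and the variable names $raw and $clean are assumptions for illustration.

```php
<?php
// Hypothetical fragment with the same line breaks and indentation as above.
$xml = "<fragment>\n  <a href=\"#\">Mary</a> had a\n  little <a href=\"#\">lamb</a>\n</fragment>";

$doc = new DOMDocument();
$doc->loadXML($xml);
$node = $doc->getElementsByTagName('fragment')->item(0);

// textContent returns the text of all descendants, whitespace included.
$raw = $node->textContent;

// Trim the ends and collapse every run of whitespace to a single space.
$clean = preg_replace('/\s+/', ' ', trim($raw));
echo $clean, "\n"; // Mary had a little lamb
```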

VolkerK 2009-09-09 13:59:55

Answer 5

A:

I agree with cletus: using RegEx on HTML is bad practice because of how loose HTML is as a language (and I moan about PHP being too loose...). There are so many ways to vary a tag that, unless you know the document is standards-compliant / strict, it is sometimes just impossible to do. However, because I like a challenge that distracts me from work, here's how you might do it in RegEx!

I'll build this up in steps; there's no point if all you see is the final string and say, "Meh... It'll do..."! First we have the bare RegEx for an anchor tag:

'#<a></a>#'

Then we add in the text that could be between the tags. We want to group this in parentheses so we can extract the string, and the question mark makes the asterisk wildcard "un-greedy", meaning that the first </a> it comes across will be the one it uses to end the match.

'#<a>(.*?)</a>#'
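To see what the question mark changes, here is a small comparison of the greedy and un-greedy forms. The sample string and the variable names $greedy and $lazy are made up for illustration.

```php
<?php
// Hypothetical input with two bare anchor tags.
$str = '<a>one</a> and <a>two</a>';

// Greedy: .* runs to the LAST </a> in the string.
preg_match('#<a>(.*)</a>#', $str, $greedy);

// Un-greedy: .*? stops at the FIRST </a> it can.
preg_match('#<a>(.*?)</a>#', $str, $lazy);

echo $greedy[1], "\n"; // one</a> and <a>two
echo $lazy[1], "\n";   // one
```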

Next we add in the RegEx for href="". We match the href=" as plain text, then an any-length string that does not contain a quotation mark, then the ending quotation mark.

'#<a href\="([^"]*)">(.*?)</a>#'

Now we just need to say that the tag is allowed other attributes. According to the specification, an attribute name can contain the following characters: [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*. Allowing any number of such attributes, each with a double-quoted value, we get: ( [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*\="[^"]*")*.
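The attribute sub-pattern can be tried on its own before it goes into the full expression. This is only a sketch: $attr, $tags and $results are hypothetical names, and the sample tags are made up. It also shows one of the limits of the approach, since an unquoted attribute value is not matched.

```php
<?php
// The attribute sub-pattern from above: zero or more name="value" pairs,
// each preceded by exactly one space.
$attr = '( [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*\="[^"]*")*';

// Hypothetical opening tags to test against.
$tags = ['<a>', '<a class="x">', '<a class="x" data-id="7">', '<a class=x>'];

$results = [];
foreach ($tags as $tag) {
    // Anchor with ^ and $ so the whole tag has to fit the pattern.
    $results[$tag] = preg_match('#^<a' . $attr . '>$#', $tag) === 1;
    echo $tag, ' => ', ($results[$tag] ? 'match' : 'no match'), "\n";
}
```

Only the last tag fails, because its value has no double quotes around it.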

The resulting RegEx (PCRE) is as follows:

'#<a( [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*\="[^"]*")* href\="([^"]*)"( [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*\="[^"]*")*>(.*?)</a>#'

Now, in PHP, use the preg_match_all() function to grab all occurrences in the string.

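$regex = '#<a( [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*\="[^"]*")* href\="([^"]*)"( [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*\="[^"]*")*>(.*?)</a>#';
// PREG_SET_ORDER groups the results per match, so each $link is one anchor tag.
// (With the default PREG_PATTERN_ORDER, $result[2] would hold all hrefs instead.)
preg_match_all($regex, $str_containing_anchors, $result, PREG_SET_ORDER);
foreach($result as $link)
 {
  $href = $link[2]; // value of the href attribute
  $text = $link[4]; // text between the opening and closing tag
 }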

mynameiszanders 2009-09-09 14:10:06
