tags:

views:

723

answers:

5

Ok I have to parse out a SOAP request and in the request some of the values are passed with (or inside) a Anchor tag. Looking for a RegEx (or alt method) to strip the tag and just return the value.

// But item needs to be a RegEx of some sort, it's a field right now
if($sObject->list == 'item') {
   // Split on > this should be the end of the right side of the anchor tag
   $pieces = explode(">", $sObject->fields->$field);

   // Split on < this should be the closing anchor tag
   $piece = explode("<", $pieces[1]);

   $fields_string .= $piece[0] . "\n";
}

item is a field name but I would like to make this a RegEx to check for the Anchor tag instead of a specific field.

+3  A: 

PHP has a strip_tags() function.

Alternatively you can use filter_var() with FILTER_SANITIZE_STRING.

Whatever you do don't parse HTML/XML with regular expressions. It's really error-prone and flaky. PHP has at least 3 different parsers as standard (SimpleXML, DOMDocument and XMLReader spring to mind).

cletus
I'm using DOMDocument as well but how do I strip the tags?
Phill Pafford
strip_tags() or filter_var() if you just want to remove tags. The others if you want to parse them in some way.
cletus
Just what I wanted, thnx
Phill Pafford
A: 

use simplexml and xpath to retrieve the desired nodes

w35l3y
A: 
NawaMan
I'm using DOMDocument as well but how do I strip the tags?
Phill Pafford
Please see my edit.
NawaMan
A: 

If you don't have some kind of request<->class mapping you can extract the information with the DOM extension. The property textConent contains all the text of the context node and its descendants.

$sr = '<?xml version="1.0"?>
<SOAP:Envelope xmlns:SOAP="urn:schemas-xmlsoap-org:soap.v1">
  <SOAP:Body>
    <foo:bar xmlns:foo="urn:yaddayadda">
       <fragment>
         <a href="....">Mary</a> had a
         little <a href="....">lamb</a>
       </fragment>
    </foo:bar>
  </SOAP:Body>
</SOAP:Envelope>';

$doc = new DOMDocument;
$doc->loadxml($sr);

$xpath = new DOMXPath($doc);
$ns = $xpath->query('//fragment');
if ( 0 < $ns->length ) {
  echo $ns->item(0)->nodeValue;
}

prints

Mary had a
little lamb
VolkerK
A: 

I agree with cletus, using RegEx on HTML is bad practice because of how loose HTML is as a language (and I moan about PHP being too loose...). There are just so many ways you can variate a tag that unless you know that the document is standards-compliant / strict, it is sometimes just impossible to do. However, because I like a challenge that distracts me from work, here's how you might do it in RegEx!

I'll split this up into sections, no point if all you see is a string and say, "Meh... It'll do..."! First we have the main RegEx for an anchor tag:

'#<a></a>#'

Then we add in the text that could be between the tags. We want to group this is parenthesis, so we can extract the string, and the question mark makes the asterix wildcard "un-greedy", meaning that the first </a> that it comes accross will be the one it uses to end the RegEx.

'#<a>(.*?)</a>#'

Next we add in the RegEx for href="". We match the href=" as plain text, then an any-length string that does not contain a quotation mark, then the ending quotation mark.

'#<a href\="([^"]*)">(.*?)</a>#'

Now we just need to say that the tag is allowed other attributes. According to the specification, an attribute can contain the following characters: [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*. Allow an attribute multiple times, and with a value, we get: ( [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*\="[^"]*")*.

The resulting RegEx (PCRE) is as following:

'#<a( [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*\="[^"]*")* href\="([^"]*)"( [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*\="[^"]*")*>(.*?)</a>#'

Now, in PHP, use the preg_match_all() function to grab all occurances in the string.

$regex = '#<a( [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*\="[^"]*")* href\="([^"]*)"( [a-zA-Z_\:][a-zA-Z0-9_\:\.-]*\="[^"]*")*>(.*?)</a>#';
preg_match_all($regex, $str_containing_anchors, $result);
foreach($result as $link)
 {
  $href = $link[2];
  $text = $link[4];
 }
mynameiszanders