You shouldn’t do that with regular expressions, at least not with regular expressions alone. Use a proper HTML DOM parser such as the one in PHP’s DOM library instead. You can then iterate over the nodes, check whether each one is a text node, run the regular expression search on its content, and replace the text node accordingly.
Something like this should do it:
    $pattern = "~((?:http|https|ftp)://(?:\S*?\.\S*?))(?=\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)~i";
    $doc = new DOMDocument();
    $doc->loadHTML($str);
    // for every element in the document
    foreach ($doc->getElementsByTagName('*') as $elem) {
        // for every child node in each element
        foreach ($elem->childNodes as $node) {
            if ($node->nodeType === XML_TEXT_NODE) {
                // split the text content to get an array of 1+2*n elements for n URLs in it
                $parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
                $n = count($parts);
                if ($n > 1) {
                    $parentNode = $node->parentNode;
                    // insert for each pair of non-URL/URL parts one DOMText and DOMElement node before the original DOMText node
                    for ($i=1; $i<$n; $i+=2) {
                        $a = $doc->createElement('a');
                        $a->setAttribute('href', $parts[$i]);
                        $a->setAttribute('target', '_blank');
                        $a->appendChild($doc->createTextNode($parts[$i]));
                        $parentNode->insertBefore($doc->createTextNode($parts[$i-1]), $node);
                        $parentNode->insertBefore($a, $node);
                    }
                    // insert the last part before the original DOMText node
                    $parentNode->insertBefore($doc->createTextNode($parts[$i-1]), $node);
                    // remove the original DOMText node
                    $node->parentNode->removeChild($node);
                }
            }
        }
    }
However, the code above has a flaw: the DOMNodeLists returned by getElementsByTagName and childNodes are live, so every change to the DOM is reflected in those lists. You therefore cannot use foreach here, because it would also iterate over the newly added nodes. Instead, you need for loops that keep track of the nodes added and adjust the index pointers, and ideally the pre-calculated loop bounds, accordingly.
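To make the "live list" behavior concrete, here is a minimal sketch. The markup and the variable names are made up for illustration; it only shows that both kinds of list change length after a node is added.

```php
<?php
// Minimal sketch: a DOMNodeList reflects later changes to the DOM.
$doc = new DOMDocument();
$doc->loadHTML('<div><p>one</p></div>'); // illustrative markup

$div = $doc->getElementsByTagName('div')->item(0);
$paragraphs = $doc->getElementsByTagName('p'); // live list
$children = $div->childNodes;                  // live list as well

// Lengths before the DOM is modified.
$before = [$paragraphs->length, $children->length];

// Add a second <p> after both lists were obtained.
$div->appendChild($doc->createElement('p', 'two'));

// The same list objects now report the new node too.
$after = [$paragraphs->length, $children->length];

printf("before: %d/%d, after: %d/%d\n", $before[0], $before[1], $after[0], $after[1]);
```

A loop whose bound was computed before the insertion would therefore stop too early, and a loop that re-reads the list would also visit the nodes it just created.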
Because that bookkeeping is quite difficult in an algorithm of this complexity (you would need one index pointer and one loop bound for each of the three for loops), a recursive algorithm is more convenient:
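    // Applies $callback to every text node below $node. The return value is the
    // number of sibling nodes added in place of $node (always 0 for elements).
    function mapOntoTextNodes(DOMNode $node, $callback) {
        if ($node->nodeType === XML_TEXT_NODE) {
            return $callback($node);
        }
        // childNodes is live, so the loop bound and index are adjusted manually
        for ($i=0, $n=$node->childNodes->length; $i<$n; ++$i) {
            $nodesChanged = 0;
            switch ($node->childNodes->item($i)->nodeType) {
                case XML_ELEMENT_NODE:
                    $nodesChanged = mapOntoTextNodes($node->childNodes->item($i), $callback);
                    break;
                case XML_TEXT_NODE:
                    $nodesChanged = $callback($node->childNodes->item($i));
                    break;
            }
            if ($nodesChanged !== 0) {
                // extend the bound and skip the nodes the callback inserted
                $n += $nodesChanged;
                $i += $nodesChanged;
            }
        }
        // changes inside an element never alter its parent's child count
        return 0;
    }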
    // Replaces every plain URL in the text node with an A element and returns
    // the net number of nodes added to the parent.
    function foo(DOMText $node) {
        $pattern = "~((?:http|https|ftp)://(?:\S*?\.\S*?))(?=\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)~i";
        // split into 1+2*n parts for n URLs: text, URL, text, URL, ..., text
        $parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        $n = count($parts);
        if ($n > 1) {
            $parentNode = $node->parentNode;
            $doc = $node->ownerDocument;
            // insert one DOMText and one A element per non-URL/URL pair
            for ($i=1; $i<$n; $i+=2) {
                $a = $doc->createElement('a');
                $a->setAttribute('href', $parts[$i]);
                $a->setAttribute('target', '_blank');
                $a->appendChild($doc->createTextNode($parts[$i]));
                $parentNode->insertBefore($doc->createTextNode($parts[$i-1]), $node);
                $parentNode->insertBefore($a, $node);
            }
            // insert the trailing text part, then remove the original node
            $parentNode->insertBefore($doc->createTextNode($parts[$i-1]), $node);
            $parentNode->removeChild($node);
        }
        return $n-1;
    }
    $str = '<div>sometext http://www.somedomain.com/index.html sometext <img src="http://domain.com/image.jpg"> sometext sometext</div>';
    $doc = new DOMDocument();
    $doc->loadHTML($str);
    $elems = $doc->getElementsByTagName('body');
    mapOntoTextNodes($elems->item(0), 'foo');
Here mapOntoTextNodes is used to map a given callback function onto every DOMText node in a DOM document. You can pass either the whole DOMDocument node or just a specific DOMNode (in this case just the BODY node).
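As a possible alternative to the index bookkeeping, the text nodes can be collected into a plain array first, so later DOM changes cannot disturb the loop. The sketch below is not part of the approach above: it swaps in DOMXPath, and the helper name mapOntoTextNodesSnapshot, the sample markup, and the upper-casing callback are all made up for illustration.

```php
<?php
// Hypothetical helper: apply $callback to every text node below $root,
// iterating over a snapshot array instead of a live DOMNodeList.
function mapOntoTextNodesSnapshot(DOMNode $root, callable $callback): int
{
    // A DOMDocument has no ownerDocument, so handle that case explicitly.
    $doc = $root instanceof DOMDocument ? $root : $root->ownerDocument;
    $xpath = new DOMXPath($doc);
    // Copy the query result into an array before touching the DOM.
    $textNodes = iterator_to_array($xpath->query('.//text()', $root));
    foreach ($textNodes as $textNode) {
        $callback($textNode); // any return value is ignored here
    }
    return count($textNodes);
}

$doc = new DOMDocument();
$doc->loadHTML('<div>a <b>b</b> c</div>'); // illustrative markup
$body = $doc->getElementsByTagName('body')->item(0);

// Example callback: upper-case each text node in place.
$visited = mapOntoTextNodesSnapshot($body, function (DOMText $node) {
    $node->nodeValue = strtoupper($node->nodeValue);
});

echo $visited, "\n";
echo $doc->saveHTML($body), "\n";
```

A callback that inserts siblings and removes the original node should also work with this variant, since the array still holds the original nodes, but that is an assumption worth testing against your own markup.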
The function foo then finds and replaces the plain URLs in a DOMText node’s content. It splits the content string into non-URL and URL parts using preg_split with PREG_SPLIT_DELIM_CAPTURE, which also captures the delimiters and yields an array of 1+2·n items for n URLs. The non-URL parts become new DOMText nodes and the URL parts become new A elements; all of them are inserted before the original DOMText node, which is removed at the end. Since mapOntoTextNodes walks the tree recursively, it suffices to call it once on a specific DOMNode.
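The 1+2·n layout of the split result is easiest to see in isolation. The following sketch uses a deliberately simplified pattern and a made-up input string, not the pattern from foo, purely to show where the text parts and the URL parts end up.

```php
<?php
// Simplified pattern for illustration only; the pattern used in foo is stricter.
$pattern = '~(https?://\S+)~i';
$text = 'see http://a.example/x and https://b.example/y now';

// PREG_SPLIT_DELIM_CAPTURE keeps the captured delimiters (the URLs) in the result.
$parts = preg_split($pattern, $text, -1, PREG_SPLIT_DELIM_CAPTURE);

// Even indexes hold plain text, odd indexes hold the captured URLs.
foreach ($parts as $index => $part) {
    printf("%d (%s): '%s'\n", $index, $index % 2 ? 'URL' : 'text', $part);
}
```

With two URLs in the input, the array has five items, which is why foo walks the odd indexes in steps of two and handles the final text part separately.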