I am looking for the best br2nl function: one that replaces all instances of <br> and <br /> with newlines (\n), i.e. the inverse of the nl2br function.

I know there are several solutions in the PHP manual comments but I'm looking for feedback from the SO community on possible solutions.

+9  A: 

I would generally say "don't use regex to work with HTML", but in this case I would probably go with a regex, considering that <br> tags generally look like either:

  • <br>
  • or <br/>, with any number of spaces before the /


I suppose something like this would do the trick:

$html = 'this <br>is<br/>some<br />text <br    />!';
$nl = preg_replace('#<br\s*/?>#i', "\n", $html);
echo $nl;

A couple of notes on the pattern:

  • it starts with <br
  • followed by any number of whitespace characters: \s*
  • optionally, a /: /?
  • and, finally, a >
  • and the match is case-insensitive (the i modifier), as <BR> would be valid in HTML
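
As a minimal sketch, the pattern above could be wrapped in a reusable function. The name br2nl and the type declarations (which assume PHP 7 or later) are illustrative choices, not part of the original snippet.

```php
<?php
// Hypothetical helper: replaces <br>, <br/>, <br /> (any case) with "\n".
function br2nl(string $html): string
{
    // '#' is used as the delimiter so the '/' in the pattern needs no escaping.
    return preg_replace('#<br\s*/?>#i', "\n", $html);
}

echo br2nl('this <br>is<br/>some<br />text <br    />!');
```

Note that the replacement is written in double quotes so that "\n" is a real newline character rather than a backslash followed by n.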
Pascal MARTIN
That's a great explanation of the regex.
Echo
+1 for breaking down the regex.
markb
To be very nit-picky =] : `<input type="text" value="<br />">` is allowed in html (not xhtml). And in a CDATA section `<br />` is "normal" text.
VolkerK
@VolkerK: humph, true :-) I was writing this using DOM, and when I finished, I saw you had posted the same kind of solution I would have proposed *(except I used getElementsByTagName, not XPath)*, so I didn't post it. Maybe I should edit my answer, though, for the sake of completeness, as it's been accepted...
Pascal MARTIN
@Pascal: But this solution is faster and uses less memory (if that matters). If you don't have _completely_ arbitrary documents I'd probably consider these edge cases acceptable.
VolkerK
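
The edge case raised in the comments above can be reproduced with a short sketch. The variable names are illustrative; the pattern is the one from the answer.

```php
<?php
// An attribute value that happens to contain the text "<br />".
$html = '<input type="text" value="<br />">';

// The regex cannot tell markup from attribute content, so it rewrites the value too.
$out = preg_replace('#<br\s*/?>#i', "\n", $html);

var_dump($out === "<input type=\"text\" value=\"\n\">"); // the attribute value is now a newline
```

Whether this matters depends on the input: for user comments or simple formatted text it is unlikely to occur, while for arbitrary documents a parser-based approach avoids it.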
+1  A: 

If the document is well-formed (or at least well-formed-ish) you can use the DOM extension and XPath to find all br elements and replace each with a \n text node.

$in = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html><head><title>...</title></head><body>abc<br />def<p>ghi<br />jkl</p></body></html>';

$doc = new DOMDocument;
$doc->loadHTML($in);
$xpath = new DOMXPath($doc);

$toBeReplaced = array();
foreach($xpath->query('//br') as $node) {
    $toBeReplaced[] = $node;
}

$linebreak = $doc->createTextNode("\n");
foreach($toBeReplaced as $node) {
    $node->parentNode->replaceChild($linebreak->cloneNode(), $node);
}

echo $doc->saveHTML();

prints

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head><title>...</title></head>
<body>abc
def<p>ghi
jkl</p>
</body>
</html>

edit: shorter version with only one iteration

$in = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html><head><title>...</title></head><body>abc<br />def<p>ghi<br />jkl</p></body></html>';

$doc = new DOMDocument;
$doc->loadHTML($in);
$xpath = new DOMXPath($doc);

$linebreak = $doc->createTextNode("\n");
foreach($xpath->query('//br') as $node) {
  // replace (not just remove) each br, so the newline actually ends up in the output
  $node->parentNode->replaceChild($linebreak->cloneNode(), $node);
}

echo $doc->saveHTML();
VolkerK
You don’t need to do two rounds. You can replace the nodes with the first `foreach`.
Gumbo
That seems to be so ;-) For some (unknown) reason I remembered it to break the xpath iterator.
VolkerK
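
For comparison, here is a rough sketch of the getElementsByTagName variant mentioned in the comments on the accepted answer. The function name br2nlDom is hypothetical, and the use of libxml_use_internal_errors to silence parser warnings on imperfect markup is an added assumption. Because getElementsByTagName returns a live list, the nodes are copied into an array before any of them are replaced.

```php
<?php
// Hypothetical helper: DOM-based br2nl without XPath.
function br2nlDom(string $html): string
{
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The node list is live, so snapshot it before modifying the tree.
    $brs = iterator_to_array($doc->getElementsByTagName('br'));
    foreach ($brs as $br) {
        $br->parentNode->replaceChild($doc->createTextNode("\n"), $br);
    }

    return $doc->saveHTML();
}

echo br2nlDom('<html><body>abc<br />def<p>ghi<br>jkl</p></body></html>');
```

As with the XPath version, the result is a whole serialized document, so this fits best when the input is a full page rather than a fragment.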
A: 

From the nl2br comments:

<?php
function br2nl($string){
  $return=eregi_replace('<br[[:space:]]*/?'.
    '[[:space:]]*>',chr(13).chr(10),$string);
  return $return;
}
?> 
ssergei
The POSIX regular expression module has been deprecated. From the ereg_replace manual page: "This function has been DEPRECATED as of PHP 5.3.0 and REMOVED as of PHP 6.0.0. Relying on this feature is highly discouraged."
VolkerK
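
Given that deprecation, a PCRE-based equivalent of the function above might look like the following sketch. The name br2nlCrlf is hypothetical; the behaviour is intended to mirror the eregi_replace version, including the optional whitespace after the slash and the "\r\n" replacement (chr(13).chr(10)).

```php
<?php
// Hypothetical PCRE port of the eregi_replace-based br2nl.
function br2nlCrlf(string $string): string
{
    // \s* on both sides of the optional slash mirrors the two [[:space:]]* groups;
    // the i modifier replaces eregi's case-insensitivity.
    return preg_replace('#<br\s*/?\s*>#i', "\r\n", $string);
}

echo br2nlCrlf('one<br>two<BR / >three');
```

If plain "\n" line endings are preferred, only the replacement string needs to change.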