Let's say I have a bunch of HTML like below:

bla bla bla long paragraph here
<br/>
<br/>
bla bla bla more paragraph text
<br/>
<br/>

Is there an easy way with JavaScript to convert it to properly semantic <p> tags? E.g.:

<p>
  bla bla bla long paragraph here
</p>
<p>
  bla bla bla more paragraph text
</p>

Output spacing is not important; ideally it will work with any input spacing.

I'm thinking I might try to cook up a regex, but before I do that I wanted to make sure that a) I was avoiding a world of hurt and b) there wasn't something else out there. I tried a Google search but haven't come up with anything yet.

Thanks for any advice!

A: 

I'd do it in several stages:

  1. RegExp: Convert all br-tags to line-breaks.
  2. RegExp: Strip out all the white-space.
  3. RegExp: Convert the multiple line-breaks to single ones.
  4. Use String.split('\n') on the result.

That should give an array with all the 'real' paragraphs (in theory.) Then you can just iterate through it and wrap each line in p-tags.
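
The stages above could be sketched roughly as follows. This is only a minimal sketch: the function name brsToParagraphs is made up for illustration, and step 2 is read as "collapse the other white-space" rather than deleting all of it, which is an assumption about the intent.

```javascript
// Hypothetical helper; a sketch of the four stages, not a hardened routine.
function brsToParagraphs(html) {
  var lines = html
    .replace(/<br\s*\/?>/gi, "\n")   // 1. every <br> becomes a line break
    .replace(/[^\S\n]+/g, " ")       // 2. collapse other white-space to one space (assumed meaning)
    .replace(/\n(?: ?\n)+/g, "\n")   // 3. collapse repeated line breaks into one
    .split("\n");                    // 4. split the string into candidate paragraphs

  var out = [];
  for (var i = 0; i < lines.length; i++) {
    var text = lines[i].replace(/^\s+|\s+$/g, ""); // trim each line
    if (text) {                                    // skip empty leftovers
      out.push("<p>" + text + "</p>");
    }
  }
  return out.join("\n");
}
```

Note that this treats every line break in the source and every single <br> as a paragraph boundary, so it only suits input shaped like the sample in the question.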

brownstone
That could cause problems, since line breaks in your HTML source are insignificant. Step 3 might create a series of unwanted paragraphs.
nickf
Sometimes you actually want a single <br> element to remain; your step 1 will remove it.
Jason Berry
@nickf: Step 3 converts the multiple line-breaks to single ones, so I don't quite know what you mean. @Jason: True enough. The node-based solution posted above is much more versatile.
brownstone
+2  A: 

Scan each of the child elements + text of the enclosing element. Each time you encounter a "br" element, create a "p" element, and append all pending stuff to it. Lather, rinse, repeat.

Don't forget to remove the stuff which you are relocating to a new "p" element.

I have found this library (prototype.js) to be useful for this sort of thing.
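
To show the grouping logic on its own, here is a small sketch that works on a plain array standing in for the child nodes. The object shape ({tag: ...} for elements, {text: ...} for text) and the name groupIntoParagraphs are assumptions made for illustration; in a browser you would walk the real child nodes and move them into newly created "p" elements instead.

```javascript
// Hypothetical model: children is an array like
//   [{text: "one "}, {tag: "b"}, {tag: "br"}, {text: "two"}]
// Returns an array of paragraphs, each an array of the items it should contain.
function groupIntoParagraphs(children) {
  var paragraphs = [];
  var pending = [];

  function flush() {
    if (pending.length) {
      paragraphs.push(pending); // the pending items move into a new paragraph
      pending = [];             // ...and are no longer pending
    }
  }

  for (var i = 0; i < children.length; i++) {
    var node = children[i];
    var isBlankText = typeof node.text === "string" && !/\S/.test(node.text);
    if (node.tag === "br") {
      flush();                  // a br closes whatever has been collected so far
    } else if (!isBlankText) {
      pending.push(node);       // keep elements and non-blank text
    }
  }
  flush();                      // trailing content with no br after it
  return paragraphs;
}
```

Because an empty pending list is never flushed, two "br" elements in a row do not produce an empty paragraph.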

Roboprog
+3  A: 

I got bored. I'm sure there are optimizations / tweaks needed. Uses a little bit of jQuery to do its magic. Worked in FF3. And the answer to your question is that there isn't a very "simple" way :)

$(function() {
  $.fn.pmaker = function() {
    var brs = 0;
    var nodes = [];

    function makeP()
    {
      // only bother doing this if we have nodes to stick into a P
      if (nodes.length) {
        var p = $("<p/>");
        p.insertBefore(nodes[0]);  // insert a new P before the content
        p.append(nodes); // add the children        
        nodes = [];
      }
      brs=0;
    }

    this.contents().each(function() {    
      if (this.nodeType == 3) // text node 
      {
        // if the text has non whitespace - reset the BR counter
        if (/\S+/.test(this.data)) {
          nodes.push(this);
          brs = 0;
        }
      } else if (this.nodeType == 1) {
        if (/^br$/i.test(this.tagName)) { // anchored so tags like ABBR don't match
          if (++brs == 2) {
            $(this).remove(); // remove this BR from the dom
            $(nodes.pop()).remove(); // delete the previous BR from the array and the DOM
            makeP();
          } else {
            nodes.push(this);
          }
        } else if (/^(?:p)$/i.test(this.tagName)) {
          // these tags force a P break, but don't scan within
          makeP();
        } else if (/^(?:div)$/i.test(this.tagName)) {
          // force a P break and scan within
          makeP();
          $(this).pmaker();
        } else {
          brs = 0; // some other tag - reset brs.
          nodes.push(this); // add the node 
          // specific nodes to not peek inside of - inline tags
          if (!(/^(?:b|i|strong|em|span|u)$/i.test(this.tagName))) {
            $(this).pmaker(); // peek inside for P needs            
          }
        } 
      } 
    });
    while ((brs--)>0) { // remove any extra BR's at the end
      $(nodes.pop()).remove();
    }
    makeP();
    return this;
  };

  // run it against something (we are already inside a DOM-ready handler):
  $("#worker").pmaker();
});

And this was the html portion I tested against:

<div id="worker">
bla bla bla long <b>paragraph</b> here
<br/>
<br/>
bla bla bla more paragraph text
<br/>
<br/>
this text should end up in a P
<div class='test'>
  and so should this
  <br/>
  <br/>
  and this<br/>without breaking at the single BR
</div>
and then we have the a "buggy" clause
<p>
  fear the real P!
</p>
and a trailing br<br/>
</div>

And the result:

<div id="worker"><p>
bla bla bla long <b>paragraph</b> here
</p>
<p>
bla bla bla more paragraph text
</p>
<p>
this text should end up in a P
</p><div class="test"><p>
  and so should this
  </p>
  <p>
  and this<br/>without breaking at the single BR
</p></div><p>
and then we have the a "buggy" clause
</p><p>
  fear the real P!
</p><p>
and a trailing br</p>
</div>
gnarf
P.S. If you like this answer, please +1 Roboprog; he inspired it.
gnarf
Thanks for the reference / credit.
Roboprog
+1  A: 

Sometimes you need to preserve single line-breaks (not all <br /> elements are bad), and you only want to turn double instances of <br /> into paragraph breaks.

In doing so I would:

  1. Remove all line breaks
  2. Wrap the whole lot in a paragraph
  3. Replace <br /><br /> with </p>\n<p>
  4. Lastly, remove any empty <p></p> elements that might have been generated

So the code could look something like:

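var ConvertToParagraphs = function(text) {
    // 1. remove all line breaks (including Windows-style \r\n)
    var lineBreaksRemoved = text.replace(/\r?\n/g, "");
    // 2. wrap the whole lot in a paragraph
    var wrappedInParagraphs = "<p>" + lineBreaksRemoved + "</p>";
    // 3. turn each pair of <br> elements into a paragraph boundary
    var brsRemoved = wrappedInParagraphs.replace(/<br[^>]*>\s*<br[^>]*>/gi, "</p>\n<p>");
    // 4. drop empty (or white-space-only) paragraphs
    var emptyParagraphsRemoved = brsRemoved.replace(/<p>\s*<\/p>/g, "");
    return emptyParagraphsRemoved;
};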

Note: I've been exceedingly verbose to show the process; you'd simplify it, of course.

This turns your sample:

bla bla bla long paragraph here
<br/>
<br/>
bla bla bla more paragraph text
<br/>
<br/>

Into:

<p>bla bla bla long paragraph here</p>
<p>bla bla bla more paragraph text</p>

But it does so without removing any <br /> elements that you may actually want.
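
For illustration, here is a condensed sketch of the same steps, run on input that mixes a single <br/> with doubled ones. The name toParagraphs is hypothetical, and the empty-paragraph pattern tolerates white-space inside the tags.

```javascript
// Hypothetical compact variant of the steps listed above.
function toParagraphs(text) {
  return ("<p>" + text.replace(/\r?\n/g, "") + "</p>")   // strip line breaks, wrap in one <p>
    .replace(/<br[^>]*>\s*<br[^>]*>/gi, "</p>\n<p>")     // a pair of <br> becomes a paragraph break
    .replace(/<p>\s*<\/p>/g, "");                        // drop empty paragraphs
}

var sample = "first line<br/>second line\n<br/>\n<br/>\nnext paragraph<br/><br/>";
var converted = toParagraphs(sample);
// The single <br/> between "first line" and "second line" is still present in
// converted, while each doubled <br/> has become a paragraph boundary.
```

A run of three <br/> elements would leave one <br/> at the start of the following paragraph, since only pairs are replaced.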

Jason Berry