I've got a bunch of text like this:

foo
bar

baz

What's likely to be the most efficient way in C++ of transforming that to this:

<p>foo<br />bar</p>
<p>baz</p>

for large(ish) quantities of text (up to 8000 characters).

I'm happy to use boost's regex_replace, but I was wondering if string searching for \n\n might be more efficient? Any thoughts? Any other approaches?

Most third-party libraries are not available to me in the environment I'm working in.

+5  A: 

I would use a simple state machine. It does require checking the state on each pass through the loop, but that should not matter (it could be optimised with an inner loop in the third state - see below). The start state would be the same as the state reached after two newlines have been encountered. There would be one variable for the previous character and one for keeping track of the position of the last newline (used for generating output).

The states would be:

  • encountered double newline. Action on entering the state: output <p>, the line and </p>

  • encountered single newline. Action on entering the state: output the line and <br />

  • encountered normal character

The program would look more like a C-program, though...
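
A minimal sketch of the three states described above, for illustration only. The enum, its values and the function name paragraphsToHtml are hypothetical, and unlike the description it emits output character by character instead of remembering the position of the last newline:

```cpp
#include <string>

// Hypothetical names; a sketch of the three-state idea, not the answerer's code.
enum State { AfterParagraphBreak, AfterSingleNewline, InText };

std::string paragraphsToHtml(const std::string& text) {
    std::string out;
    out.reserve(text.size() + text.size() / 4);
    State state = AfterParagraphBreak;  // start state: as if "\n\n" was just seen
    for (std::string::size_type i = 0; i < text.size(); ++i) {
        const char c = text[i];
        if (c == '\n') {
            if (state == InText) {
                state = AfterSingleNewline;   // first newline: decision is pending
            } else if (state == AfterSingleNewline) {
                out += "</p>\n";              // second newline: close the paragraph
                state = AfterParagraphBreak;
            }
            // Further newlines in AfterParagraphBreak are ignored.
        } else {
            if (state == AfterParagraphBreak) out += "<p>";
            else if (state == AfterSingleNewline) out += "<br />";
            out += c;
            state = InText;
        }
    }
    if (state != AfterParagraphBreak) out += "</p>\n";  // close a trailing paragraph
    return out;
}
```

Calling paragraphsToHtml("foo\nbar\n\nbaz") yields the two paragraphs from the question, each followed by a newline. Runs of three or more newlines are treated as a single paragraph break.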

Peter Mortensen
+1 - Got a simple version of this working, though I've adapted it slightly: I start by pushing <p>, and the start state is "no newlines encountered". On encountering a newline I set "newline encountered", unless it was already set, in which case I unset it and push </p><p>. If I encounter a non-newline character x while "newline encountered" is set, I push <br />x and unset the flag; otherwise I just push x. However, I think Vinay's answer might be a bit quicker and easier to manage.
Dominic Rodger
+1  A: 

If your data contains no surprises, you can just replace all instances of \n\n with </p><p>, followed by replacing all \n with <br/>. Then bracket the result with <p> and </p>, and you're done. This doesn't deal with edge cases (for example, three newlines separating paragraphs) but it is pretty simple, and quicker than writing a state machine!

Update: Obviously, if you have \n\n\n, \n\n\n\n etc. then you can also replace those with </p><p> starting with the longer sequences first.
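
For illustration, the replace-based approach could be sketched roughly as follows, using only std::string. The helper replaceAll and the function textToHtmlByReplace are hypothetical names, and the sketch assumes no run of newlines is longer than four:

```cpp
#include <string>

// Hypothetical helper: replace every occurrence of `from` in `s` with `to`.
void replaceAll(std::string& s, const std::string& from, const std::string& to) {
    if (from.empty()) return;
    std::string::size_type pos = 0;
    while ((pos = s.find(from, pos)) != std::string::npos) {
        s.replace(pos, from.size(), to);
        pos += to.size();  // continue after the inserted text so it is not rescanned
    }
}

std::string textToHtmlByReplace(std::string text) {
    // Longest runs first, so "\n\n\n" is not split into "\n\n" plus a stray "\n".
    // Assumption: no run is longer than four newlines.
    replaceAll(text, "\n\n\n\n", "</p><p>");
    replaceAll(text, "\n\n\n", "</p><p>");
    replaceAll(text, "\n\n", "</p><p>");
    replaceAll(text, "\n", "<br/>");
    return "<p>" + text + "</p>";
}
```

Leading or trailing blank lines would still produce empty paragraphs or stray <br/> tags here, so the input may need trimming first.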

Vinay Sajip
+1 - That looks like it'll work well and be very quick - thanks. For some reason that (now obvious-seeming) solution hadn't occurred to me.
Dominic Rodger
Sadly, a lot of my input has three newlines separating paragraphs. D'oh!
Dominic Rodger
I wonder if a `regex_replace` might be more appropriate to replace longer sequences of `\n\n...` with `</p><p>`.
Dominic Rodger
I'd say, go for it!
Vinay Sajip
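
A rough sketch of the regex idea from the comments above. The question mentions boost's regex_replace; this version assumes a C++11 standard library and uses std::regex instead, and the function name textToHtmlByRegex is made up for the example:

```cpp
#include <regex>
#include <string>

// Assumes C++11 <regex>; the function name is hypothetical.
std::string textToHtmlByRegex(const std::string& text) {
    // Two or more consecutive newlines count as one paragraph break.
    static const std::regex paragraphBreak("\\n{2,}");
    // Any newline left after that is a line break inside a paragraph.
    static const std::regex lineBreak("\\n");

    std::string html = std::regex_replace(text, paragraphBreak, "</p><p>");
    html = std::regex_replace(html, lineBreak, "<br />");
    return "<p>" + html + "</p>";
}
```

This handles any number of blank lines between paragraphs in one pass, but it does not trim leading or trailing newlines, and whether it is faster than plain string searching would need measuring on real input.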
+2  A: 

Don't forget to encode your text for HTML entities! e.g. if you have

foo&

you'll need to translate it appropriately:

foo&amp;

(don't know if you're aware - it's just not been mentioned and often gets forgotten!)
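
As an illustration, a minimal escaping pass might look like the sketch below. The function name escapeHtml is hypothetical, and it only covers the four characters that matter most in element content and quoted attributes. It would need to run on the raw text before any <p> or <br /> tags are inserted, otherwise the tags themselves get escaped:

```cpp
#include <string>

// Hypothetical helper: escape the characters that are special in HTML.
std::string escapeHtml(const std::string& text) {
    std::string out;
    out.reserve(text.size());
    for (std::string::size_type i = 0; i < text.size(); ++i) {
        switch (text[i]) {
            case '&':  out += "&amp;";  break;
            case '<':  out += "&lt;";   break;
            case '>':  out += "&gt;";   break;
            case '"':  out += "&quot;"; break;
            default:   out += text[i];  break;
        }
    }
    return out;
}
```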

Brian Agnew
+1 - hadn't forgotten, but thanks for reminding me :-)
Dominic Rodger
Just think of the above as reminding others :-)
Brian Agnew
A: 

Tight, fast, and ugly state machine. Handles degenerate cases, like empty input, blank lines at the beginning of the input, long strings of blank lines between paragraphs, and a missing newline marker at the end of the input.

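#include <iostream>
#include <iterator>
#include <string>

// Copies [begin, end) to target, wrapping each paragraph in <p>...</p> and
// turning a single newline inside a paragraph into <br />.
template <typename InputIt, typename OutputIt>
void TextToHTML(InputIt begin, InputIt end, OutputIt target) {
        // Skip blank lines before a paragraph; stop at the end of the input.
start:  if (begin == end) return;
        if (*begin == '\n') { ++begin; goto start; }
        *target++ = '<'; *target++ = 'p'; *target++ = '>';
        // Copy characters until a newline or the end of the input.
para:   *target++ = *begin++;
        if (begin == end) goto endp;
        if (*begin != '\n') goto para;
        // One newline seen: a second one (or the end of the input) ends the paragraph.
        if (++begin == end) goto endp;
        if (*begin == '\n') goto endp;
        *target++ = '<'; *target++ = 'b'; *target++ = 'r'; *target++ = ' '; *target++ = '/'; *target++ = '>';
        goto para;
        // Close the paragraph, then look for the next one.
endp:   *target++ = '<'; *target++ = '/'; *target++ = 'p'; *target++ = '>'; *target++ = '\n';
        goto start;
}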

int main() {
    std::string text = "foo\nbar\n\nbaz";
    std::string html;
    TextToHTML(text.begin(), text.end(), std::back_inserter(html));
    std::cout << html << std::endl;
    return 0;
}
Adrian McCarthy