Hi,

I need help replacing all \n (newline) characters with <br /> in a String, but not those \n inside [code][/code] tags. My brain is burning, I can't solve this on my own :(

Example:

test test test
test test test
test
test

[code]some
test
code
[/code]

more text

Should be:

test test test<br />
test test test<br />
test<br />
test<br />
<br />
[code]some
test
code
[/code]<br />
<br />
more text<br />

Thanks for your time. Best regards.

+6  A: 

I would suggest a (simple) parser, and not a regular expression. Something like this (bad pseudocode):

stack elementStack;

foreach(char in string) {
    if(string-from-char == "[code]") {
        elementStack.push("code");
        string-from-char = "";
    }

    if(string-from-char == "[/code]") {
        elementStack.popTo("code");
        string-from-char = "";
    }

    if(char == "\n" && !elementStack.contains("code")) {
        char = "<br/>\n";
    }
}
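
A minimal runnable sketch of the pseudocode above, in Python. It assumes [code] is the only element that matters, so a plain depth counter stands in for the element stack; the function name add_breaks and the br parameter are made up for illustration.

```python
def add_breaks(text, br="<br />"):
    """Insert `br` before each newline that is outside [code]...[/code]."""
    out = []
    depth = 0  # stands in for the element stack: number of open [code] tags
    i = 0
    while i < len(text):
        if text.startswith("[code]", i):
            depth += 1
            out.append("[code]")
            i += len("[code]")
        elif text.startswith("[/code]", i):
            depth = max(depth - 1, 0)  # assumption: silently ignore a stray closing tag
            out.append("[/code]")
            i += len("[/code]")
        else:
            ch = text[i]
            if ch == "\n" and depth == 0:
                out.append(br)
            out.append(ch)
            i += 1
    return "".join(out)
```

A real stack would only be needed if other tags had to be tracked as well.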
strager
+4  A: 

You've tagged the question regex, but this may not be the best tool for the job.

You might be better using basic compiler building techniques (i.e. a lexer feeding a simple state machine parser).

Your lexer would identify five tokens: ("[code]", '\n', "[/code]", EOF, :all other strings:) and your state machine looks like:

state    token    action
------------------------
begin    :none:   --> out
out      [code]   OUTPUT(token), --> in
out      \n       OUTPUT(break), OUTPUT(token)
out      *        OUTPUT(token)
in       [/code]  OUTPUT(token), --> out
in       *        OUTPUT(token)
*        EOF      --> end
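
The table above could be sketched in Python roughly like this. The names tokenize and convert are hypothetical, the lexer is just re.split with a capturing group, and EOF is simply the end of the token list.

```python
import re

def tokenize(text):
    # Splitting on a capturing group keeps the delimiters as tokens;
    # empty strings between adjacent delimiters are dropped.
    return [t for t in re.split(r"(\[code\]|\[/code\]|\n)", text) if t]

def convert(text, br="<br />"):
    state = "out"
    output = []
    for token in tokenize(text):
        if state == "out":
            if token == "[code]":
                state = "in"        # out, [code] --> in
            elif token == "\n":
                output.append(br)   # out, \n: OUTPUT(break) first
        elif token == "[/code]":
            state = "out"           # in, [/code] --> out
        output.append(token)        # every row ends with OUTPUT(token)
    return "".join(output)
```

Each branch corresponds to one row of the table, so extra rows translate into extra branches.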

EDIT: I see other posters discussing the possible need for nesting the blocks. This state machine won't handle that. For nested blocks, use a recursive descent parser (not quite so simple, but still easy enough and extensible).

EDIT: Axeman notes that this design excludes the use of "[/code]" in the code. An escape mechanism can work around this. For example, add '\' to your tokens and add:

state    token    action
------------------------
in       \        -->esc-in
esc-in   *        OUTPUT(token), -->in
out      \        -->esc-out
esc-out  *        OUTPUT(token), -->out

to the state machine.
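
As a rough Python sketch of the escape rows, assuming the backslash itself is dropped and the token after it is emitted literally (the function name convert_with_escapes is made up):

```python
import re

def convert_with_escapes(text, br="<br />"):
    # Tokens: [code], [/code], newline, backslash, and runs of other text.
    tokens = [t for t in re.split(r"(\[code\]|\[/code\]|\n|\\)", text) if t]
    state = "out"
    output = []
    for token in tokens:
        if state in ("esc-in", "esc-out"):
            # The token after a backslash is emitted as-is, with no special meaning.
            output.append(token)
            state = "in" if state == "esc-in" else "out"
        elif token == "\\":
            state = "esc-in" if state == "in" else "esc-out"
        elif state == "out":
            if token == "[code]":
                state = "in"
            elif token == "\n":
                output.append(br)
            output.append(token)
        else:  # state == "in"
            if token == "[/code]":
                state = "out"
            output.append(token)
    return "".join(output)
```

With this, a code block can contain the closing tag by writing \[/code], and a literal backslash has to be written as two backslashes.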

The usual arguments in favor of machine generated lexers and parsers apply.

dmckee
That's not too bad, but it doesn't allow the code to use the string "[/code]", or have this value in comments. However, some of us got used to writing '</' + 'script>' in JavaScript as well. Still, it wouldn't let the code just be the code.
Axeman
True enough. But the OP hasn't identified an escape mechanism for the code block. "Oh what a tangled web we weave, when first we practice to language design." Or something like that.
dmckee
+1  A: 

To get it right, you really need to make three passes:

  1. Find [code] blocks and replace them with a unique token + index (saving the original block), e.g., "foo [code]abc[/code] bar[code]efg[/code]" becomes "foo TOKEN-1 barTOKEN-2"
  2. Do your newline replacement.
  3. Scan for escape tokens and restore the original block.

The code looks something* like:

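// Pass 1: replace each [code] block with a numbered token, saving the original.
Matcher m = escapePattern.matcher(input);
while(m.find()) {
    String key = nextKey();
    escaped.put(key,m.group());
    m.appendReplacement(output1,"TOKEN-"+key);
}
m.appendTail(output1);
// Pass 2: replace newlines in the text that remains.
Matcher m2 = newlinePattern.matcher(output1);
while(m2.find()) {
    m2.appendReplacement(output2,Matcher.quoteReplacement(newlineReplacement));
}
m2.appendTail(output2);
// Pass 3: swap each token back for the block it stands for.
Matcher m3 = Pattern.compile("TOKEN-(\\d+)").matcher(output2); 
while(m3.find()) {
    // quoteReplacement keeps '$' and '\' in the saved code from being read as group references
    m3.appendReplacement(finalOutput,Matcher.quoteReplacement(escaped.get(m3.group(1))));
}
m3.appendTail(finalOutput);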

That's the quick and dirty way. There are more efficient ways (others have mentioned parser/lexers), but unless you're processing millions of lines and your code is CPU bound (rather than I/O bound, like most webapps) and you've confirmed with a profiler that this is the bottleneck, they probably aren't worth it.

* I haven't run it, this is all from memory. Just check the API and you'll be able to work it out.
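
For comparison, the same three passes as a short Python sketch. The placeholder format (an index wrapped in NUL characters) is an assumption that the input never contains NUL, and the function name is made up.

```python
import re

def add_breaks_three_pass(text, br="<br />\n"):
    saved = []

    def stash(match):
        # Remember the whole [code]...[/code] block and leave a placeholder behind.
        saved.append(match.group(0))
        return "\x00%d\x00" % (len(saved) - 1)

    # Pass 1: swap each code block for a placeholder.
    masked = re.sub(r"\[code\].*?\[/code\]", stash, text, flags=re.DOTALL)
    # Pass 2: newline replacement on what is left.
    masked = masked.replace("\n", br)
    # Pass 3: restore the saved blocks. A function replacement is used so that
    # backslashes in the saved code are not treated as escapes.
    return re.sub(r"\x00(\d+)\x00", lambda m: saved[int(m.group(1))], masked)
```

A visible token like TOKEN-1 would also work, but it could collide with ordinary text that happens to contain the same string.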

noah
You're right about the cost of writing a lexer/parser, but they scale well with complexity of the problem statement as well. And that might get bigger than this.
dmckee
+2  A: 

As mentioned by other posters, regular expressions are not the best tool for the job because they are almost universally implemented as greedy algorithms. This means that even if you tried to match code blocks using something like:

(\[code\].*\[/code\])

Then the expression will match everything from the first [code] tag to the last [/code] tag, which is clearly not what you want. While there are ways to get around this, the resulting regular expressions are usually brittle, unintuitive, and downright ugly. Something like the following python code would work much better.

# `text` holds the string to process
output = []
def add_brs(s):
    return s.replace('\n','<br/>\n')
# the first block will *not* have a matching [/code] tag
blocks = text.split('[code]')
output.append(add_brs(blocks[0]))
# for all the rest of the blocks, only add <br/> tags to
# the segment after the [/code] segment
for block in blocks[1:]:
    parts = block.split('[/code]')
    # exactly one [/code] per block gives exactly two parts
    if len(parts) != 2:
        raise ValueError('Too many or few [/code] tags')
    # the segment in the code block is pre, everything
    # after is post
    pre, post = parts
    # split() dropped the tags, so put them back
    output.append('[code]' + pre + '[/code]')
    output.append(add_brs(post))
# finally join all the processed segments together
output = "".join(output)

Note that the above code was not tested; it's just a rough idea of what you'll need to do.

shsmurfy
For this use case, there are clearly not going to be nested [code] blocks, so a reluctant quantifier takes care of that, e.g., "\[code\].*?\[/code\]" will stop as soon as it encounters "[/code]".
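
A quick way to see the difference between the two quantifiers, as a Python sketch (the sample string is made up):

```python
import re

sample = "x [code]a[/code] y [code]b[/code] z"

# Greedy: .* runs to the last [/code], so both blocks come back as one match.
greedy = re.findall(r"\[code\].*\[/code\]", sample, flags=re.DOTALL)

# Reluctant: .*? stops at the first [/code], so each block is its own match.
reluctant = re.findall(r"\[code\].*?\[/code\]", sample, flags=re.DOTALL)

print(greedy)
print(reluctant)
```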
noah
You're a bit wrong on the regexp description (as @noah pointed out), but the Python looks good (at least in theory).
strager
It's not a bad way to handle this problem, but it won't generalize easily if the problem becomes much more complicated. +1 anyway.
dmckee
+2  A: 

This seems to do it:

private final static String PATTERN = "\\*+";

public static void main(String args[]) {
    Pattern p = Pattern.compile("(.*?)(\\[/?code\\])", Pattern.DOTALL);
    String s = "test 1 ** [code]test 2**blah[/code] test3 ** blah [code] test * 4 [code] test 5 * [/code] * test 6[/code] asdf **";
    Matcher m = p.matcher(s);
    StringBuffer sb = new StringBuffer(); // note: it has to be a StringBuffer not a StringBuilder because of the Pattern API
    int codeDepth = 0;
    while (m.find()) {
        if (codeDepth == 0) {
            m.appendReplacement(sb, m.group(1).replaceAll(PATTERN, ""));
        } else {
            m.appendReplacement(sb, m.group(1));
        }
        if (m.group(2).equals("[code]")) {
            codeDepth++;
        } else {
            codeDepth--;
        }
        sb.append(m.group(2));
    }
    if (codeDepth == 0) {
        StringBuffer sb2 = new StringBuffer();
        m.appendTail(sb2);
        sb.append(sb2.toString().replaceAll(PATTERN, ""));
    } else {
        m.appendTail(sb);
    }
    System.out.printf("Original: %s%n", s);
    System.out.printf("Processed: %s%n", sb);
}
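
The same depth-counting idea applied to the newline problem from the question could look something like this in Python. It is only a sketch: the function name is made up, and unlike the Java above it refuses to let the depth go negative on a stray [/code].

```python
import re

def add_breaks_nested(text, br="<br />\n"):
    out = []
    depth = 0  # how many [code] tags are currently open
    pos = 0
    for m in re.finditer(r"\[/?code\]", text):
        chunk = text[pos:m.start()]
        # Only text at depth 0 is outside every code block.
        out.append(chunk.replace("\n", br) if depth == 0 else chunk)
        if m.group(0) == "[code]":
            depth += 1
        elif depth > 0:
            depth -= 1
        out.append(m.group(0))
        pos = m.end()
    tail = text[pos:]
    out.append(tail.replace("\n", br) if depth == 0 else tail)
    return "".join(out)
```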

It's not a straightforward regex, but I don't think you can do what you want with a straightforward regex. Not with handling nested elements and so forth.

cletus
Good one, but I doubt [code] tags can be nested (in standard BBCode, at least).
PhiLho
True, but the point is the algorithm can be modified to handle arbitrary nesting correctly.
cletus
+2  A: 

It is hard because, while regexes are good at finding something, they are not so good at matching everything except something... So you have to use a loop; I doubt you can do that in one go.

After searching, I found something close to cletus's solution, except that I assumed code blocks cannot be nested, which leads to simpler code: choose whichever suits your needs.

import java.util.regex.*;

class Test
{
  static final String testString = "foo\nbar\n[code]\nprint'';\nprint{'c'};\n[/code]\nbar\nfoo";
  static final String replaceString = "<br>\n";
  public static void main(String args[])
  {
    Pattern p = Pattern.compile("(.+?)(\\[code\\].*?\\[/code\\])?", Pattern.DOTALL);
    Matcher m = p.matcher(testString);
    StringBuilder result = new StringBuilder();
    while (m.find()) 
    {
      result.append(m.group(1).replaceAll("\\n", replaceString));
      if (m.group(2) != null)
      {
        result.append(m.group(2));
      }
    }
    System.out.println(result.toString());
  }
}

This is only a crude quick test; you need to test more cases (null, empty string, no code tag, multiple code blocks, etc.).
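
For what it's worth, the non-nested case can also be sketched in Python with re.split and a capturing group, which hands back the code blocks at the odd positions of the result. The function name is made up, and the replacement string is the same "<br>\n" as in the Java above.

```python
import re

def add_breaks_split(text, br="<br>\n"):
    # With a capturing group, re.split returns
    # [outside, code block, outside, code block, ..., outside].
    parts = re.split(r"(\[code\].*?\[/code\])", text, flags=re.DOTALL)
    return "".join(
        part if i % 2 else part.replace("\n", br)  # odd index = code block, left alone
        for i, part in enumerate(parts)
    )

print(add_breaks_split("foo\nbar\n[code]\nprint'';\nprint{'c'};\n[/code]\nbar\nfoo"))
```

Empty input, input with no code tag, and input that starts with a code block all go through the same path, since re.split simply returns empty strings for the missing outside parts.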

PhiLho