views:

502

answers:

2

I'm trying to build a BBCode parser, but I'm having trouble figuring out how to avoid matching too widely. For example, I want to implement a [list] to <ul> conversion like this:

\[list\](.*)\[/list\]

would be replaced by this:

<ul>$1</ul>

This works fine, except when the input contains two lists: the regular expression then matches from the opening tag of the first list to the closing tag of the second. So this

[list]list1[/list] [list]list2[/list]

becomes this:

<ul>list1[/list] [list]list2</ul>

which produces really ugly output. Any idea on how to fix this?

+7  A: 

The method you're using may not end up being a particularly good approach, but to solve that specific problem, just change to non-greedy matching:

\[list\](.*?)\[\/list\]

Note that this version will have trouble with nested lists instead of back-to-back ones, because the first [/list] it finds gets paired with the outermost [list].
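
As a rough illustration, here is a minimal Java sketch comparing the greedy and non-greedy patterns on the input from the question. The class name ListTagDemo and the helper convert are made up for this example, and Pattern.DOTALL is an added assumption so that lists spanning several lines also match (by default `.` does not match line terminators).

```java
import java.util.regex.Pattern;

class ListTagDemo {
    // Greedy: (.*) grabs as much as it can, so it runs to the LAST [/list].
    static final Pattern GREEDY =
            Pattern.compile("\\[list\\](.*)\\[/list\\]", Pattern.DOTALL);

    // Non-greedy (reluctant): (.*?) stops at the FIRST [/list] that follows.
    static final Pattern LAZY =
            Pattern.compile("\\[list\\](.*?)\\[/list\\]", Pattern.DOTALL);

    // Hypothetical helper: replace every match with <ul>...</ul>.
    static String convert(Pattern pattern, String input) {
        return pattern.matcher(input).replaceAll("<ul>$1</ul>");
    }

    public static void main(String[] args) {
        String backToBack = "[list]list1[/list] [list]list2[/list]";
        System.out.println(convert(GREEDY, backToBack)); // <ul>list1[/list] [list]list2</ul>
        System.out.println(convert(LAZY, backToBack));   // <ul>list1</ul> <ul>list2</ul>

        // Nested lists still break: the first [/list] closes the outer [list].
        String nested = "[list]a[list]b[/list]c[/list]";
        System.out.println(convert(LAZY, nested));       // <ul>a[list]b</ul>c[/list]
    }
}
```

The last line shows the nested-list limitation: a single regex replacement has no notion of which opening tag a closing tag belongs to.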

Chad Birch
+4  A: 

If what you are doing is not just a lightweight hack, but something more permanent, you probably want to move to a real parser. Regexps in Java are particularly slow (even with precompiled patterns), and matching nested constructs (especially different nested constructs like "foo [u][i] bar [s]baz[/s][/i][/u]") is going to be a royal pain.

Instead, try using a state-based parser that repeatedly cuts your input into sections like "foo " / (u) / "[i] bar [s]baz[/s][/i][/u]", and maintains a set of states that flip whenever you encounter the matching delimiter.
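
One possible shape for such a parser, sketched in Java under several assumptions: the class name BbcodeSketch, the method toHtml, and the tag table are all hypothetical, tag names are case-sensitive, tags take no attributes, and the text is not HTML-escaped. It scans the input once, keeps the currently open tags on a stack, and only accepts a closing tag that matches the innermost open one.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;

class BbcodeSketch {
    // Hypothetical tag table: BBCode name -> HTML element name.
    static final Map<String, String> TAGS =
            Map.of("list", "ul", "b", "b", "u", "u", "i", "i", "s", "s");

    static String toHtml(String input) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>(); // open BBCode tags, innermost on top
        int pos = 0;
        while (pos < input.length()) {
            int lb = input.indexOf('[', pos);
            if (lb < 0) {                        // no more brackets: rest is plain text
                out.append(input, pos, input.length());
                break;
            }
            out.append(input, pos, lb);          // plain text before the bracket
            int rb = input.indexOf(']', lb);
            if (rb < 0) {                        // unterminated bracket: emit literally
                out.append(input, lb, input.length());
                break;
            }
            String body = input.substring(lb + 1, rb);
            boolean closing = body.startsWith("/");
            String name = closing ? body.substring(1) : body;
            String html = TAGS.get(name);
            if (html == null) {                  // not a known tag: keep '[' as text
                out.append('[');
                pos = lb + 1;
                continue;
            }
            if (!closing) {                      // opening tag: remember it
                open.push(name);
                out.append('<').append(html).append('>');
            } else if (name.equals(open.peek())) { // closes the innermost open tag
                open.pop();
                out.append("</").append(html).append('>');
            } else {                             // mismatched close: emit literally
                out.append(input, lb, rb + 1);
            }
            pos = rb + 1;
        }
        while (!open.isEmpty()) {                // auto-close anything left open
            out.append("</").append(TAGS.get(open.pop())).append('>');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHtml("foo [u][i] bar [s]baz[/s][/i][/u]"));
        System.out.println(toHtml("[list]a[list]b[/list]c[/list]"));
    }
}
```

Because each closing tag is checked against the stack, nested and back-to-back lists both come out balanced. How to treat mismatched or unclosed tags (emitted literally and auto-closed here) is a design choice a real parser would need to settle.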

Varkhan
Thanks for the heads up, do you know some resources or a working example of such a parser? Speed really is my main concern ^^
cdecker
Java's built-in regexes are plenty fast enough if you know what you're doing. I agree that regexes are not the right tool for this job, but performance is not the reason.
Alan Moore
Even something as simple as matching a prefix with a precompiled Java regexp is painfully slow (printing all matching lines from a file with a regexp like "^mystring": two orders of magnitude slower than a simple startsWith, and an order of magnitude slower than the same program in Perl... wtf?)
Varkhan