A: 

Your example seems fine to me.

 > Line 1 > Line 2 >> Line 2.1 >> Line 2.2 > Line 3

Unfortunately, a pure regex can't keep track of which nesting level you are on, so it won't know where to put the closing </ul> tags.
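To make the nesting problem concrete, here is a rough Java sketch of what "keeping track of the level" means. It assumes each item sits on its own line and that the number of leading `>` characters is the level, which is only my reading of the example above. The class and method names are made up, and the nested list is emitted directly between items, as elsewhere in this thread.

```java
public class NestingDepthSketch {
  // Converts lines such as "> Line 1" or ">> Line 2.1" into nested <ul> markup.
  // Assumption: one item per line; the count of leading '>' is the nesting level.
  public static String toHtml(String text) {
    StringBuilder out = new StringBuilder();
    int depth = 0; // number of <ul> elements currently open

    for (String line : text.split("\n")) {
      // Count the leading '>' characters to find this line's level.
      int level = 0;
      while (level < line.length() && line.charAt(level) == '>') {
        level++;
      }
      if (level == 0) {
        continue; // not a list line; ignored in this sketch
      }
      // Open or close lists until the output depth matches the line's level.
      for (; depth < level; depth++) out.append("<ul>");
      for (; depth > level; depth--) out.append("</ul>");
      out.append("<li>").append(line.substring(level).trim()).append("</li>");
    }
    // Close whatever is still open at the end of the input.
    for (; depth > 0; depth--) out.append("</ul>");
    return out.toString();
  }

  public static void main(String[] args) {
    String in = "> Line 1\n> Line 2\n>> Line 2.1\n>> Line 2.2\n> Line 3";
    System.out.println(toHtml(in));
  }
}
```

The `depth` counter is the piece of state a single Replace() call has no way to carry from one match to the next.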

Something like this might work:

 * Line 1 * Line 2 > * Line 2.1 * Line 2.2 < * Line 3

Here, the greater-than and less-than move up and down the hierarchy, and the asterisks are the delimiters for the bullets. The spaces before and after each are used as a sort of escape sequence, so you can still use those characters literally or for other purposes like italics and bold when they aren't surrounded by spaces.

A stab at the RegEx:

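 string ol = "<ul>" + Regex.Replace(t, " > ", "<ul>") + "</ul>";
 ol = Regex.Replace(ol, " < ", "</ul>");
 // Lookbehind: the bullet must follow a space or '>', which stays in the output.
 ol = Regex.Replace(ol, "(?<=[ >])\\* ([^*<>]*)", "<li>$1</li>");

Edit: Adjusted to produce XHTML, closing the LI tags, based on comment below. Also fixed my C# syntax.

Final edit: The \* in the last Replace needs to be escaped for C# (hence \\* in the string literal), and .NET replacement strings refer to a captured group as $1 rather than \1. The space-or-greater-than test is a lookbehind so that the preceding character is neither consumed nor dropped from the output. Also, note that the first two Replace() calls can use String.Replace() rather than Regex, which will likely be faster.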
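Since the asker needs Java, here is a rough Java rendering of the same three steps. Treat it as a sketch: the class and method names are made up, and the lookbehind for the space-or-greater-than test is my own choice for checking the preceding character without consuming it. The first two steps use plain String.replace(), as suggested above.

```java
public class MarkerListSketch {
  // Converts " * item * item > * nested < * item" into <ul>/<li> markup.
  public static String toHtml(String t) {
    // Steps 1 and 2 are literal substitutions, so no regex is needed.
    String ol = "<ul>" + t.replace(" > ", "<ul>") + "</ul>";
    ol = ol.replace(" < ", "</ul>");
    // Step 3: an asterisk that follows a space or '>' and precedes a space
    // starts a bullet; its text runs up to the next '*', '<' or '>'.
    return ol.replaceAll("(?<=[ >])\\* ([^*<>]*)", "<li>$1</li>");
  }

  public static void main(String[] args) {
    String in = " * Line 1 * Line 2 > * Line 2.1 * Line 2.2 < * Line 3";
    System.out.println(toHtml(in));
  }
}
```

The spaces that served as delimiters are left where they were, so the output contains a little stray whitespace. Note also that an asterisk inside an item's text ends that item early, because the character class excludes it.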

richardtallent
I wasn't sure if I could use backreferences to write the closing tags.
Dave Jarvis
if you replace "-([^-<>]*)" with "<li>\1</li>" you will get valid xhtml
cobbal
cobbal, yes, but that pattern would replace all hyphens, not just the ones intended to be bullets. Using a hyphen was probably a bad choice since it is common in English anyway--an asterisk would be just as intuitive and would work better for your enhancement.
richardtallent
cobbal, I adjusted my code to use asterisks, made a tweak to your suggestion so it would work in the code above (needed space-or-greater-than because of the replacements above that potentially strip the space before the LI character).
richardtallent
Thanks for the idea, Richard. Got me on the path to a working solution.
Dave Jarvis
A: 

I would not recommend using regular expressions as a parsing and transformation tool. Regular expressions tend to carry high overhead and are not the most efficient means of parsing a language, which is what you are really asking them to do. You have created a language, simple as it is, and you should treat it as such. I recommend writing a dedicated parser for your wiki-style formatting code. Since you can target the parser specifically at your language, it should be more efficient. In addition, you won't have to create a frightening monstrosity of a regex to parse your language and handle all of its nuances. In the long run, you gain clearer code and better maintainability.
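As an illustration of how small such a parser can be, here is a hand-written sketch in Java. It assumes the space-delimited marker syntax from the other answer (`*` starts an item, `>` goes down a level, `<` goes back up) and places each nested list inside its parent item, which is what the XHTML content model for ul requires. The class and method names are made up, and this is not a full wiki parser.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class ListParserSketch {
  // Parses " * item > * nested < * item" into nested <ul>/<li> markup.
  public static String parse(String input) {
    StringBuilder out = new StringBuilder("<ul>");
    // One entry per open <ul>; true means an <li> is currently open at that level.
    Deque<Boolean> itemOpen = new ArrayDeque<>();
    itemOpen.push(false);
    boolean needSpace = false; // whether the next word needs a separating space

    for (String token : input.trim().split("\\s+")) {
      if (token.isEmpty()) continue; // empty input yields one empty token
      switch (token) {
        case "*": // start a new item, closing the previous one at this level
          if (itemOpen.pop()) out.append("</li>");
          out.append("<li>");
          itemOpen.push(true);
          needSpace = false;
          break;
        case ">": // go one level down, inside the current item
          if (!itemOpen.peek()) throw new IllegalArgumentException("'>' before any item");
          out.append("<ul>");
          itemOpen.push(false);
          needSpace = false;
          break;
        case "<": // go one level up
          if (itemOpen.size() < 2) throw new IllegalArgumentException("unbalanced '<'");
          if (itemOpen.pop()) out.append("</li>");
          out.append("</ul>");
          needSpace = false;
          break;
        default: // ordinary word belonging to the current item
          if (!itemOpen.peek()) throw new IllegalArgumentException("text outside an item: " + token);
          if (needSpace) out.append(' ');
          out.append(token);
          needSpace = true;
      }
    }
    // Close everything still open at the end of the input.
    while (!itemOpen.isEmpty()) {
      if (itemOpen.pop()) out.append("</li>");
      out.append("</ul>");
    }
    return out.toString();
  }

  public static void main(String[] args) {
    System.out.println(parse(" * Line 1 * Line 2 > * Line 2.1 * Line 2.2 < * Line 3"));
  }
}
```

Unlike a chain of replacements, the stack lets the parser reject unbalanced input and close any lists left open at the end.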

I suggest the following resources:

jrista
It's a one-off to convert a spreadsheet cell into a database value.
Dave Jarvis
I see. Regex is a possibility; however, I am not sure it will actually meet your needs. ANTLR might still be helpful. It is a parser generator, and your grammar seems fairly simple. Even though it's a one-off, you might still be able to use ANTLR to generate a parser, and write a simple C# program to perform the conversion, in less time than it takes to figure out the RegEx. ;)
jrista
Must be a Java solution, and it needs to run on a J2EE platform. C# and ANTLR are not options.
Dave Jarvis
I doubt I could do it faster than 6 minutes with ANTLR. ;-)
Dave Jarvis
ANTLR is available for Java too, but if you were able to create a RegEx version in 6 minutes, then I would agree, ANTLR is out. :P
jrista
A: 

Solution

A working solution follows:

public class Test {
  public static void main( String[] args ) {
    String in = "= Line 1 = Line 2 > = Line 2.1 = Line 2.2 < = Line 3";

    // Wrap each "= item" in <li> tags; the text stops at the next =, < or >.
    in = in.replaceAll( "= ([^=<>]*)", "<li>$1</li>" );
    // A marker left right after a closing </li> opens or closes a nested list.
    in = in.replace( ">> ", "><ul>" );
    in = in.replace( ">< ", "></ul>" );
    in = "<ul>" + in + "</ul>";
    System.out.println( in );
  }
}

This creates the desired XHTML fragment:

<ul><li>Line 1 </li><li>Line 2 </li><ul><li>Line 2.1 </li><li>Line 2.2 </li></ul><li>Line 3</li></ul>
Dave Jarvis