I came up with a partial solution that uses an XML library (XOM, http://www.xom.nu) to hold the tree. The code comes first, then an example parse. The escaped characters (\ , ( and ) ) are first replaced by placeholders (here BS, LB and RB), the remaining brackets are translated to XML tags, and then the XML is parsed and the placeholders are converted back to characters. What is still needed is a BNF for Java 1.6 regexes to deal with constructs such as ?:, {d,d} and so on.
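import java.io.StringReader;
import nu.xom.Attribute;
import nu.xom.Builder;
import nu.xom.Element;
import nu.xom.Node;
import nu.xom.Text;

public static Element parseRegex(String regex) throws Exception {
    // Replace every backslash, then the escaped brackets, with placeholders.
    regex = regex.replaceAll("\\\\", "BS");
    regex = regex.replaceAll("BS\\(", "LB");
    regex = regex.replaceAll("BS\\)", "RB");
    // The remaining brackets delimit groups; turn them into XML tags.
    regex = regex.replaceAll("\\(", "<bracket>");
    regex = regex.replaceAll("\\)", "</bracket>");
    Element regexX = new Builder().build(new StringReader(
            "<regex>" + regex + "</regex>")).getRootElement();
    extractCaptureGroupContent(regexX);
    return regexX;
}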
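// Walks the tree, restores the placeholders in each text node and stores
// the reassembled content of every element in a "capture" attribute.
private static String extractCaptureGroupContent(Element regexX) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < regexX.getChildCount(); i++) {
        Node childNode = regexX.getChild(i);
        if (childNode instanceof Text) {
            Text t = (Text) childNode;
            String s = t.getValue();
            // In a replacement string a backslash escapes the next character,
            // so "\\(" inserts a plain "(" (see the example output).
            s = s.replaceAll("BS", "\\\\").replaceAll("LB",
                    "\\(").replaceAll("RB", "\\)");
            t.setValue(s);
            sb.append(s);
        } else {
            // A nested <bracket>: recurse and put the brackets back.
            sb.append("(" + extractCaptureGroupContent((Element) childNode) + ")");
        }
    }
    String capture = sb.toString();
    regexX.addAttribute(new Attribute("capture", capture));
    return capture;
}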
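The substitution steps can be checked on their own, without XOM. This is only a sketch using the JDK: the class and method names (PlaceholderDemo, toXmlString) are made up for illustration, and String.replace is used instead of replaceAll because every pattern here is literal text.

```java
class PlaceholderDemo {
    // Hypothetical helper (not part of the answer's code): performs the same
    // placeholder substitutions as parseRegex and returns the XML string
    // that would be handed to the XML parser.
    static String toXmlString(String regex) {
        String s = regex.replace("\\", "BS");  // every backslash becomes BS
        s = s.replace("BS(", "LB");            // escaped "(" becomes LB
        s = s.replace("BS)", "RB");            // escaped ")" becomes RB
        s = s.replace("(", "<bracket>");       // remaining "(" opens a group
        s = s.replace(")", "</bracket>");      // remaining ")" closes a group
        return "<regex>" + s + "</regex>";
    }

    public static void main(String[] args) {
        System.out.println(toXmlString("(.*(\\(b\\))c(d(e)))"));
    }
}
```

For the regex used in the example parse, the escaped brackets show up as LBbRB inside the second <bracket> element, which is the text the tree walk later converts back.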
example:
@Test
public void testParseRegex2() throws Exception {
    String regex = "(.*(\\(b\\))c(d(e)))";
    Element regexElement = ParserUtil.parseRegex(regex);
    CMLUtil.debug(regexElement, "x");
}
gives:
<regex capture="(.*((b))c(d(e)))">
  <bracket capture=".*((b))c(d(e))">.*
    <bracket capture="(b)">(b)</bracket>c
    <bracket capture="d(e)">d
      <bracket capture="e">e</bracket>
    </bracket>
  </bracket>
</regex>
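The placeholder approach breaks if the regex contains the literal text BS, LB or RB, or XML special characters such as < and &. As a possible alternative, here is a rough sketch that scans the string directly with a stack and needs no XML library. The class and method names (GroupScanner, captureGroups) are invented for this sketch. It assumes Java 1.6 syntax, where every group starting with (? is non-capturing, and it does not handle brackets inside character classes such as [(] or inside \Q...\E quoting. Unlike the XOM version it keeps the escapes in the reported group content.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class GroupScanner {
    // Returns the content of each capturing group, ordered by the position
    // of its opening bracket (the same order as Matcher.group(1), group(2), ...).
    static List<String> captureGroups(String regex) {
        List<String> groups = new ArrayList<String>();
        Deque<Integer> slots = new ArrayDeque<Integer>();   // index in groups, -1 if non-capturing
        Deque<Integer> starts = new ArrayDeque<Integer>();  // position of each open "("
        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            if (c == '\\') {
                i++;                                  // skip the escaped character
            } else if (c == '(') {
                // Assumption: in Java 1.6 any "(?" group is non-capturing.
                boolean capturing = !regex.startsWith("(?", i);
                if (capturing) {
                    groups.add(null);                 // reserve the slot now to keep the order
                    slots.push(groups.size() - 1);
                } else {
                    slots.push(-1);
                }
                starts.push(i);
            } else if (c == ')') {
                if (slots.isEmpty()) {
                    throw new IllegalArgumentException("Unbalanced ) at index " + i);
                }
                int slot = slots.pop();
                int start = starts.pop();
                if (slot >= 0) {
                    groups.set(slot, regex.substring(start + 1, i));
                }
            }
        }
        if (!slots.isEmpty()) {
            throw new IllegalArgumentException("Unbalanced ( in regex");
        }
        return groups;
    }

    public static void main(String[] args) {
        // Same regex as in the example parse above.
        for (String g : captureGroups("(.*(\\(b\\))c(d(e)))")) {
            System.out.println(g);
        }
    }
}
```

This only lists the groups. Building a tree like the XML one would mean pushing node objects instead of integer positions, and quantifiers such as {d,d} would still need a proper grammar.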