ansaurus

Question

How do I convert a regular expression in a valid .NET format to valid Java format?

Answer 1

+1 A:

Named groups are done differently in .NET than in all the other Regex flavors. You have:

(?<Domain>pattern)

Java (and everyone else) expects:

(?P<Domain>pattern)

Rex M 2009-08-02 01:39:24

Java doesn't have named groups at all, yet. They're being added in Java 7.
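
For readers on Java 7 or later, where named groups did arrive, here is a minimal sketch of how they look. The syntax is (?<name>...), the same as .NET's, and the capture is read back with Matcher.group(String). The class name, the simplified pattern and the sample URL are made up for illustration; they are not the asker's regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupDemo {
    // Hypothetical, simplified URL pattern using Java 7+ named groups.
    private static final Pattern URL =
        Pattern.compile("\\w+://(?<Domain>[^/\\s]+)(?<Path>/\\S*)?");

    // Returns the text captured by the "Domain" group, or null if nothing matches.
    static String domainOf(String input) {
        Matcher m = URL.matcher(input);
        return m.find() ? m.group("Domain") : null;
    }

    public static void main(String[] args) {
        System.out.println(domainOf("http://example.com/index.html"));
    }
}
```

Java group names may contain only ASCII letters and digits and must start with a letter, so a .NET name containing an underscore would need renaming.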

Alan Moore 2009-08-02 05:18:08

Answer 2

+2 A:

Java does not have the @ verbatim-string notation, so make sure you escape every '\' in your regex (\w+ becomes \\w+, \/ becomes \\/, \x21 becomes \\x21, etc.).
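
To make the doubling concrete, here is a small sketch. Each Java string literal below holds the regex shown in its comment, so "\\w+" is a three-character string. The class and helper names are illustrative only.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    static final String WORD  = "\\w+";   // the regex \w+
    static final String SLASH = "\\/";    // the regex \/
    static final String HEX   = "\\x21";  // the regex \x21, i.e. '!'

    // True if the whole input matches the given regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(WORD + " has " + WORD.length() + " characters");
        System.out.println(matches(WORD, "abc_123"));
        System.out.println(matches(SLASH, "/"));
        System.out.println(matches(HEX, "!"));
    }
}
```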

Michael 2009-08-02 01:50:01

Answer 3

+1 A:

The most direct translation would be:

Pattern p = Pattern.compile(
  "\\w+://([\\x21-\\x22\\x24-\\x2E\\x30-\\x3A\\x40-\\x5A\\x5F\\x61-\\x7A]+)(/?\\S*)",
  Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

Java has no equivalent for C#'s verbatim strings, so you always have to escape backslashes. And Java's regexes don't support named groups, so I converted those to simple capturing groups (named groups are due to be added in Java 7).
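
As a rough usage sketch, the numbered groups take over from the named ones: group 1 is the host part and group 2 is the rest. The class name, the helper method and the sample URLs are assumptions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DirectTranslationDemo {
    // Same pattern and flags as the direct translation above.
    private static final Pattern URL = Pattern.compile(
        "\\w+://([\\x21-\\x22\\x24-\\x2E\\x30-\\x3A\\x40-\\x5A\\x5F\\x61-\\x7A]+)(/?\\S*)",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns {group 1, group 2}, or null if the whole input is not a match.
    static String[] split(String url) {
        Matcher m = URL.matcher(url);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = split("http://example.com/path/page");
        System.out.println(parts[0]);
        System.out.println(parts[1]);
    }
}
```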

But there are a few problems with the original regex:

The RegexOptions.Compiled modifier doesn't do what you probably think it does. Specifically, it's not related to Java's compile() method; that's just a factory method, roughly equivalent to C#'s new Regex() constructor. The Compiled modifier causes the regex to be compiled to CIL bytecode, which can make it match a lot faster, but at a considerable cost in upfront processing and memory use, and that memory never gets garbage-collected. If you don't use the regex a lot, the Compiled option is probably doing more harm than good, performance-wise.
The IgnoreCase/CASE_INSENSITIVE modifier is pointless since your regex always matches both upper- and lowercase variants wherever it matches letters.
The Singleline/DOTALL modifier is pointless since you never use the dot metacharacter.
In .NET regexes, the character-class shorthand \w is Unicode-aware, equivalent to [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}]. In Java it's ASCII-only, i.e. [A-Za-z0-9_], which seems to be more in line with the way you're using it (you could "dumb it down" in .NET by using the RegexOptions.ECMAScript modifier).
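
The \w difference is easy to see from the Java side. In this sketch the default \w rejects an accented letter, while the Pattern.UNICODE_CHARACTER_CLASS flag (available from Java 7) makes \w Unicode-aware, which is closer to the .NET default. The class name, method names and sample strings are illustrative.

```java
import java.util.regex.Pattern;

class WordCharDemo {
    // Default Java behaviour: \w is [A-Za-z0-9_].
    static boolean isWordAscii(String s) {
        return Pattern.compile("\\w+").matcher(s).matches();
    }

    // With UNICODE_CHARACTER_CLASS, \w also accepts non-ASCII letters and digits.
    static boolean isWordUnicode(String s) {
        return Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS)
                      .matcher(s).matches();
    }

    public static void main(String[] args) {
        String word = "caf\u00e9"; // "cafe" with an accented final e
        System.out.println(isWordAscii(word));
        System.out.println(isWordUnicode(word));
    }
}
```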

So the actual translation would be more like this:

Pattern p = Pattern.compile("\\w+://([\\w!\"$-.:@]+)(?:/(\\S*))?");
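
One way to sanity-check a hand-simplified character class against the hex ranges it replaces is to compare the two over every ASCII character. The sketch below assumes the shorthand class [\w!"$-.:@], where $-. is the range \x24-\x2E. Dropping that hyphen loses % & ' ( ) * + , and -, and the comparison reports it. The class and method names are illustrative.

```java
import java.util.regex.Pattern;

class ClassCompare {
    static final String HEX_CLASS =
        "[\\x21-\\x22\\x24-\\x2E\\x30-\\x3A\\x40-\\x5A\\x5F\\x61-\\x7A]";
    static final String SHORT_CLASS = "[\\w!\"$-.:@]";

    // True if both character classes accept exactly the same ASCII characters.
    static boolean sameAsciiSet(String classA, String classB) {
        Pattern a = Pattern.compile(classA);
        Pattern b = Pattern.compile(classB);
        for (char c = 0; c < 128; c++) {
            String s = String.valueOf(c);
            if (a.matcher(s).matches() != b.matcher(s).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(sameAsciiSet(HEX_CLASS, SHORT_CLASS));
        System.out.println(sameAsciiSet(HEX_CLASS, "[\\w!\"$.:@]"));
    }
}
```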

Alan Moore 2009-08-02 08:12:19
