I'm trying to parse the configuration files usually found in /etc/default using Java and regular expressions. So far this is the code I have; it runs for every line of each file:

// remove comments from the line
int hash = line.indexOf("#");
if (hash >= 0) {
    line = line.substring(0, hash);
}

// create the patterns
Pattern doubleQuotePattern = Pattern.compile("\\s*([a-zA-Z_][a-zA-Z_0-9]*)\\s*=\\s*\"(.*)\"\\s*");
Pattern singleQuotePattern = Pattern.compile("\\s*([a-zA-Z_][a-zA-Z_0-9]*)\\s*=\\s*\\'(.*)\\'\\s*");
Pattern noQuotePattern = Pattern.compile("\\s*([a-zA-Z_][a-zA-Z_0-9]*)\\s*=(.*)");

// try to match each of the patterns to the line
Matcher matcher = doubleQuotePattern.matcher(line);
if (matcher.matches()) {
    System.out.println(matcher.group(1) + " == " + matcher.group(2));
} else {
    matcher = singleQuotePattern.matcher(line);
    if (matcher.matches()) {
        System.out.println(matcher.group(1) + " == " + matcher.group(2));
    } else {
        matcher = noQuotePattern.matcher(line);
        if (matcher.matches()) {
            System.out.println(matcher.group(1) + " == " + matcher.group(2));
        }
    }
}

This works as I expect, but I'm pretty sure it could be much shorter with a better regular expression. I haven't had any luck finding one. Does anyone know of a better way to read these types of files?

+2  A: 

You can use ANTLR to generate a parser. Basically you write a grammar for the language you want to work with (or use one of the many grammars already written) and ANTLR will generate a parser for you.

Guillaume
I believe that a simple regular expression should be more than enough. I haven't been able to get it right using the (X|Y|Z) construct while also stripping the double or single quotes.
rmarimon
+1  A: 

In many cases you can use java.util.Properties to process shell configuration files.

Actually, if you don't make these files overly complex, you can share them this way between shell scripts and Java programs.

What it does not handle well is quoted strings: the quotes are kept as part of the value.
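As a rough illustration of this approach, here is a minimal sketch that loads KEY=VALUE text with java.util.Properties and then removes one pair of matching outer quotes from each value. The class name DefaultsViaProperties, the helper stripOuterQuotes, and the sample content are made up for the example. Keep in mind that Properties also treats backslashes as escape characters, which a shell would not do inside single quotes, so this only suits simple files.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class DefaultsViaProperties {

    // Hypothetical helper: removes one pair of matching outer quotes,
    // if present; otherwise returns the value unchanged.
    static String stripOuterQuotes(String value) {
        if (value.length() >= 2) {
            char first = value.charAt(0);
            char last = value.charAt(value.length() - 1);
            if ((first == '"' || first == '\'') && first == last) {
                return value.substring(1, value.length() - 1);
            }
        }
        return value;
    }

    // Loads KEY=VALUE text using the standard Properties parser.
    static Properties load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // Hypothetical sample content, not taken from a real file.
        String sample = "# sample\nENABLED=yes\nOPTIONS=\"-v --fast\"\n";
        Properties props = load(sample);
        for (String key : props.stringPropertyNames()) {
            String raw = props.getProperty(key);
            System.out.println(key + ": raw=" + raw
                    + " stripped=" + stripOuterQuotes(raw));
        }
    }
}
```

The raw value of OPTIONS still carries its double quotes after load; only the extra pass over the values removes them.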

Alexander Pogrebnyak
Quoted strings are exactly the problem I have. I might use the properties files and then go over the values and remove the quotes, but that seems hacky...
rmarimon
+1  A: 

Here is a single Pattern you can use that is equivalent to the three you have above:

Pattern etcPattern = Pattern.compile(
   "\\s*([a-zA-Z_]\\w*)\\s*=\\s*"+
   "(\"|'|.{0,0})(.*?)\\2"+  //QUOTE MATCHING
   "\\s*");

There are three differences between this and yours. First, I replaced the character class [a-zA-Z_0-9] with its predefined equivalent \w (a word character). Second, the QUOTE MATCHING part is a pattern that matches and strips balanced outer quotes, but still allows unbalanced quotes as your three patterns did.

It begins with the group (\"|'|.{0,0}). This matches one of:

  1. A double quote
  2. A single quote
  3. Anything zero times (that is, the empty string)

Next comes your .* pattern, followed by the backreference \2. The backreference says to match whatever was matched by group 2 (the quote pattern). This is where the third case above is important: if the value does not begin with a single or double quote, the pattern still has to match. So it begins by attempting to match one of the quotes. If it can't, group 2 matches the empty string, which in turn allows the backreference to match the empty string.

The final change to make it work is to make the inner .* pattern reluctant (.*?), so that a closing quote is left for the backreference to match whenever possible, and is therefore stripped from the value.
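To see the quote group and the backreference in isolation, here is a small sketch that uses only the QUOTE MATCHING part of the pattern. Because the name group is left out, the groups are renumbered: group 1 is the opening quote (possibly empty), group 2 is the value, and the backreference becomes \1. The class name QuoteGroupDemo and the split helper are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuoteGroupDemo {

    // Only the quote-matching part, renumbered: group 1 = opening quote
    // (possibly empty), group 2 = value, \1 = backreference to the quote.
    static final Pattern VALUE = Pattern.compile("(\"|'|.{0,0})(.*?)\\1");

    // Returns {quote, value} if the whole input matches, otherwise null.
    static String[] split(String input) {
        Matcher m = VALUE.matcher(input);
        return m.matches() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        // Balanced double quotes, balanced single quotes, no quotes,
        // and an unbalanced opening quote.
        for (String s : new String[] { "\"abc\"", "'abc'", "abc", "\"abc" }) {
            String[] parts = split(s);
            System.out.println(s + " -> quote=[" + parts[0]
                    + "] value=[" + parts[1] + "]");
        }
    }
}
```

For the unbalanced input, the attempt with a double quote in group 1 fails because no closing quote can be found, so the engine backtracks to the empty alternative and the stray quote ends up inside the value.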

So you should be able to run this as:

Matcher matcher = etcPattern.matcher(line);
if (matcher.matches()) {
    System.out.println(matcher.group(1) + " == " + matcher.group(3));
}

equivalently to your example above (note the value is now in match group 3 instead of 2). As I said, this matches what your patterns did: specifically, it allows unbalanced quotes and any internal quoting in the value.
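Putting the pieces together, a self-contained sketch could look like the following. It combines the comment stripping from the question with the single pattern and collects the results in a map instead of printing them. The class name EtcDefaultParser, the parse method, and the sample lines are assumptions made for the example, not part of either original snippet.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EtcDefaultParser {

    // The single pattern: group 1 = name, group 2 = opening quote
    // (possibly empty), group 3 = value.
    private static final Pattern ETC_PATTERN = Pattern.compile(
            "\\s*([a-zA-Z_]\\w*)\\s*=\\s*"
            + "(\"|'|.{0,0})(.*?)\\2"
            + "\\s*");

    // Parses lines into an insertion-ordered map of name -> value.
    // Lines that are not assignments are skipped.
    static Map<String, String> parse(Iterable<String> lines) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String line : lines) {
            // Same naive comment removal as in the question: it cuts at
            // the first '#', even if that '#' is inside a quoted value.
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash);
            }
            Matcher matcher = ETC_PATTERN.matcher(line);
            if (matcher.matches()) {
                result.put(matcher.group(1), matcher.group(3));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample content, not taken from a real file.
        List<String> sample = Arrays.asList(
                "# defaults for some daemon",
                "ENABLED=yes   # trailing comment",
                "OPTIONS=\"-v --fast\"",
                "  NAME = 'worker' ",
                "");
        for (Map.Entry<String, String> e : parse(sample).entrySet()) {
            System.out.println(e.getKey() + " == " + e.getValue());
        }
    }
}
```

Compiling the pattern once as a constant avoids rebuilding it for every line, which the per-line snippets above would otherwise do.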

M. Jessup
Great... it works beautifully. This is why I like SO so much. Great people writing great code.
rmarimon
Happy it was what you were looking for. A quick note: the pattern does allow unbalanced quotes (I originally had a typo saying it didn't, which I have fixed).
M. Jessup