Hi,

I'm trying to write a helper method that breaks down path expressions and would love some help. Please consider a path pattern like the following four (round brackets indicate predicates):

  1. item.sub_element.subsubelement(@key = string) ; or,
  2. item..subsub_element(@key = string) ; or,
  3. //subsub_element(@key = string) ; or,
  4. item(@key = string)

What would a regular expression that matches those look like?

What I have come up with is this:

 ((/{2}?[\\w+_*])(\\([_=@#\\w+\\*\\(\\)\\{\\}\\[\\]]*\\))?\\.{0,2})+

I'm reading this as: "match one or more occurrences of a string that consists of two groups: group one consists of one or more words with optional underscores and an optional double forward slash prefix; group two is optional and consists of at least one word with all other characters optional; the groups are trailed by zero to two dots."

However, a test run on the fourth example with Matcher.matches() returns false. So, where's my error?

Any ideas?

TIA,

FK

Edit: After experimenting with http://www.regexplanet.com/simple/index.html, it seems I wasn't aware of the difference between Matcher.matches() and Matcher.find(). I was trying to break the input string down into substrings that match my regex, so I need to use find(), not matches().

Edit2: This does the trick

([a-zA-Z0-9_]+)\.{0,2}(\(.*\))?
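To illustrate the difference, here is a minimal sketch that applies the Edit2 regex with both methods. The class name `FindVersusMatches`, the constant `STEP` and the helper `steps` are made up for this example and are not part of the original question. matches() only succeeds if the entire input fits the pattern once, while find() walks through the input and returns one match per path step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindVersusMatches {
    // The Edit2 regex, escaped for a Java string literal.
    static final Pattern STEP =
        Pattern.compile("([a-zA-Z0-9_]+)\\.{0,2}(\\(.*\\))?");

    // Collects "name -> predicate" for every step that find() locates.
    static List<String> steps(String path) {
        List<String> result = new ArrayList<>();
        Matcher m = STEP.matcher(path);
        while (m.find()) {
            // group(2) is null when the step has no predicate.
            result.add(m.group(1) + " -> " + m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        String path = "item.sub_element.subsubelement(@key = string)";

        // The whole string is not a single step, so matches() fails.
        System.out.println(STEP.matcher(path).matches()); // false

        // find() yields one match per step.
        System.out.println(steps(path));
        // [item -> null, sub_element -> null, subsubelement -> (@key = string)]
    }
}
```

Note that find() silently skips anything it cannot match (a leading `//`, for example), so it breaks a path down but does not validate it.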

+3  A: 

You may find this website useful for testing your regexes: http://www.fileformat.info/tool/regex.htm

As a general approach, try building the regex up from one that handles a simple case, write some tests and get it to pass. Then make the regex more complicated to handle the other cases as well. Make sure it passes both the original and the new tests.
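A minimal sketch of that approach, assuming a two-stage build-up. The class name, the stage patterns and the `check` helper are hypothetical and only meant to show the workflow. Stage 1 handles a single name; stage 2 adds dot-separated names and has to pass the old test as well as the new ones.

```java
import java.util.regex.Pattern;

public class IncrementalRegex {
    // Stage 1: a single name such as "item".
    static final Pattern STAGE_1 = Pattern.compile("\\w+");

    // Stage 2: names separated by one or two dots.
    static final Pattern STAGE_2 = Pattern.compile("\\w+(?:\\.{1,2}\\w+)*");

    // Fails loudly when a pattern does not behave as expected.
    static void check(Pattern pattern, String input, boolean expected) {
        boolean actual = pattern.matcher(input).matches();
        if (actual != expected) {
            throw new AssertionError(
                pattern + " on \"" + input + "\": expected " + expected);
        }
    }

    public static void main(String[] args) {
        // Tests for the simple case.
        check(STAGE_1, "item", true);
        check(STAGE_1, "item.sub_element", false);

        // The extended pattern must still pass the original test ...
        check(STAGE_2, "item", true);
        // ... and the new ones.
        check(STAGE_2, "item.sub_element", true);
        check(STAGE_2, "item..subsub_element", true);
        check(STAGE_2, "item...too_many_dots", false);

        System.out.println("all checks passed");
    }
}
```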

Tarski
Will do. Thanks.
FK82
A: 

There are so many things wrong with your pattern:

/{2}?: what do you think ? means here? Because if you think it makes /{2} optional, you're wrong. Instead, ? is a reluctant modifier for the {2} repetition. Perhaps something like (?:/{2})? is what you intend.

[\w+_*]: what do you think the + and * mean here? Because if you think they represent repetition, you're wrong. This is a character class definition, and + and * literally mean the characters + and *. Perhaps you intend... actually I'm not sure what you intend.
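Both points can be checked with a few calls to `Pattern.matches`. This is only a small sketch; the class name, the helper `matchesWhole` and the sample inputs are chosen for illustration.

```java
import java.util.regex.Pattern;

public class QuantifierPitfalls {
    // True only if the regex matches the entire input.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // {2}? is a reluctant "exactly two": the slashes are still required.
        System.out.println(matchesWhole("/{2}?item", "item"));       // false
        System.out.println(matchesWhole("/{2}?item", "//item"));     // true

        // Grouping the repetition makes the whole prefix optional.
        System.out.println(matchesWhole("(?:/{2})?item", "item"));   // true
        System.out.println(matchesWhole("(?:/{2})?item", "//item")); // true

        // Inside [...], + and * are literal characters ...
        System.out.println(matchesWhole("[\\w+_*]", "+"));           // true
        System.out.println(matchesWhole("[\\w+_*]", "*"));           // true
        // ... and the class matches exactly one character, not a word.
        System.out.println(matchesWhole("[\\w+_*]", "item"));        // false
        // The repetition belongs outside the class.
        System.out.println(matchesWhole("\\w+", "item"));            // true
    }
}
```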


Solution attempt

Here's an attempt at guessing what your spec is:

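    // Needs: import java.util.regex.Matcher; import java.util.regex.Pattern;

    // Write the pattern with readable placeholders, then expand them:
    // "word" -> \w+, " " -> \s* (optional whitespace),
    // "<<" and ">>" -> literal ( and ).
    String PART_REGEX =
        "(word)(?:<<@(word) = (word)>>)?"
            .replace("word", "\\w+")
            .replace(" ", "\\s*")
            .replace("<<", "\\(")
            .replace(">>", "\\)");
    // A whole path: optional leading //, then parts separated by one or two dots.
    Pattern entirePattern = Pattern.compile(
        "(?://)?part(?:\\.{1,2}part)*"
            .replace("part", PART_REGEX)
    );
    // A single part, used with find() to walk through an already validated path.
    Pattern partPattern = Pattern.compile(PART_REGEX);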

Then we can test it as follows:

    String[] tests = {
        "item.sub_element.subsubelement(@key = string)",
        "item..subsub_element(@key = string)",
        "//subsub_element(@key = string)",
        "item(@key = string)",
        "one.dot",
        "two..dots",
        "three...dots",
        "part1(@k1=v1)..part2(@k2=v2)",
        "whatisthis(@k=v1=v2)",
        "noslash",
        "/oneslash",
        "//twoslashes",
        "///threeslashes",
        "//multiple//double//slashes",
        "//multiple..double..dots",
        "..startingwithdots",
    };
    for (String test : tests) {
        System.out.println("[ " + test + " ]");
        if (entirePattern.matcher(test).matches()) {
            Matcher part = partPattern.matcher(test);
            while (part.find()) {
                System.out.printf("  [%s](%s => %s)%n",
                    part.group(1),
                    part.group(2),
                    part.group(3)
                );
            }
        }
    }

The above prints:

[ item.sub_element.subsubelement(@key = string) ]
  [item](null => null)
  [sub_element](null => null)
  [subsubelement](key => string)
[ item..subsub_element(@key = string) ]
  [item](null => null)
  [subsub_element](key => string)
[ //subsub_element(@key = string) ]
  [subsub_element](key => string)
[ item(@key = string) ]
  [item](key => string)
[ one.dot ]
  [one](null => null)
  [dot](null => null)
[ two..dots ]
  [two](null => null)
  [dots](null => null)
[ three...dots ]
[ part1(@k1=v1)..part2(@k2=v2) ]
  [part1](k1 => v1)
  [part2](k2 => v2)
[ whatisthis(@k=v1=v2) ]
[ noslash ]
  [noslash](null => null)
[ /oneslash ]
[ //twoslashes ]
  [twoslashes](null => null)
[ ///threeslashes ]
[ //multiple//double//slashes ]
[ //multiple..double..dots ]
  [multiple](null => null)
  [double](null => null)
  [dots](null => null)
[ ..startingwithdots ]


polygenelubricants
I actually stated my intention. I sure hope that my not being an expert is no hindrance to posting a question. '+' and '*' are reserved characters, I suppose, so they would need to be escaped if I wanted them to be captured as literals.
FK82
@FK82: I'm trying my best to help you. It's just close to impossible right now. Maybe others can figure out what you need, though.
polygenelubricants
Well thanks for your post, I'll look into it. Just as a remark though---with no offense in mind---if you do not understand my issue, and consequently helping out is impossible, why do you bother? It's a little mind-boggling.
FK82
@FK82: Questions on stackoverflow.com are often imprecise, hard to understand, or, in some cases, utter gibberish. =) I think it is a sign of a great community that people try to help regardless.
Jens
@ Jens: I agree. Thanks again.
FK82
+2  A: 

You misunderstand character classes, I think. I've found that for testing regular expressions, http://gskinner.com/RegExr/ is of great help. As a tutorial for regular expressions, I'd recommend http://www.regular-expressions.info/tutorial.html.

I am not entirely sure how you want to group your strings. Your sentence seems to suggest that your first group is just the item part of item..subsub_element(@key = string), but then I am not sure what the second group should be. Judging from what I can deduce from your regex, I'll just put the part before the brackets into group one and the part in the brackets into group two. You can surely modify this if I misunderstood you.

I don't escape the expression for Java, so you'd have to do that. =)

The first group should begin with an optional double slash. I use (?://)?. Here ?: means that this part should not be captured, and the last ? makes the group before it optional.

Following that, there are words consisting of letters and underscores, separated by dots. One such word (with trailing dots) can be represented as [a-zA-Z_]+\.{0,2}. The \w you use actually is a shortcut for [a-zA-Z0-9_], I think. It does NOT represent a word, but a "word character".

This last expression may be present multiple times, so the capturing expression for the first group looks like

((?://)?(?:[a-zA-Z_]+\.{0,2})+)

For the part in the brackets, one can use \([^)]*\), which means an opening bracket (escaped, since it has special meaning), followed by an arbitrary number of characters other than a closing bracket (not escaped, since it has no special meaning inside a character class), and then a closing bracket.

Combined with ^ and $ to mark the beginning and end of line respectively, we arrive at

^((?://)?(?:[a-zA-Z_]+\.{0,2})+)(\([^)]*\))$
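Escaped for Java, the combined expression could be tried out roughly like this. It is only a sketch: the class name `PathWithPredicate`, the constant `PATH` and the helper `split` are made up for the example. The last two inputs show two consequences of the pattern as written: the bracket part is mandatory, and digits are not allowed in names because the class is [a-zA-Z_] rather than \w.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PathWithPredicate {
    // The combined regex from above, with backslashes doubled for Java.
    static final Pattern PATH = Pattern.compile(
        "^((?://)?(?:[a-zA-Z_]+\\.{0,2})+)(\\([^)]*\\))$");

    // Returns {path, predicate}, or null when the input does not match.
    static String[] split(String input) {
        Matcher m = PATH.matcher(input);
        return m.matches() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String[] inputs = {
            "item.sub_element.subsubelement(@key = string)",
            "item..subsub_element(@key = string)",
            "//subsub_element(@key = string)",
            "item(@key = string)",
            "item",                  // no bracket part: no match
            "item2(@key = string)",  // digit in the name: no match
        };
        for (String input : inputs) {
            String[] parts = split(input);
            System.out.println(input + " => "
                + (parts == null ? "no match" : parts[0] + " | " + parts[1]));
        }
    }
}
```

If paths without a predicate should match too, the second group would have to be made optional; whether that is wanted depends on the requirements.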

If I misunderstood your requirements and you need help with those, please ask in the comments.

Jens
+1 for the effort. And yes, OP certainly misunderstood quite a few things.
polygenelubricants
@ Jens: nearly perfect. Thanks for the links, regexes are kind of hard to learn. One addition: the expression in brackets may itself contain brackets (for nested predicates). More specifically, I want to allow `[a-zA-Z0-9_@=\\(\\)\{\}\\[\\]]*`. Would `^((?://)?(?:[a-zA-Z_]+\.{0,2})+)(\([a-zA-Z0-9_@=\\(\\)\{\}\\[\\]]*\))$` possibly work too?
FK82
@FK82: You do not need to escape the round and curly brackets, and some regex implementations don't like it when you do. Dunno about Java. You need to escape the outer brackets, though. Probably `(\(.*\))` would be OK for your second group.
Jens
@ Jens: Alright, thanks again.
FK82
Btw, brackets generally must be escaped in Java judging from the `Pattern` documentation (http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html)
FK82
Inside a character class, parentheses and braces don't need to be escaped; `[a-zA-Z0-9_@=(){}\\[\\]]` works the same as `[a-zA-Z0-9_@=\\(\\)\{\}\\[\\]]`. In most flavors you wouldn't have to escape the left square bracket either, but in Java you do.
Alan Moore
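A quick way to check the equivalence described in the comment above, as a minimal sketch (the class and constant names are invented for illustration). Both classes are written as Java string literals, so every regex backslash is doubled.

```java
import java.util.regex.Pattern;

public class CharacterClassEscapes {
    // Parentheses and braces left unescaped inside the class.
    static final Pattern UNESCAPED =
        Pattern.compile("[a-zA-Z0-9_@=(){}\\[\\]]*");

    // The same class with parentheses and braces escaped.
    static final Pattern ESCAPED =
        Pattern.compile("[a-zA-Z0-9_@=\\(\\)\\{\\}\\[\\]]*");

    public static void main(String[] args) {
        String nested = "@key=(a){b}[c]";
        System.out.println(UNESCAPED.matcher(nested).matches()); // true
        System.out.println(ESCAPED.matcher(nested).matches());   // true

        // Neither class contains a space, so a predicate body written
        // with spaces is rejected by both.
        String spaced = "@key = string";
        System.out.println(UNESCAPED.matcher(spaced).matches()); // false
        System.out.println(ESCAPED.matcher(spaced).matches());   // false
    }
}
```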
@ Alan Moore: +1 for helpful comment.
FK82