ansaurus

Question

How do I write a regular expression for these path expressions?

Answer 1

+3 A:

You may find this website useful for testing your regexes: http://www.fileformat.info/tool/regex.htm

As a general approach, start with a regex that handles a simple case, write some tests, and get them to pass. Then extend the regex to handle the other cases, making sure it still passes both the original and the new tests.
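The incremental approach could be sketched in Java roughly like this. The class name, both patterns, and the sample inputs are illustrative assumptions, not taken from the question.

```java
import java.util.regex.Pattern;

class IncrementalRegexCheck {
    // Step 1 (assumed simple case): a single segment of word characters.
    static final Pattern SIMPLE = Pattern.compile("\\w+");

    // Step 2: the same regex extended to segments separated by one dot.
    static final Pattern EXTENDED = Pattern.compile("\\w+(?:\\.\\w+)*");

    // True if the whole input matches the pattern.
    static boolean accepts(Pattern pattern, String input) {
        return pattern.matcher(input).matches();
    }

    public static void main(String[] args) {
        // Original test: must pass before and after the extension.
        System.out.println(accepts(SIMPLE, "item"));
        System.out.println(accepts(EXTENDED, "item"));
        // New test: only the extended regex is expected to handle it.
        System.out.println(accepts(SIMPLE, "item.sub_element"));
        System.out.println(accepts(EXTENDED, "item.sub_element"));
    }
}
```

Keeping the old inputs in the test list is what catches regressions when the pattern grows.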

Tarski 2010-07-01 09:33:44

Will do. Thanks.

FK82 2010-07-01 10:21:15

Answer 2

A:

There are so many things wrong with your pattern:

/{2}?: what do you think ? means here? Because if you think it makes /{2} optional, you're wrong. Instead, ? is a reluctant modifier for the {2} repetition, so two slashes are still required. Perhaps something like (?:/{2})? is what you intend.
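A small sketch of the difference, assuming a trailing \w+ segment purely for illustration (the class and field names are made up):

```java
import java.util.regex.Pattern;

class ReluctantVersusOptional {
    // {2}? is a reluctant "exactly two": the slashes are still required.
    static final Pattern RELUCTANT = Pattern.compile("/{2}?\\w+");

    // Wrapping the repetition in a non-capturing group makes it optional.
    static final Pattern OPTIONAL = Pattern.compile("(?:/{2})?\\w+");

    // True if the whole input matches the pattern.
    static boolean accepts(Pattern pattern, String input) {
        return pattern.matcher(input).matches();
    }

    public static void main(String[] args) {
        for (String input : new String[] {"//item", "item"}) {
            System.out.println(input
                + " reluctant=" + accepts(RELUCTANT, input)
                + " optional=" + accepts(OPTIONAL, input));
        }
    }
}
```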

[\w+_*] : what do you think the + and * mean here? Because if you think they represent repetition, you're wrong. This is a character class definition, and + and * literally mean the characters + and *. Perhaps you intend... actually I'm not sure what you intend.
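To see this for yourself, a quick check along these lines might help. The names are hypothetical, and \w+ is only a guess at what a repeated version could look like:

```java
import java.util.regex.Pattern;

class CharacterClassLiterals {
    // Inside [...], + and * are ordinary characters, not quantifiers,
    // so this class matches exactly one character.
    static final Pattern ONE_CHAR = Pattern.compile("[\\w+_*]");

    // A quantifier has to sit outside the class to mean repetition.
    static final Pattern REPEATED = Pattern.compile("\\w+");

    // True if the whole input matches the pattern.
    static boolean accepts(Pattern pattern, String input) {
        return pattern.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(accepts(ONE_CHAR, "+"));           // literal plus
        System.out.println(accepts(ONE_CHAR, "*"));           // literal star
        System.out.println(accepts(ONE_CHAR, "ab"));          // two characters
        System.out.println(accepts(REPEATED, "sub_element")); // repeated word chars
    }
}
```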

Solution attempt

Here's an attempt at guessing what your spec is:

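    // Needs java.util.regex.Pattern (and java.util.regex.Matcher for the test loop).
    // Readable placeholders are expanded into real regex syntax:
    //   "word" -> \w+,  " " -> \s*,  "<<" -> \(,  ">>" -> \)
    String PART_REGEX =
        "(word)(?:<<@(word) = (word)>>)?"
            .replace("word", "\\w+")
            .replace(" ", "\\s*")
            .replace("<<", "\\(")
            .replace(">>", "\\)");
    // A whole path: optional leading "//", then parts separated by one or two dots.
    Pattern entirePattern = Pattern.compile(
        "(?://)?part(?:\\.{1,2}part)*"
            .replace("part", PART_REGEX)
    );
    // A single part, used to iterate over the pieces of an accepted path.
    Pattern partPattern = Pattern.compile(PART_REGEX);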

Then we can test it as follows:

    String[] tests = {
        "item.sub_element.subsubelement(@key = string)",
        "item..subsub_element(@key = string)",
        "//subsub_element(@key = string)",
        "item(@key = string)",
        "one.dot",
        "two..dots",
        "three...dots",
        "part1(@k1=v1)..part2(@k2=v2)",
        "whatisthis(@k=v1=v2)",
        "noslash",
        "/oneslash",
        "//twoslashes",
        "///threeslashes",
        "//multiple//double//slashes",
        "//multiple..double..dots",
        "..startingwithdots",
    };
    for (String test : tests) {
        System.out.println("[ " + test + " ]");
        if (entirePattern.matcher(test).matches()) {
            Matcher part = partPattern.matcher(test);
            while (part.find()) {
                System.out.printf("  [%s](%s => %s)%n",
                    part.group(1),
                    part.group(2),
                    part.group(3)
                );
            }
        }
    }

The above prints:

[ item.sub_element.subsubelement(@key = string) ]
  [item](null => null)
  [sub_element](null => null)
  [subsubelement](key => string)
[ item..subsub_element(@key = string) ]
  [item](null => null)
  [subsub_element](key => string)
[ //subsub_element(@key = string) ]
  [subsub_element](key => string)
[ item(@key = string) ]
  [item](key => string)
[ one.dot ]
  [one](null => null)
  [dot](null => null)
[ two..dots ]
  [two](null => null)
  [dots](null => null)
[ three...dots ]
[ part1(@k1=v1)..part2(@k2=v2) ]
  [part1](k1 => v1)
  [part2](k2 => v2)
[ whatisthis(@k=v1=v2) ]
[ noslash ]
  [noslash](null => null)
[ /oneslash ]
[ //twoslashes ]
  [twoslashes](null => null)
[ ///threeslashes ]
[ //multiple//double//slashes ]
[ //multiple..double..dots ]
  [multiple](null => null)
  [double](null => null)
  [dots](null => null)
[ ..startingwithdots ]

Attachments

Source code and output on ideone.com

polygenelubricants 2010-07-01 09:58:29

I actually stated my intention. I sure hope that my not being an expert is no hindrance to posting a question. '+' and '*' are reserved characters, I suppose, so they would need to be escaped if I wanted them to be captured as literals.

FK82 2010-07-01 10:14:46

@FK82: I'm trying my best to help you. It's just close to impossible right now. Maybe others can figure out what you need, though.

polygenelubricants 2010-07-01 10:25:33

Well thanks for your post, I'll look into it. Just as a remark though---with no offense in mind---if you do not understand my issue, and consequently helping out is impossible, why do you bother? It's a little mind-boggling.

FK82 2010-07-01 10:39:55

@FK82: Questions on stackoverflow.com are often imprecise, hard to understand, or, in some cases, utter gibberish. =) I think it is a sign of a great community that people try to help regardless.

Jens 2010-07-01 10:53:50

@ Jens: I agree. Thanks again.

FK82 2010-07-01 11:01:18

Answer 3

+2 A:

You misunderstand character classes, I think. I've found that for testing regular expressions, http://gskinner.com/RegExr/ is of great help. As a tutorial for regular expressions, I'd recommend http://www.regular-expressions.info/tutorial.html.

I am not entirely sure how you want to group your strings. Your sentence seems to suggest that your first group is just the item part of item..subsub_element(@key = string), but then I am not sure what the second group should be. Judging from what I can deduce from your regex, I'll just put the part before the brackets into group one and the part in the brackets into group two. You can surely modify this if I misunderstood you.

I don't escape the expression for Java, so you'd have to do that. =)

The first group should begin with an optional double slash. I use (?://)?. Here ?: means that this part should not be captured, and the last ? makes the group before it optional.

Following that, there are words, containing characters and underscores, grouped by dots. One such word (with trailing dots) can be represented as [a-zA-Z_]+\.{0,2}. The \w you use actually is a shortcut for [a-zA-Z0-9_], I think. It does NOT represent a word, but a "word character".

This last expression may be present multiple times, so the capturing expression for the first group looks like

((?://)?(?:[a-zA-Z_]+\.{0,2})+)
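As a side note on \w versus the [a-zA-Z_] class used above, here is a minimal sketch (hypothetical names and inputs) showing that \w stands for a single word character and that, unlike [a-zA-Z_], it also accepts digits:

```java
import java.util.regex.Pattern;

class WordCharacterCheck {
    // \w is one "word character", not a whole word.
    static final Pattern WORD_CHAR = Pattern.compile("\\w");

    // \w+ is a run of word characters: letters, digits, underscore.
    static final Pattern WORD_CHARS = Pattern.compile("\\w+");

    // The narrower class from the expression above: no digits.
    static final Pattern LETTERS_UNDERSCORE = Pattern.compile("[a-zA-Z_]+");

    // True if the whole input matches the pattern.
    static boolean accepts(Pattern pattern, String input) {
        return pattern.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(accepts(WORD_CHAR, "ab"));             // more than one char
        System.out.println(accepts(WORD_CHARS, "part1"));         // digits allowed
        System.out.println(accepts(LETTERS_UNDERSCORE, "part1")); // digits not allowed
    }
}
```

If segment names may contain digits, swapping [a-zA-Z_] for \w in the group would be one way to allow them.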

For the part in the brackets, one can use \([^)]*\), which means an opening bracket (escaped, since it has special meaning), followed by an arbitrary number of non-brackets (not escaped, since it has no special meaning inside a character class), and then a closing bracket.

Combined with ^ and $ to mark the beginning and end of line respectively, we arrive at

^((?://)?(?:[a-zA-Z_]+\.{0,2})+)(\([^)]*\))$
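Escaped for Java, this pattern might be used as in the sketch below. The class name, the split helper, and the sample inputs are assumptions for illustration; only the pattern itself comes from above, with each backslash doubled for the string literal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PathExpressionSplitter {
    // The pattern above, with every backslash doubled for a Java string literal.
    static final Pattern PATH = Pattern.compile(
        "^((?://)?(?:[a-zA-Z_]+\\.{0,2})+)(\\([^)]*\\))$");

    // Returns {path part, bracketed part}, or null when the input does not match.
    static String[] split(String input) {
        Matcher m = PATH.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] {m.group(1), m.group(2)};
    }

    public static void main(String[] args) {
        String[] groups = split("item..subsub_element(@key = string)");
        System.out.println(groups[0]); // group one: the part before the brackets
        System.out.println(groups[1]); // group two: the brackets and their content
    }
}
```

Note that the bracketed part is mandatory in this pattern, so an input such as item without a predicate is rejected.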

If I misunderstood your requirements and you need help with those, please ask in the comments.

Jens 2010-07-01 10:40:18

+1 for the effort. And yes, OP certainly misunderstood quite a few things.

polygenelubricants 2010-07-01 10:45:57

@ Jens: nearly perfect. Thanks for the links, regexes are kind of hard to learn. Another addition: actually the expression in brackets may contain brackets too (for nested predicates). More specifically I want to allow `[a-zA-Z0-9_@=\\(\\)\{\}\\[\\]]*` . Would `^((?://)?(?:[a-zA-Z_]+\.{0,2})+)(\([a-zA-Z0-9_@=\\(\\)\{\}\\[\\]]*\))$` possibly work too?

FK82 2010-07-01 10:55:45

@FK82: You do not need to escape the round and curly brackets and some regex implementations don't like when you do. Dunno about Java. You need to escape the outer brackets, though. Probably `(\(.*\))` would be ok for your second group.
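A rough Java comparison of the two second-group variants on a nested predicate (the class name and the sample input are made up for illustration):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NestedBracketCheck {
    // Second group stops at the first closing bracket.
    static final Pattern NO_NESTING = Pattern.compile(
        "^((?://)?(?:[a-zA-Z_]+\\.{0,2})+)(\\([^)]*\\))$");

    // Second group is greedy: everything up to the last closing bracket.
    static final Pattern GREEDY = Pattern.compile(
        "^((?://)?(?:[a-zA-Z_]+\\.{0,2})+)(\\(.*\\))$");

    // Returns the bracketed part (group two), or null if there is no match.
    static String bracketPart(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.matches() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String nested = "item(@key = f(x))";
        System.out.println(bracketPart(NO_NESTING, nested));
        System.out.println(bracketPart(GREEDY, nested));
    }
}
```

The greedy variant accepts nested brackets but does not check that they are balanced; a regex of this kind cannot do that in general.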

Jens 2010-07-01 11:03:40

@ Jens: Alright, thanks again.

FK82 2010-07-01 11:19:43

Btw, brackets generally must be escaped in Java judging from the `Pattern` documentation (http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html)

FK82 2010-07-01 11:25:59

Inside a character class, parentheses and braces don't need to be escaped; `[a-zA-Z0-9_@=(){}\\[\\]]` works the same as `[a-zA-Z0-9_@=\\(\\)\{\}\\[\\]]`. In most flavors you wouldn't have to escape the left square bracket either, but in Java you do.
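A quick way to convince yourself, written as Java string literals with a * quantifier added so whole strings can be tested (names and inputs are illustrative assumptions):

```java
import java.util.regex.Pattern;

class ClassEscapeCheck {
    // Parentheses and braces left unescaped inside the class.
    static final Pattern UNESCAPED =
        Pattern.compile("[a-zA-Z0-9_@=(){}\\[\\]]*");

    // The same class with parentheses and braces escaped.
    static final Pattern ESCAPED =
        Pattern.compile("[a-zA-Z0-9_@=\\(\\)\\{\\}\\[\\]]*");

    // True if the whole input matches the pattern.
    static boolean accepts(Pattern pattern, String input) {
        return pattern.matcher(input).matches();
    }

    public static void main(String[] args) {
        String sample = "@key=f(x){y}[z]";
        System.out.println(accepts(UNESCAPED, sample));
        System.out.println(accepts(ESCAPED, sample));
    }
}
```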

Alan Moore 2010-07-02 06:16:28

@ Alan Moore: +1 for helpful comment.

FK82 2010-07-02 13:09:59
