I'm trying to extract the page name and query string from a URL, where the page name must not contain .html.

Here is some example code in Java:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TestRegex {
    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("/test/(((?!\\.html).)+)\\?(.+)");
        Matcher matcher = pattern.matcher("/test/page?param=value");
        System.out.println(matcher.matches());
        System.out.println(matcher.group(1));
        System.out.println(matcher.group(2));
    }
}

Running this code produces the following output:

true
page
e

What is wrong with my regex that makes the second group contain the letter e instead of param=value?

+2  A: 

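Your pattern has three capturing groups, so the query string ends up in group 3. Group 2 is the inner group that is repeated by +, and a repeated capturing group keeps only its last iteration, which is the e of page: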

Pattern.compile("/test/(((?!\\.html).)+)\\?(.+)")
//                     ^^            ^ ^   ^  ^
//                     ||            | |   |  |
//                     |+------2-----+ |   +-3+
//                     |               |  
//                     +-------1-------+                  
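
To see the numbering in practice, here is a minimal sketch that prints every group the original pattern captures. The class name GroupNumberingDemo and the groups helper are made up for illustration and are not part of the question's code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupNumberingDemo {
    // Hypothetical helper: returns groups 1..groupCount() of a full match,
    // or null when the input does not match the regex at all.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return null;
        }
        String[] out = new String[m.groupCount()];
        for (int i = 1; i <= m.groupCount(); i++) {
            out[i - 1] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        // The original pattern from the question, with three capturing groups.
        String[] g = groups("/test/(((?!\\.html).)+)\\?(.+)", "/test/page?param=value");
        for (int i = 0; i < g.length; i++) {
            System.out.println("group " + (i + 1) + ": " + g[i]);
        }
        // group 1: page
        // group 2: e
        // group 3: param=value
    }
}
```

Groups are numbered by the position of their opening parenthesis, counted from left to right, which is why the query string lands in group 3.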

Try:

Pattern.compile("/test/((?:(?!\\.html).)+)\\?(.+)")
//                     ^                 ^   ^  ^
//                     |                 |   |  |
//                     |                 |   +-2+
//                     |                 |  
//                     +--------1--------+  

In other words: (?:...) makes the inner group non-capturing, so the query string becomes group 2.
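
As a rough check of the corrected pattern, the sketch below wraps it in a small helper and also tries a URL whose page name contains .html. The class name FixedRegexDemo and the extract helper are assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FixedRegexDemo {
    // Group 1: page name, where ".html" may not start at any position.
    // Group 2: query string. The (?:...) wrapper around the lookahead
    // and the dot is non-capturing, so it does not get a group number.
    private static final Pattern FIXED =
            Pattern.compile("/test/((?:(?!\\.html).)+)\\?(.+)");

    // Hypothetical helper: returns {page, query}, or null if the URL does not match.
    static String[] extract(String url) {
        Matcher m = FIXED.matcher(url);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = extract("/test/page?param=value");
        System.out.println(parts[0]); // page
        System.out.println(parts[1]); // param=value

        // A page name containing ".html" is rejected by the negative lookahead.
        System.out.println(extract("/test/page.html?param=value") == null); // true
    }
}
```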

Bart Kiers
wow! amazingly comprehensive!
Ivan Yatskevich
@Ivan, good to hear that! :)
Bart Kiers