I'm trying to parse an HTTP GET request to determine if the url contains any of a number of file types. If it does, I want to capture the entire request. There is something I don't understand about ORing.

The following regular expression only captures part of it, and only if .flv is the first in the list of ORed values.

(I've obscured the urls with spaces because Stackoverflow limits hyperlinks)

regex:

GET.*?(\.flv)|(\.mp4)|(\.avi).*?

test text:

GET http: // foo.server.com/download/0/37/3000016511/.flv?mt=video/xy

match output:

GET http: // foo.server.com/download/0/37/3000016511/.flv

I don't understand why the .*? at the end of the regex isn't allowing it to capture the entire text. If I get rid of the ORing of file types, then it works.

Here is the test code in case my explanation doesn't make sense:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTest {  // class name is illustrative
    public static void main(String[] args) {
        String sourcestring = "GET http: // foo.server.com/download/0/37/3000016511/.flv?mt=video/xy";

        // This works:
        Pattern re = Pattern.compile("GET .*?\\.flv.*");
        // output:
        // [0][0] = GET http: // foo.server.com/download/0/37/3000016511/.flv?mt=video/xy

        // The match from the following ends with ".flv" instead of covering the entire URL.
        // It also only works if .flv is the first of the 3 ORed options.
        //Pattern re = Pattern.compile("GET .*?(\\.flv)|(\\.mp4)|(\\.avi).*?");
        // output:
        // [0][0] = GET http: // foo.server.com/download/0/37/3000016511/.flv
        // [0][1] = .flv
        // [0][2] = null
        // [0][3] = null

        // Print every group of every match as [matchIndex][groupIndex] = text
        Matcher m = re.matcher(sourcestring);
        int mIdx = 0;
        while (m.find()) {
            for (int groupIdx = 0; groupIdx <= m.groupCount(); groupIdx++) {
                System.out.println("[" + mIdx + "][" + groupIdx + "] = " + m.group(groupIdx));
            }
            mIdx++;
        }
    }
}

+4  A: 

You have your grouping wrong. The | needs to be inside the parentheses:

GET.*?(\.flv|\.mp4|\.avi).*?

I'm also not sure why you have the ? on the end of the final .*?. In most languages, the ? here makes the * non-greedy, so it matches as few characters as possible, while not preventing the pattern from matching. In this case that would mean it matches no characters, since nothing follows it, so you probably want to remove that final ?.

GET .*?(\.flv|\.mp4|\.avi).*
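To see the effect of that final ?, here is a minimal Java sketch that runs both variants against the URL from the question (with the spaces around the // removed). The class name TrailingQuantifierDemo and the helper firstMatch are made up for illustration; only java.util.regex is assumed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TrailingQuantifierDemo {
    // Hypothetical helper: returns the whole match (group 0) of the first find(), or null.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String request = "GET http://foo.server.com/download/0/37/3000016511/.flv?mt=video/xy";

        // Reluctant trailing quantifier: nothing follows it, so it matches zero characters
        // and the match stops right after the extension.
        System.out.println(firstMatch("GET .*?(\\.flv|\\.mp4|\\.avi).*?", request));
        // prints: GET http://foo.server.com/download/0/37/3000016511/.flv

        // Greedy trailing quantifier: consumes the rest of the line.
        System.out.println(firstMatch("GET .*?(\\.flv|\\.mp4|\\.avi).*", request));
        // prints: GET http://foo.server.com/download/0/37/3000016511/.flv?mt=video/xy
    }
}
```

Both patterns match; the only difference is how far the overall match extends past the extension.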
Andy Mortimer
+1. It never makes sense to have a reluctant quantifier as the last thing in a regex. Not that it mattered in this case; thanks to faulty grouping, that part of the regex was never even reached.
Alan Moore
It's possible I started with that. Anyway, I tried it and now get no matches, either at myregextester dot com or in the java code.
Hmm, it works for me at myregextester.com, with source text "GET http: // foo.server.com/download/0/37/3000016511/.flv?mt=video/xy" (with the spaces around the // removed) and regex "GET .*?(\.flv|\.mp4|\.avi).*". I get two groups, one with the whole string, and the other containing just the extension.
Andy Mortimer
Yeh. It's working for me now, too. I must have fat-fingered it somehow before. Thanks much for the help!
A: 

First of all, your regex reads like this:

GET.*?(\.flv)  |  (\.mp4)  |  (\.avi).*?

(spaces added for clarity). Try it like this:

GET.*?(\.flv|\.mp4|\.avi).*?
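This also explains why the original only seemed to work when .flv came first: | has the lowest precedence, so only the first alternative includes the GET.*? prefix. A rough Java sketch of the difference follows; the .mp4 URL, the class name AlternationDemo and the helper wholeMatch are invented for the example and are not from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationDemo {
    // Hypothetical helper: returns the whole match (group 0) of the first find(), or null.
    static String wholeMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Made-up request whose extension is NOT the first alternative.
        String request = "GET http://foo.server.com/clip.mp4?mt=video/xy";

        // Three separate top-level alternatives: "GET.*?(\.flv)", "(\.mp4)", "(\.avi).*?".
        // Only the bare "(\.mp4)" alternative can match here, so GET is not part of the match.
        System.out.println(wholeMatch("GET.*?(\\.flv)|(\\.mp4)|(\\.avi).*?", request));
        // prints: .mp4

        // With the | inside one group, the GET.*? prefix applies to every extension.
        System.out.println(wholeMatch("GET.*?(\\.flv|\\.mp4|\\.avi).*?", request));
        // prints: GET http://foo.server.com/clip.mp4
    }
}
```

The second match still ends at the extension because the trailing .*? is reluctant and has nothing after it to reach for.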
Jakob Kruse
A: 

As an aside, I find this tool to be really great when working with regular expressions: The Regex Coach

Rippy