Hi,

I need one or more regular expressions to match some invalid URLs of a website: those that have uppercase letters before OR after a certain pattern.

These are the structure rules to match the invalid URLs:

  • a defined website
  • zero or more uppercase letters (at least one if there are none after the pattern)
  • a pattern
  • zero or more uppercase letters (at least one if there are none before the pattern)

To be explicit with examples:

http://website/uppeRcase/pattern/upperCase         // match it, uppercase before and after pattern
http://otherweb/WhatevercAse/pattern/whatevercase  // do not match, no website
http://website/lowercase/pattern/lowercase         // do not match, no uppercase before or after pattern
http://website/lowercase/pattern/uppercasE         // match it, uppercase after pattern
http://website/Uppercase/pattern/lowercase         // match it, uppercase before pattern
http://website/WhatevercAse/asdasd/whatEveRcase    // do not match it, no pattern

Thanks in advance for your help!

Mario

A: 

To match uppercase letters you simply need [A-Z]. Then build the rest of your rules around that. Without knowing exactly what you mean by "website" and "pattern", it is difficult to give better guidance.

This expression will match only if uppercase characters appear both between "website" and "pattern" and after "pattern":

^http://website/.*[A-Z]+.*/pattern/.*[A-Z]+.*$

This expression will match if uppercase characters appear on either side of "pattern" (before, after, or both):

^http://website/(.*[A-Z]+.*/pattern/.*[A-Z]+.*|.*[A-Z]+.*/pattern/.*|.*/pattern/.*[A-Z]+.*)$
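As a quick sanity check, the second expression can be run against the six example URLs from the question. This is only a minimal Python sketch: the names EITHER_SIDE, EXAMPLES and matches are mine, and "website" and "pattern" are still the literal placeholders used above.

```python
import re

# The "either side" expression from this answer, unchanged.
EITHER_SIDE = re.compile(
    r"^http://website/(.*[A-Z]+.*/pattern/.*[A-Z]+.*"
    r"|.*[A-Z]+.*/pattern/.*"
    r"|.*/pattern/.*[A-Z]+.*)$"
)

# The example URLs from the question, with the expected outcome for each.
EXAMPLES = [
    ("http://website/uppeRcase/pattern/upperCase", True),
    ("http://otherweb/WhatevercAse/pattern/whatevercase", False),
    ("http://website/lowercase/pattern/lowercase", False),
    ("http://website/lowercase/pattern/uppercasE", True),
    ("http://website/Uppercase/pattern/lowercase", True),
    ("http://website/WhatevercAse/asdasd/whatEveRcase", False),
]


def matches(url):
    """Return True if the URL is flagged as invalid by the expression."""
    return EITHER_SIDE.match(url) is not None


if __name__ == "__main__":
    for url, expected in EXAMPLES:
        print(url, matches(url), "(expected %s)" % expected)
```

Note that the first alternative is already covered by the other two, so dropping it should not change which URLs match.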


UPDATE:

To @TokenMacGuy's point, RegEx parsing of URLs can be very tricky. If you want to break a URL into its parts and then validate them, you can start with this expression, which should match and group most* URLs.

(?<protocol>(http|ftp|https|ftps):\/\/)?(?<site>[\w\-_\.]+\.(?<tld>([0-9]{1,3})|([a-zA-Z]{2,3})|(aero|arpa|asia|coop|info|jobs|mobi|museum|name|travel))+(?<port>:[0-9]+)?\/?)((?<resource>[\w\-\.,@^%:/~\+#]*[\w\-\@^%/~\+#])(?<queryString>(\?[a-zA-Z0-9\[\]\-\._+%\$#\~',/]*=[a-zA-Z0-9\[\]\-\._+%\$#\~',/]*)+(&[a-zA-Z0-9\[\]\-\._+%\$#\~',/]*=[a-zA-Z0-9\[\]\-\._+%\$#\~',/]*)*)?)?

*It worked in all my tests, but I can't claim they were exhaustive.

Brad
@brad, the website is a website such as "myintranet.mycompany.com", and the pattern is a common folder name such as "upload"
Mario
The path component of URLs can be case-sensitive. Host names are not.
novalis
@novalis, duly noted, and edited.
Brad
@brad: thanks a lot, it works perfectly! It is an indexing system and I can't do more than adding some regular expressions to identify these invalid URLs - the correct ones are all lowercase.
Mario
+1  A: 

I'd advise against doing the two things you are describing with a regular expression in one step. Use a URL parsing library to extract the path and hostname components separately. You want to do this for a couple of reasons. There can be some surprising stuff in the host portion of the URL that can throw you off. For instance, the hostname of

http://website@otherweb/uppeRcase/pattern/upperCase

is actually otherweb, and should be excluded, even though the URL begins with website. Similarly:

http://website/actual/path/component?uppeRcase/pattern/upperCase

should be excluded, even though the URL has the pattern surrounded by uppercase path components, because the matching region is in the query string, not the path.

http://website/uppe%52case/%70attern/upper%43ase

is actually the same resource as your first example, but contains escapes that might prevent a regex from noticing it.

Once you've extracted and converted the escape sequences of just the path component, though, a regex is probably a great tool to use.
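A minimal sketch of that two-step approach in Python, using the standard library's urllib.parse. The function name is_invalid and its site and pattern defaults are placeholders I made up for illustration; treat it as a starting point under those assumptions rather than a complete validator.

```python
import re
from urllib.parse import urlsplit, unquote


def is_invalid(url, site="website", pattern="pattern"):
    """Flag URLs on `site` whose path has uppercase before or after /pattern/.

    `site` and `pattern` are hypothetical placeholders for the real
    host name and folder name.
    """
    parts = urlsplit(url)

    # urlsplit strips any "user@" prefix and lowercases the host name.
    if parts.hostname != site.lower():
        return False

    # Decode %XX escapes in the path only; the query string is ignored.
    path = unquote(parts.path)

    # Uppercase somewhere before /pattern/, or somewhere after it.
    folder = re.escape(pattern)
    path_re = re.compile(r"[A-Z].*/%s/|/%s/.*[A-Z]" % (folder, folder))
    return path_re.search(path) is not None


if __name__ == "__main__":
    tricky = [
        "http://website@otherweb/uppeRcase/pattern/upperCase",
        "http://website/actual/path/component?uppeRcase/pattern/upperCase",
        "http://website/uppe%52case/%70attern/upper%43ase",
    ]
    for url in tricky:
        print(url, is_invalid(url))
```

With these assumptions, the first two URLs above are not flagged (wrong host, and a match only in the query string), while the escaped one is flagged because its decoded path is /uppeRcase/pattern/upperCase.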

TokenMacGuy
@TokenMacGuy: very good remarks. Fortunately, in my case the URLs are consistent. It is true that the second case is something that can happen, thanks a lot for that!
Mario