ansaurus

Question

Regexp matching a string pattern surrounded by capital letters

Answer 1

A:

To match uppercase letters you simply need [A-Z]; then build the rest of your rules around that. Without knowing exactly what you mean by "website" and "pattern", it is difficult to give better guidance.

This expression will match if uppercase characters appear both between "website" and "pattern" and after "pattern":

^http://website/.*[A-Z]+.*/pattern/.*[A-Z]+.*$
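
As a quick sanity check, here is a minimal Python sketch that runs the expression above against a few sample URLs. The sample URLs and the name BOTH_SIDES are made up for illustration; "website" and "pattern" are still the literal placeholders from the question.

```python
import re

# The expression from the answer, unchanged.
BOTH_SIDES = re.compile(r"^http://website/.*[A-Z]+.*/pattern/.*[A-Z]+.*$")

# Hypothetical sample URLs, not taken from the original question.
urls = [
    "http://website/uppeRcase/pattern/upperCase",  # uppercase on both sides
    "http://website/uppeRcase/pattern/lowercase",  # uppercase before only
    "http://website/lowercase/pattern/lowercase",  # all lowercase
]

# True only where an uppercase letter appears on both sides of /pattern/.
matches = [bool(BOTH_SIDES.match(u)) for u in urls]
print(matches)
```

Only the first URL should match, since the other two lack an uppercase letter on at least one side of /pattern/.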

This expression will match if uppercase characters appear on either side of "pattern" (or on both):

^http://website/(.*[A-Z]+.*/pattern/.*[A-Z]+.*|.*[A-Z]+.*/pattern/.*|.*/pattern/.*[A-Z]+.*)$
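
A similar sketch for the "either side" expression, again in Python and again with invented sample URLs (EITHER_SIDE and is_flagged are hypothetical names). The pattern is only split across lines for readability; it is the same expression as above.

```python
import re

# The "either side" expression from the answer, split over three lines.
EITHER_SIDE = re.compile(
    r"^http://website/(.*[A-Z]+.*/pattern/.*[A-Z]+.*"
    r"|.*[A-Z]+.*/pattern/.*"
    r"|.*/pattern/.*[A-Z]+.*)$"
)

def is_flagged(url):
    """Return True if the URL has uppercase before or after /pattern/."""
    return bool(EITHER_SIDE.match(url))

# Hypothetical sample URLs for illustration.
print(is_flagged("http://website/uppeRcase/pattern/lowercase"))
print(is_flagged("http://website/lowercase/pattern/upperCase"))
print(is_flagged("http://website/lowercase/pattern/lowercase"))
```

Note that the first alternative is already covered by the second one, so dropping it should not change which URLs match.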

UPDATE:

To @TokenMacGuy's point, RegEx parsing of URLs can be very tricky. If you want to break a URL into parts and then validate them, you can start with this expression, which should match and group most* URLs.

(?<protocol>(http|ftp|https|ftps):\/\/)?(?<site>[\w\-_\.]+\.(?<tld>([0-9]{1,3})|([a-zA-Z]{2,3})|(aero|arpa|asia|coop|info|jobs|mobi|museum|name|travel))+(?<port>:[0-9]+)?\/?)((?<resource>[\w\-\.,@^%:/~\+#]*[\w\-\@^%/~\+#])(?<queryString>(\?[a-zA-Z0-9\[\]\-\._+%\$#\~',/]*=[a-zA-Z0-9\[\]\-\._+%\$#\~',/]*)+(&[a-zA-Z0-9\[\]\-\._+%\$#\~',/]*=[a-zA-Z0-9\[\]\-\._+%\$#\~',/]*)*)?)?

*It worked in all my tests, but I can't claim they were exhaustive.

Brad 2010-10-20 20:29:01

@brad, the website is a website such as "myintranet.mycompany.com", and the pattern is a common folder name such as "upload"

Mario 2010-10-20 20:34:14

The path component of URLs can be case-sensitive. Host names are not.

novalis 2010-10-20 20:44:13

@novalis, duly noted, and edited.

Brad 2010-10-20 20:49:07

@brad: thanks a lot, it works perfectly! It is an indexing system and I can't do more than add some regular expressions to identify these invalid URLs - the correct ones are all lowercase.

Mario 2010-10-20 21:00:40

Answer 2

+1 A:

I'd advise against doing the two things you are describing with a regular expression in one step. Use a URL parsing library to extract the path and hostname components separately. You want to do this for a couple of reasons. There can be some surprising stuff in the host portion of the URL that can throw you off. For instance, the hostname of

http://website@otherweb/uppeRcase/pattern/upperCase

is actually otherweb, and should be excluded, even though the part after http:// begins with website (there it is the user-info part, not the host). Similarly:

http://website/actual/path/component?uppeRcase/pattern/upperCase

should be excluded, even though the URL contains the pattern surrounded by uppercase path components, because the matching region is in the query string, not the path. Finally:

http://website/uppe%52case/%70attern/upper%43ase

is actually the same resource as your first example, but contains escapes that might prevent a regex from noticing it.

Once you've extracted just the path component and decoded its escape sequences, though, a regex is probably a great tool to use.
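
To make that concrete, here is a rough sketch of the parse-first approach using Python's standard urllib.parse module. The function name is_invalid, the default host "website", and the path rule are assumptions made for illustration; adapt them to your real host name and folder name.

```python
import re
from urllib.parse import urlsplit, unquote

# Assumed rule: the decoded path has an uppercase letter before or after
# the /pattern/ segment. This rule is an illustration, not a given.
PATH_RULE = re.compile(r"^/(.*[A-Z].*/pattern/.*|.*/pattern/.*[A-Z].*)$")

def is_invalid(url, host="website"):
    """Return True if url is on `host` and its path breaks the lowercase rule."""
    parts = urlsplit(url)
    # hostname skips any user-info ("website@") and is already lowercased.
    if parts.hostname != host:
        return False
    # Decode percent-escapes in the path only; the query string is ignored.
    path = unquote(parts.path)
    return bool(PATH_RULE.match(path))

# The three tricky URLs from this answer, plus the plain example.
print(is_invalid("http://website/uppeRcase/pattern/upperCase"))
print(is_invalid("http://website@otherweb/uppeRcase/pattern/upperCase"))
print(is_invalid("http://website/actual/path/component?uppeRcase/pattern/upperCase"))
print(is_invalid("http://website/uppe%52case/%70attern/upper%43ase"))
```

With this split, the regex only ever sees the decoded path, so the user-info trick and the query-string case are handled by the parser rather than by the expression.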

TokenMacGuy 2010-10-20 20:35:55

@TokenMacGuy: very good remarks. Fortunately, in my case the URLs are consistent. It is true that the second case is something that can happen, thanks a lot for that!

Mario 2010-10-20 21:05:53
