I've written a URL validator for a project I am working on. It works great for my requirements, except that it breaks once the last part of the URL reaches 22 characters. My expression:

/((https?):\/\/)([^\s.]+.)+([^\s.]+)(:\d+\/\S+)/i

It expects input that looks like "http(s)://hostname:port/location". When I give it the input:

https://demo10:443/111112222233333444445

it works, but if I pass the input

https://demo10:443/1111122222333334444455

it breaks. You can test it out easily at http://ryanswanson.com/regexp/#start. Oddly, I can't reproduce the problem with just the part I would expect to be relevant, /(:\d+\/\S+)/i: I can put as many characters as I like after the required / and it works fine. Any ideas or known bugs?

Edit: Here is some code for a sample application that demonstrates the problem:

<mx:Application xmlns:mx="http://www.adobe.com/2006/mxml" layout="absolute">
<mx:Script>
    <![CDATA[
        private function click():void {
             var value:String = input.text;
             var matches:Array = value.match(/((https?):\/\/)([^\s.]+.)+([^\s.]+)(:\d+\/\S+)/i);
             if(matches == null || matches.length < 1 || matches[0] != value) {
                area.text = "No Match";
             }
             else {
                area.text = "Match!!!";
             }
        }
    ]]>
</mx:Script>
<mx:TextInput x="10" y="10" id="input"/>
<mx:Button x="178" y="10" label="Button" click="click()"/>
<mx:TextArea x="10" y="40" width="233" height="101" id="area"/>
</mx:Application>
+1  A: 

This is a bug, either in Ryan's implementation or within Flex/Flash.

The regular expression syntax used above (minus the surrounding slashes and flags) is also valid in Python, which produces the following output:

# the case-insensitive flag is omitted because it doesn't matter for this input
>>> import re
>>> rx = re.compile(r'((https?):\/\/)([^\s.]+.)+([^\s.]+)(:\d+\/\S+)')
>>> print(rx.match('https://demo10:443/1111122222333334444455').groups())
('https://', 'https', 'demo1', '0', ':443/1111122222333334444455')
Kaleb Pederson
It's definitely not just his implementation, as it doesn't work in my code either.
Tommy
Interesting. I noticed that Ryan's implementation slowed down more and more as the URL got longer, so I wonder if it's a problem with the regex analysis algorithm. If you have a working code sample, please paste it.
Kaleb Pederson
+1  A: 

I debugged your regular expression in RegexBuddy, and apparently it takes millions of steps to find a match. That usually means something is terribly wrong with the regular expression.
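
To get a feel for where that many steps could come from, here is a rough count in Python. It assumes a plain backtracking engine and only counts the ways the nested part ([^\s.]+.)+([^\s.]+) can carve up a run of characters that contains no dots or whitespace: each repetition of the first group takes at least two characters (the unescaped . matches anything) and the last group takes at least one. The helper names chunkings and splits are made up for this sketch, and the real step count depends on the engine.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def chunkings(n):
    # Ways to cut n characters into one or more chunks of length >= 2,
    # i.e. the ways the repeated first group can consume exactly n characters.
    if n < 2:
        return 0
    # Either one chunk takes everything, or a first chunk of k characters
    # is followed by a chunking of the remaining n - k characters.
    return 1 + sum(chunkings(n - k) for k in range(2, n - 1))

def splits(n):
    # The last group takes the final t >= 1 characters, the chunks take the rest.
    return sum(chunkings(n - t) for t in range(1, n - 1))

# Dot-free run that follows "https://" in the question's inputs,
# with a path of the given length.
for path_len in (5, 10, 15, 21, 22):
    run = 'demo10:443/' + '1' * path_len
    print(path_len, len(run), splits(len(run)))
```

The count grows exponentially with the length of the run, which fits the observation that each extra character in the path makes the match noticeably slower.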

Look at ([^\s.]+.)+([^\s.]+)(:\d+\/\S+).

1- It seems like you're trying to match subdomains too, but that doesn't work as intended because you didn't escape the dot, so it matches any character. Once you escape it, demo10:443/123 no longer matches, because ([^\s.]+\.)+ requires at least one dot. Change ([^\s.]+\.)+ to ([^\s.]+\.)* and it'll work.

2- [^\s.]+ is a bad character class here: it also matches the colon and the slashes, so it swallows the whole string and the engine has to start backtracking from there. You can avoid this by using [^\s:.]+, which stops at the colon.
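
Both points can be checked quickly in Python. This is only a sketch: Python's engine is not the one Flex uses, and the variable names are made up for the example.

```python
import re

# Point 1: with the dot escaped, "+" demands at least one "label." group,
# so a dot-free host such as demo10 is rejected; "*" makes that part optional.
host_plus = re.compile(r'https?://([^\s.]+\.)+([^\s.]+):\d+/\S+')
host_star = re.compile(r'https?://([^\s.]+\.)*([^\s.]+):\d+/\S+')

url = 'https://demo10:443/123'
plus_matches = host_plus.fullmatch(url) is not None
star_matches = host_star.fullmatch(url) is not None
print(plus_matches, star_matches)

# Point 2: [^\s.]+ runs straight past the colon, [^\s:.]+ stops in front of it.
tail = 'demo10:443/123'
greedy_part = re.match(r'[^\s.]+', tail).group()
bounded_part = re.match(r'[^\s:.]+', tail).group()
print(greedy_part)
print(bounded_part)
```

The first print shows that only the * variant accepts a host without dots; the last two show how much of the input each character class consumes before the rest of the pattern gets a chance to match.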

This one should work as you want: https?:\/\/([^\s:.]+\.)*([^\s:.]+):\d+\/\S+
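
As a quick sanity check outside Flex, the pattern can be run against both inputs from the question in Python. This is a sketch under a few assumptions: is_valid is a made-up helper, re.fullmatch stands in for the matches[0] != value comparison in the sample application, and the last two URLs are hypothetical extra cases.

```python
import re

# Pattern from the answer; re.IGNORECASE mirrors the /i flag used in the question.
URL_RE = re.compile(r'https?:\/\/([^\s:.]+\.)*([^\s:.]+):\d+\/\S+', re.IGNORECASE)

def is_valid(url):
    # Accept the input only if the pattern covers the whole string.
    return URL_RE.fullmatch(url) is not None

samples = [
    'https://demo10:443/111112222233333444445',    # worked in the question
    'https://demo10:443/1111122222333334444455',   # broke in the question
    'http://www.example.com:8080/index.html',      # hypothetical host with subdomains
    'https://demo10/missing-port',                 # hypothetical input without a port
]
for sample in samples:
    print(sample, is_valid(sample))
```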

tiftik
Works great, thanks! You did forget the slash after the d+, but no worries. https?:\/\/([^\s:.]+\.)*([^\s:.]+):\d+\/\S+
Tommy