+1  Q: 

Regex explanation

I am looking at the code in the tumblr bookmarklet and was curious what the code below did.

try{
    if(!/^(.*\.)?tumblr[^.]*$/.test(l.host))
        throw(0);
    tstbklt();
}

Can anyone tell me what the if line is testing? I have tried to decode the regex but have been unable to do so.

+1  A: 

My attempt to break it down. I'm no expert with regex, however:

if(!/^(.*\.)?tumblr[^.]*$/.test(l.host))

The ! at the front isn't really regex; it negates the result, so the body of the if() only executes when the test fails.

if(!/^(.*\.)?tumblr[^.]*$/.test(l.host))

The ^(.*\.)? part allows any characters before the word tumblr, as long as they are followed by a literal dot. But it is all optional (see the ? at the end).

if(!/^(.*\.)?tumblr[^.]*$/.test(l.host))

Next, [^.] matches any character except a dot, the * extends that to any number of such characters (so it doesn't stop after one), and the $ means the match has to run until the end of the string.

Finally, the .test() looks to test it against the current hostname or whatever l.host contains (I'm not familiar with the tumblr bookmarklet)

So basically, it looks like that part is checking whether the host is part of tumblr and, if it is not, throwing that exception.

Looking forward to seeing how wrong I am :)

Bartek
+5  A: 

Initially excluding the specifics of the regex, this code is:

if ( ! /.../.test(l.host) )

"if not regex.matches(l.host)" or "if l.host does not match this regex"

So, the regex must correctly describe the contents of l.host for the conditional to fail and thus avoid throwing the error.
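To make the negated test concrete, here is a minimal sketch. The `pattern` regex, the `check` helper and the sample strings are made up for illustration; only the `if (!regex.test(...)) throw` shape comes from the bookmarklet code.

```javascript
// Hypothetical stand-in regex, used only to show the control flow.
var pattern = /^abc/;

// Same shape as the bookmarklet check: throw when the string does NOT match.
function check(value) {
    if (!pattern.test(value)) {
        throw 0; // equivalent to throw(0); the parentheses are just grouping
    }
    return "matched";
}

var results = [];
["abcdef", "xyz"].forEach(function (value) {
    try {
        results.push(check(value));
    } catch (e) {
        results.push("threw " + e);
    }
});
// results is now ["matched", "threw 0"]
console.log(results);
```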

On to the regex itself:

^(.*\.)?tumblr[^.]*$

This is checking for tumblr either at the very start of the string or right after any text ending in a dot, followed only by non-dot characters up to the end:

^       # start of string (there is no m flag, so not every line)
(       # begin capturing group 1
.*      # match any (non-newline) character, as many times as possible, but zero allowed
\.      # match a literal .
)       # end capturing group 1
?       # make whole preceding item optional
tumblr  # match literal text tumblr
[^.]*   # match any non . character, as many times as possible, but zero allowed
$       # match end of string (there is no m flag, so not every line)
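
A quick way to check the breakdown above is to run the regex against a few hosts. The regex is the one from the question; the host names are made up for illustration.

```javascript
// The regex from the question, unchanged.
var re = /^(.*\.)?tumblr[^.]*$/;

// Sample hosts, invented for this example.
var hosts = [
    "tumblr",
    "foo.tumblr",
    "a.b.tumblrxyz",
    "tumblr.com",
    "www.tumblr.com",
    "example.com"
];

// Record whether each host matches.
var outcome = {};
hosts.forEach(function (host) {
    outcome[host] = re.test(host);
});
console.log(outcome);
// tumblr, foo.tumblr and a.b.tumblrxyz match;
// tumblr.com, www.tumblr.com and example.com do not.
```

Note that tumblr.com and www.tumblr.com both fail: [^.]* cannot get past the dot before com, so only hosts whose last dot-separated part starts with tumblr will pass.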


I thought it was testing to see if the host was tumblr

Yeah, it looked like it might be intended to check that, but if so it's the wrong way to do it.
For that, the first bit should be something like ^(?:[\w-]+\.)? to match an optional subdomain (the ?: makes the group non-capturing, and [\w-]+ is at least one alphanumeric, underscore or hyphen), and the last bit should be either \.(?:com|net|org)$ or perhaps something like (?:\.[a-zA-Z]+)+$, depending on how flexible the TLD section might need to be.
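
As a rough sketch of how those pieces fit together, assuming a literal tumblr sits between them and the fixed com/net/org ending is used (the sample hosts are made up):

```javascript
// Optional single subdomain, then tumblr, then one of three TLDs.
var strict = /^(?:[\w-]+\.)?tumblr\.(?:com|net|org)$/;

// Sample hosts, invented for this example.
var samples = [
    "tumblr.com",
    "www.tumblr.com",
    "my-blog.tumblr.com",
    "a.b.tumblr.com",
    "tumblr.example.com",
    "nottumblr.com"
];

// Record whether each host matches.
var verdicts = {};
samples.forEach(function (host) {
    verdicts[host] = strict.test(host);
});
console.log(verdicts);
// The first three match; the last three do not.
```

a.b.tumblr.com is rejected because [\w-]+ cannot contain a dot, so this version only allows one level of subdomain. Whether that is wanted depends on what the check is really for.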

Peter Boughton
Great way to show the process. Way better than mine. You earn my upvote!
Bartek
I thought it was testing to see if the host was tumblr but when I tested it (by running the bookmarklet while being on tumblr.com) I didn't see that it acted any different.
SonnyBurnette
Yeah, it certainly looked like it might be trying to do that, but if that is the intent it's wrong. I've added a quick bit extra about that (though real URL validation is probably a bit more complex).
Peter Boughton
Thanks for the follow-up. That must be why it doesn't act any different when I checked it on their site. With your new way to do it, does it require some kind of string before the domain.com? If the user was on just tumblr.com (without www), would it pass or fail then?
SonnyBurnette
Nope, it's still got the `?` after the group to make the subdomain part optional. Just saw I didn't post the complete updated version in my answer, so here it is: `^(?:[\w-]+\.)?tumblr\.(?:com|net|org)$`
Peter Boughton