I am trying to validate a Youtube URL using regex:

preg_match('~http://www\.youtube\.com/watch\?v=[a-zA-Z0-9-]+~', $videoLink)

It kind of works, but it also matches malformed URLs. For example, this matches, as expected:

http://www.youtube.com/watch?v=Zu4WXiPRek

But so will this:

http://www.youtube.com/watch?v=Zu4WX£&P!ek

And this won't:

http://www.youtube.com/watch?v=!Zu4WX£&P4ek

I think it's because of the + operator. It seems to match only the first character after v=, when it needs to match everything after v= against [a-zA-Z0-9-]. Any help is appreciated, thanks.

+1  A: 

To provide an alternative that is larger and much less elegant than a regex, but works with PHP's native URL parsing functions so it might be a bit more reliable in the long run:

 $url = "http://www.youtube.com/watch?v=Zu4WXiPRek";

 $query_string = parse_url($url, PHP_URL_QUERY); // v=Zu4WXiPRek

 $query_string_parsed = array();                        
 parse_str($query_string, $query_string_parsed); // an array with all GET params

 echo($query_string_parsed["v"]); // Will output Zu4WXiPRek that you can then
                                  // validate for [a-zA-Z0-9] using a regex
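
As a rough sketch of the validation step the last comment mentions, the extraction and the check could be wrapped together like this. The function name extractVideoId is made up for illustration, and the allowed character set is simply the one from the question, not an official YouTube rule:

```php
<?php
// Hypothetical helper (not part of the answer above): returns the video ID
// if $url has a "v" parameter made only of [a-zA-Z0-9-], otherwise null.
function extractVideoId(string $url): ?string
{
    $query = parse_url($url, PHP_URL_QUERY);
    if (!is_string($query)) {
        return null; // no query string, or the URL could not be parsed
    }

    $params = [];
    parse_str($query, $params);

    if (!isset($params['v']) || !is_string($params['v'])) {
        return null; // "v" is missing or was given as an array (v[]=...)
    }

    // ^ and \z force the whole value to consist of allowed characters.
    return preg_match('~^[a-zA-Z0-9-]+\z~', $params['v']) === 1
        ? $params['v']
        : null;
}

var_dump(extractVideoId('http://www.youtube.com/watch?v=Zu4WXiPRek'));  // string(10) "Zu4WXiPRek"
var_dump(extractVideoId('http://www.youtube.com/watch?v=Zu4WX£&P!ek')); // NULL
```

Because the value of v is isolated first, extra parameters after it do not need to be accounted for in the regex.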
Pekka
just want to point out that this is only really useful (and IMO recommended) if you already have just the url...but not really if he's scraping a page for urls...
Crayon Violent
That just seems like added code going back to the original problem. The problem is with validating the string after `v=`, which is what this code extracts. I don't need it extracted, I just need to make sure the rest of the URL after `v=` is matched by `[a-zA-Z0-9-]`.
Will
@Will yeah. This is a more standards-conformant way that can deal with changing URL structures to some extent. For example, it doesn't break when a URL has the popular ` as far as I can see, @lonesomeday answers your specific question
Pekka
A: 

The problem is that you are not requiring any particular number of characters in the v= part of the URL. So, for instance, checking

http://www.youtube.com/watch?v=Zu4WX£&P!ek

will match

http://www.youtube.com/watch?v=Zu4WX

and therefore return true. You need to either specify the number of characters you need in the v= part:

preg_match('~http://www\.youtube\.com/watch\?v=[a-zA-Z0-9-]{10}~', $videoLink)

or specify that the group [a-zA-Z0-9-] must be the last part of the string:

preg_match('~http://www\.youtube\.com/watch\?v=[a-zA-Z0-9-]+$~', $videoLink)

Your other example

http://www.youtube.com/watch?v=!Zu4WX£&P4ek

does not match, because the + sign requires at least one character matching [a-zA-Z0-9-] immediately after v=, and ! is not in that set.
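
To see the difference for yourself, here is a small sketch that runs the same character class with and without anchors against the malformed URL from the question. The variable names are just for illustration, and the host is written as www\.youtube\.com so that it matches the sample URLs:

```php
<?php
// Same character class, without and with anchors.
$unanchored = '~http://www\.youtube\.com/watch\?v=[a-zA-Z0-9-]+~';
$anchored   = '~^http://www\.youtube\.com/watch\?v=[a-zA-Z0-9-]+$~';

$malformed = 'http://www.youtube.com/watch?v=Zu4WX£&P!ek';

// Unanchored: succeeds, but only on the leading part of the string.
preg_match($unanchored, $malformed, $m);
echo $m[0], "\n"; // http://www.youtube.com/watch?v=Zu4WX

// Anchored: the allowed characters must run all the way to the end.
var_dump(preg_match($anchored, $malformed));                                   // int(0)
var_dump(preg_match($anchored, 'http://www.youtube.com/watch?v=Zu4WXiPRek')); // int(1)
```

Printing $m[0] shows that preg_match only reports whether some substring matched, which is why the unanchored version returns true for the malformed URL.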

lonesomeday
I'm pretty sure the v= part varies, that's why I didn't use that before... and using `[a-zA-Z0-9-]$` didn't work either. It's just returning false for everything.
Will
That's because it should have been `[a-zA-Z0-9-]+$`; just a typo.
Brad F Jacobs
Ah, savior. Thanks! :)
Will
Fixed -- thanks for the catch, premiso.
lonesomeday
A: 

Short answer:

preg_match('%(http://www\.youtube\.com/watch\?v=(?:[a-zA-Z0-9-])+)(?:[&"\'\s]|$)%', $videoLink)

There are a few assumptions made here, so let me explain:

  • I added a capturing group ( ... ) around the entire http://www.youtube.com/watch?v=blah part of the link, so that we can say "I want to get the whole validated link up to and including the ?v=movieHash"
  • I added the non-capturing group (?: ... ) around your character set [a-zA-Z0-9-] and left the + sign outside of that. This will allow us to match all allowable characters up to a certain point.
  • Most importantly, you need to tell it how you expect your link to terminate. I'm taking a guess for you with (?:[&"\'\s]|$), where the |$ alternative also accepts a link that sits at the very end of the string:

    1) Will it be in html format (e.g. anchor tag)? If so, the link in href will obviously end with a " or '.
    2) Or maybe there's more to the query string, so there would be an & after the value of v.
    3) Maybe there's a space or line break after the end of the link \s.

The important piece is that you can get much more accurate results if you know what's surrounding what you are searching for, as is the case with many regular expressions.

This non-capturing group (in which I'm making assumptions for you) will take a stab at finding and ignoring all the extra junk after what you care about (the ?v=awesomeMovieHash).

Results:

http://www.youtube.com/watch?v=Zu4WXiPRek
 - Group 1 contains http://www.youtube.com/watch?v=Zu4WXiPRek

http://www.youtube.com/watch?v=Zu4WX&a=b
 - Group 1 contains http://www.youtube.com/watch?v=Zu4WX

http://www.youtube.com/watch?v=!Zu4WX£&P4ek
 - No match

a href="http://www.youtube.com/watch?v=Zu4WX&size=large"
 - Group 1 contains http://www.youtube.com/watch?v=Zu4WX

http://www.youtube.com/watch?v=Zu4WX£&P!ek
 - No match
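
If you are pulling links out of a larger chunk of text rather than checking a single string, something along these lines might work. It is only a sketch: the $html sample is made up, and the pattern is a variant of the one above that uses a lookahead (?= ... ) for the terminator and also accepts the end of the input:

```php
<?php
// Sample input invented for illustration.
$html = '<a href="http://www.youtube.com/watch?v=Zu4WX&size=large">one</a> '
      . 'and http://www.youtube.com/watch?v=Zu4WXiPRek';

// The lookahead checks the terminator without consuming it;
// "|$" additionally accepts a link at the very end of the input.
$pattern = '%http://www\.youtube\.com/watch\?v=[a-zA-Z0-9-]+(?=[&"\'\s]|$)%';

preg_match_all($pattern, $html, $matches);
print_r($matches[0]);
// Array
// (
//     [0] => http://www.youtube.com/watch?v=Zu4WX
//     [1] => http://www.youtube.com/watch?v=Zu4WXiPRek
// )
```

With the lookahead, the whole match is already the clean link, so no capturing group is needed to strip the terminating character.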
methai