Hi, when I try to extract this video ID (AIiMa2Fe-ZQ) with a regular expression, I can't get the dash and all the letters after it.

Can someone help me, please?

Thanks

>>> id = re.search('(?<=\?v\=)\w+', 'http://www.youtube.com/watch?v=AIiMa2Fe-ZQ')
>>> print id.group(0)
AIiMa2Fe
+1  A: 
>>> re.search('(?<=v=)[\w-]+', 'http://www.youtube.com/watch?v=AIiMa2Fe-ZQ').group()
'AIiMa2Fe-ZQ'

\w is shorthand for [a-zA-Z0-9_] in Python 2.x; in Python 3 you'll have to pass the re.A flag to get the same behaviour. Your video ID clearly contains an additional character, the hyphen, so it has to be added to the character class. I've also removed the redundant escape backslashes from the lookbehind.
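
For readers on Python 3, here is a minimal sketch of the same search with the re.A (re.ASCII) flag mentioned above. The variable names are only illustrative, and the 'é' check is an added example, not part of the original answer.

```python
import re

url = 'http://www.youtube.com/watch?v=AIiMa2Fe-ZQ'

# [\w-] adds the hyphen to the word-character class. re.ASCII keeps \w
# limited to [a-zA-Z0-9_] on Python 3, where str patterns are Unicode-aware.
match = re.search(r'(?<=v=)[\w-]+', url, re.ASCII)
video_id = match.group() if match else None
print(video_id)

# Without the flag, \w also matches non-ASCII letters such as 'é'.
unicode_hit = re.fullmatch(r'\w', 'é') is not None
ascii_hit = re.fullmatch(r'\w', 'é', re.ASCII) is not None
print(unicode_hit, ascii_hit)
```

The hyphen is placed last in the class so that it is read as a literal character rather than as a range operator.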

SilentGhost
I think the `-ZQ$` is not part of the ID...
drewk
@drewk: OP quite clearly says that they are
SilentGhost
My bad -- sorry...
drewk
+1  A: 

/(?:/v/|/watch\?v=|/watch#!v=)([A-Za-z0-9_-]+)/

Explain the RE

There are three alternate YouTube URL formats: /v/[ID], watch?v=, and the new AJAX watch#!v=. This RE captures all three. There is also a new YouTube URL for user pages, of the form /user/[user]?content={complex URI}. That format is not captured by this regex.
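
A rough Python sketch of how this RE might be applied to the three formats. The function name extract_video_id and the sample /v/, #!v= and /user/ URLs are made up for illustration; only the watch?v= URL comes from the question.

```python
import re

# Same alternation as above, written as a Python raw string.
VIDEO_ID_RE = re.compile(r'(?:/v/|/watch\?v=|/watch#!v=)([A-Za-z0-9_-]+)')

def extract_video_id(url):
    """Return the captured video ID, or None if no known format matches."""
    m = VIDEO_ID_RE.search(url)
    return m.group(1) if m else None

samples = [
    'http://www.youtube.com/watch?v=AIiMa2Fe-ZQ',
    'http://www.youtube.com/v/AIiMa2Fe-ZQ',         # hypothetical sample
    'http://www.youtube.com/watch#!v=AIiMa2Fe-ZQ',  # hypothetical sample
    'http://www.youtube.com/user/example',          # user page, not matched
]
results = [extract_video_id(u) for u in samples]
print(results)
```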

drewk
+1 for youtube format coverage
manifest
+2  A: 

Instead of \w+, use the pattern below. The word character class (\w) doesn't include a dash; it only includes [a-zA-Z_0-9].

[\w-]+
Taylor Leese
+1  A: 

I don't know the pattern for YouTube hashes, but just include the "-" in the character class, as it is not considered a word character:

import re
id = re.search('(?<=\?v\=)[\w-]+', 'http://www.youtube.com/watch?v=AIiMa2Fe-ZQ')
print id.group(0)

I have edited the above because as it turns out:

>>> re.search("[\w|-]", "|").group(0)
'|'

The "|" in the character definition does not act as a special character but does indeed match the "|" pipe. My apologies.
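
To make the difference concrete, here is a small sketch comparing the two character classes on a made-up string containing a pipe (the test string 'a|b' is purely illustrative):

```python
import re

text = 'a|b'

# With "|" inside the brackets, the pipe is matched literally,
# so the whole string is accepted.
with_pipe = re.fullmatch(r'[\w|-]+', text)

# Without it, the class stops at the pipe: no full match,
# and a search only returns the part before the pipe.
without_pipe = re.fullmatch(r'[\w-]+', text)
first_run = re.search(r'[\w-]+', text).group()

print(with_pipe is not None, without_pipe is not None, first_run)
```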

manifest
is pipe allowed in a youtube ID? I don't think so.
SilentGhost
From the docs: "Some characters, like '|' or '(', are special." "A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B." "To match a literal '|', use \|, or enclose it inside a character class, as in [|]."
manifest
@manifest: **youtube video id doesn't contain `|`** (pipe).
SilentGhost
@SilentGhost Thanks, I had mistakenly believed the "|" (pipe) would act as a special character. I've corrected the answer.
manifest
+1  A: 

Use the urlparse module instead of a regex for this kind of thing.

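import urlparse

url = 'http://www.youtube.com/watch?v=AIiMa2Fe-ZQ'

parsed_url = urlparse.urlparse(url)
# Only handle /watch URLs on a youtube.com host.
if 'youtube.com' in parsed_url.netloc and parsed_url.path == '/watch':
    # Normal format: the ID is in the query string (?v=...).
    video = urlparse.parse_qs(parsed_url.query).get('v', None)

    # AJAX format: the ID is in the fragment (#!v=...).
    if video is None:
        video = urlparse.parse_qs(parsed_url.fragment.strip('!')).get('v', None)

    if video is not None:
        print video[0]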

EDIT: Updated for the upcoming new youtube url format.
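
On Python 3 the urlparse module lives in urllib.parse. A minimal sketch of the same approach there, wrapped in a function whose name (youtube_video_id) is made up for illustration; the #!v= and example.com URLs are hypothetical samples:

```python
from urllib.parse import urlparse, parse_qs

def youtube_video_id(url):
    """Return the 'v' parameter of a youtube.com /watch URL, or None."""
    parsed = urlparse(url)
    # Same loose host check as above; it also accepts hosts that merely
    # contain 'youtube.com', so tighten it if that matters.
    if 'youtube.com' not in parsed.netloc or parsed.path != '/watch':
        return None
    # Normal format: ?v=ID in the query string.
    values = parse_qs(parsed.query).get('v')
    if values is None:
        # AJAX format: #!v=ID in the fragment.
        values = parse_qs(parsed.fragment.lstrip('!')).get('v')
    return values[0] if values else None

plain = youtube_video_id('http://www.youtube.com/watch?v=AIiMa2Fe-ZQ')
ajax = youtube_video_id('http://www.youtube.com/watch#!v=AIiMa2Fe-ZQ')
other = youtube_video_id('http://example.com/watch?v=AIiMa2Fe-ZQ')
print(plain, ajax, other)
```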

Ivo Wetzel
A: 

I'd try this:

>>> import re
>>> a = re.compile(r'.*(\-\w+)$')
>>> a.search('http://www.youtube.com/watch?v=AIiMa2Fe-ZQ').group(1)
'-ZQ'
hughdbrown