I'm having trouble coming up with a regular expression to match a particular case. I have a list of tv shows in about 4 formats:

  • Name.Of.Show.S01E01
  • Name.Of.Show.0101
  • Name.Of.Show.01x01
  • Name.Of.Show.101

What I want to match is the show name. My main problem is that my regex captures the show name with a trailing '.'. My regex is the following:

"^([0-9a-zA-Z\.]+)(S[0-9]{2}E[0-9]{2}|[0-9]{4}|[0-9]{2}x[0-9]{2}|[0-9]{3})"

Some Examples:

>>> import re

>>> SHOW_INFO = re.compile(r"^([0-9a-zA-Z\.]+)(S[0-9]{2}E[0-9]{2}|[0-9]{4}|[0-9]{2}x[0-9]{2}|[0-9]{3})")
>>> match = SHOW_INFO.match("Name.Of.Show.S01E01")
>>> match.groups()
('Name.Of.Show.', 'S01E01')
>>> match = SHOW_INFO.match("Name.Of.Show.0101")
>>> match.groups()
('Name.Of.Show.0', '101')
>>> match = SHOW_INFO.match("Name.Of.Show.01x01")
>>> match.groups()
('Name.Of.Show.', '01x01')
>>> match = SHOW_INFO.match("Name.Of.Show.101")
>>> match.groups()
('Name.Of.Show.', '101')

So the question is how do I avoid the first group ending with a period? I realize I could simply do:

var.strip(".")

However, that doesn't handle the case of "Name.Of.Show.0101". Is there a way I could improve the regex to handle that case better?

Thanks in advance.

+2  A: 

So the only real restriction on the last group is that it doesn’t contain a dot? Easy:

^(.*?)(\.[^.]+)$

The first group matches anything, non-greedily. The important part is the second group, which starts with a dot and then matches non-dot characters up to the end of the string, so the second group keeps its leading dot.

This works with all your test cases.
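For anyone who wants to try it, here is a minimal sketch of that pattern run in Python against the four names from the question. The names `LAST_CHUNK` and `split_last_chunk` are made up for this illustration and are not part of the answer.

```python
import re

# Pattern from the answer: a lazy first group, then a final chunk that
# starts with a dot and contains no further dots.
LAST_CHUNK = re.compile(r"^(.*?)(\.[^.]+)$")

def split_last_chunk(filename):
    """Return (name, last_chunk), or None if there is no dot-led last chunk.

    Note that last_chunk keeps its leading dot.
    """
    match = LAST_CHUNK.match(filename)
    return match.groups() if match else None

for name in ("Name.Of.Show.S01E01", "Name.Of.Show.0101",
             "Name.Of.Show.01x01", "Name.Of.Show.101"):
    print(split_last_chunk(name))
# ('Name.Of.Show', '.S01E01')
# ('Name.Of.Show', '.0101')
# ('Name.Of.Show', '.01x01')
# ('Name.Of.Show', '.101')
```

If the dot is not wanted in the second group, moving `\.` outside the parentheses should be enough.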

Konrad Rudolph
Thanks, that looks good, nice and concise.
AJ
+2  A: 

I think this will do:

>>> regex = re.compile(r'^([0-9a-z.]+)\.(S[0-9]{2}E[0-9]{2}|[0-9]{3,4}|[0-9]{2}x[0-9]{2})$', re.I)
>>> regex.match('Name.Of.Show.01x01').groups()
('Name.Of.Show', '01x01')
>>> regex.match('Name.Of.Show.101').groups()
('Name.Of.Show', '101')

ETA: Of course, if you're just trying to extract different bits from trusted strings you could just use string methods:

>>> 'Name.Of.Show.101'.rpartition('.')
('Name.Of.Show', '.', '101')
SilentGhost
Thanks, it never even crossed my mind to include the . outside both of the groups. I didn't show the entire string, there are usually other items after the episode #'s like "The.Name.Of.Show.S01E01.something.else", so rpartition wouldn't work.
AJ
@AJ: then you should be careful not to include `$` in the regex
SilentGhost
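To illustrate the point about `$`, here is a rough sketch of the same pattern with the end anchor dropped, run on the longer string from the comment above. The names `EPISODE` and `show_and_episode` are made up for this example.

```python
import re

# Same alternatives as in the answer, but without the trailing `$`, so
# extra dot-separated items after the episode number are tolerated.
EPISODE = re.compile(
    r"^([0-9a-z.]+)\.(S[0-9]{2}E[0-9]{2}|[0-9]{3,4}|[0-9]{2}x[0-9]{2})",
    re.I,
)

def show_and_episode(filename):
    """Return (show_name, episode), or None if no episode marker is found."""
    match = EPISODE.match(filename)
    return match.groups() if match else None

print(show_and_episode("The.Name.Of.Show.S01E01.something.else"))
# ('The.Name.Of.Show', 'S01E01')
```

One caveat: the first group is greedy, so if a later item happens to start with three or four digits (a hypothetical "Show.S01E01.720p", say), that later item is what ends up in the second group. Whether that matters depends on what the real filenames look like.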
A: 

If the last part never contains a dot: ^(.*)\.([^\.]+)$

Jan Willem B
+1  A: 

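I believe this will do what you want, provided it is matched case-insensitively (re.I in Python), since the character class only lists lowercase letters: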

^([0-9a-z\.]+)\.(?:S[0-9]{2}E[0-9]{2}|[0-9]{3,4}|[0-9]{2}(?:x[0-9]+)?)$

I tested this against the following list of shows:

  • 30.Rock.S01E01
  • The.Office.0101
  • Lost.01x01
  • How.I.Met.Your.Mother.101

If those 4 cases are representative of the types of files you have, then that regex should place the show title in its own capture group and toss away the rest. This filter is, perhaps, a bit more restrictive than some others, but I'm a big fan of matching exactly what you need.
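A quick way to check this in Python is sketched below. The `re.I` flag is an assumption added here, because the character class only lists lowercase letters and the test names contain capitals. `TITLE` and `show_title` are names invented for the example.

```python
import re

# Pattern from the answer. re.I is added as an assumption so that the
# lowercase-only class [0-9a-z\.] also accepts capital letters.
TITLE = re.compile(
    r"^([0-9a-z\.]+)\.(?:S[0-9]{2}E[0-9]{2}|[0-9]{3,4}|[0-9]{2}(?:x[0-9]+)?)$",
    re.I,
)

def show_title(filename):
    """Return the show title (capture group 1), or None if nothing matches."""
    match = TITLE.match(filename)
    return match.group(1) if match else None

for name in ("30.Rock.S01E01", "The.Office.0101",
             "Lost.01x01", "How.I.Met.Your.Mother.101"):
    print(show_title(name))
# 30.Rock
# The.Office
# Lost
# How.I.Met.Your.Mother
```

The episode part sits in a non-capturing group, so only the title is returned. Change `(?:` to `(` on the outer group if the episode number is needed as well.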

ABach
+1  A: 

It seems like the problem is that you haven't specified that the period before the last group is required, so something like ^([0-9a-zA-Z\.]+)\.(S[0-9]{2}E[0-9]{2}|[0-9]{4}|[0-9]{2}x[0-9]{2}|[0-9]{3}) might work.
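As a sanity check, here is a small sketch that runs that pattern over the four examples from the question. The helper name `show_info` is made up here, and the pattern is written as a raw string.

```python
import re

# The question's pattern with a required literal dot between the groups.
SHOW_INFO = re.compile(
    r"^([0-9a-zA-Z\.]+)\.(S[0-9]{2}E[0-9]{2}|[0-9]{4}|[0-9]{2}x[0-9]{2}|[0-9]{3})"
)

def show_info(filename):
    """Return (show_name, episode), or None if the name does not match."""
    match = SHOW_INFO.match(filename)
    return match.groups() if match else None

for name in ("Name.Of.Show.S01E01", "Name.Of.Show.0101",
             "Name.Of.Show.01x01", "Name.Of.Show.101"):
    print(show_info(name))
# ('Name.Of.Show', 'S01E01')
# ('Name.Of.Show', '0101')
# ('Name.Of.Show', '01x01')
# ('Name.Of.Show', '101')
```

Because the dot now has to sit directly before the episode number, the "0101" case can no longer be split as "0" plus "101".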

Mark