I'm trying to match and break up typical TV torrent titles:

MyTV.Show.S09E01.HDTV.XviD
MyTV.Show.S10E02.HDTV.XviD
MyTV.Show.901.HDTV.XviD
MyTV.Show.1102.HDTV.XviD

I'm trying to break these strings up into 3 capture groups for each entry: Title, Season, Episode.

I can handle the first two easily enough:

^([a-zA-Z0-9.]*)\.S([0-9]{1,2})E([0-9]{1,2}).*$

However, for the third and fourth ones it is difficult to separate the season from the episode. It would be easier if I could work backwards. For example, with "901", working from the end I would take the last two digits as the episode number, and anything remaining before that would be the season number.

Does anyone have any tips for how I can break these strings up into those relevant capture groups?

+1  A: 

Almost every media file I've ever seen that has come from a torrent had two-digit episodes. With that, you should be able to do E([0-9]{2}). instead and get the expression to match.

I'd estimate 99.9% of shows are marked with two-digit episodes. If you're trying to write a script to easily label your own shows, I'd go with the two-digit episode assumption and manually rename any mistagged files you come across. If you're trying to write something for public consumption, you'll probably have a lot more naming schemes to consider. I've seen other applications try this in the past, and all have worked just so-so. It's a hard problem that probably has no single solution.
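As a rough sketch of the two-digit episode assumption (in Python rather than C#, with the hypothetical names `SE_FORM`, `BARE_FORM`, and `split_title` that are not from this thread), one could try the `SxxEyy` form first and fall back to a bare three- or four-digit number whose last two digits are the episode:

```python
import re

# Assumption: episodes are always exactly two digits, as suggested above.
# "S09E01" style: explicit season and episode markers.
SE_FORM = re.compile(r"^(.+?)\.S(\d{1,2})E(\d{2})\.", re.IGNORECASE)
# "901" / "1102" style: the last two digits are the episode,
# the one or two digits before them are the season.
BARE_FORM = re.compile(r"^(.+?)\.(\d{1,2})(\d{2})\.")


def split_title(name):
    """Return (title, season, episode), or None if neither form matches."""
    for pattern in (SE_FORM, BARE_FORM):
        m = pattern.match(name)
        if m:
            title, season, episode = m.groups()
            return title, int(season), int(episode)
    return None


for name in ("MyTV.Show.S09E01.HDTV.XviD", "MyTV.Show.1102.HDTV.XviD"):
    print(split_title(name))
# ('MyTV.Show', 9, 1)
# ('MyTV.Show', 11, 2)
```

Names that fit neither form return `None`, so they can be set aside for manual renaming.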

Dave McClelland
@Dave McClelland, your regex sample that you posted is addressing the portion that I have no problem with. When the letters 'S' and 'E' are present, I have no troubles. I'm looking for help with the format when they aren't there.
KingNestor
@King - Sorry - I had been editing my post to more accurately address your concerns, probably when you were already commenting. Does my update help any further?
Dave McClelland
+4  A: 

Here's what I would use:

(.*?)\.S?(\d{1,2})E?(\d{2})\.(.*)

Has capture groups:

1: Name
2: Season
3: Episode
4: The Rest

Here's some code in C# (courtesy of this post): see it live

using System;
using System.Text.RegularExpressions;

public class Test
{

    public static void Main()
    {
        string s = @"MyTV.Show.S09E01.HDTV.XviD
            MyTV.Show.S10E02.HDTV.XviD
            MyTV.Show.901.HDTV.XviD
            MyTV.Show.1102.HDTV.XviD";

        Extract(s);

    }

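    // Groups: 1 = name, 2 = season (1-2 digits), 3 = episode (exactly 2 digits), 4 = the rest.
    // The optional S and E let one pattern cover both "S09E01" and "901".
    private static readonly Regex rx = new Regex
        (@"(.*?)\.S?(\d{1,2})E?(\d{2})\.(.*)", RegexOptions.IgnoreCase);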

    static void Extract(string text)
    {
        MatchCollection matches = rx.Matches(text);

        foreach (Match match in matches)
        {
            Console.WriteLine("Name: {0}, Season: {1}, Ep: {2}, Stuff: {3}",
                match.Groups[1].ToString().Trim(), match.Groups[2], 
                match.Groups[3], match.Groups[4].ToString().Trim());
        }
    }

}

Produces:

Name: MyTV.Show, Season: 09, Ep: 01, Stuff: HDTV.XviD
Name: MyTV.Show, Season: 10, Ep: 02, Stuff: HDTV.XviD
Name: MyTV.Show, Season: 9, Ep: 01, Stuff: HDTV.XviD
Name: MyTV.Show, Season: 11, Ep: 02, Stuff: HDTV.XviD
NullUserException
Interesting, I would have thought that (\d{1,2}) would greedily try to match 2 digits, since technically 2 were available.
KingNestor
@KingNestor It won't because then it would fail to match the `\d{2}` that comes after it.
NullUserException
So, in its process of matching, does it first attempt to match 2 and then backtrack later to try matching 1?
KingNestor
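In a backtracking regex engine, that is the order of events: the greedy `\d{1,2}` takes two digits first and gives one back only when the rest of the pattern cannot match. A small sketch in Python (whose `re` syntax for this pattern is the same as the C# one above; the variable names are just for illustration) makes the difference visible:

```python
import re

PATTERN = re.compile(r"(.*?)\.S?(\d{1,2})E?(\d{2})\.(.*)", re.IGNORECASE)

# Nothing is required after the greedy group, so it keeps both digits.
greedy_alone = re.match(r"(\d{1,2})(\d*)", "901").groups()

# \d{2} must still match afterwards, so the greedy group gives one digit back.
forced_backtrack = re.match(r"(\d{1,2})(\d{2})", "901").groups()

# With four digits, the first (two-digit) attempt already succeeds.
no_backtrack = re.match(r"(\d{1,2})(\d{2})", "1102").groups()

full = PATTERN.match("MyTV.Show.901.HDTV.XviD").groups()

print(greedy_alone)      # ('90', '1')
print(forced_backtrack)  # ('9', '01')
print(no_backtrack)      # ('11', '02')
print(full)              # ('MyTV.Show', '9', '01', 'HDTV.XviD')
```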