I have music file names like:

Gorillaz (2001)
Gorillaz (7th State Mix) (2002)
Gorillaz (2001) (Featuring Travis)
Gorillaz (1Mix) (2003)
Gorillaz (1000) (2001)

How do I parse the year in the cleanest, easiest way?

Right now I am parsing them by finding each '(' and then making sure that the character count between the parentheses is 4, that the first character is 1 or 2, and that the value can be parsed using TryParse.

Can I parse these kinds of strings using a single Regex?


Edit:

The year can be max 50-60 years old, so not older than 1950.

A: 

Looks tricky unless we know more about what some of those parentheses are for: if you could have a "(1000)" that's not really a year, you could probably have a "(2000)" that's not really a year also. I'm talking about the last line in your sample:

Gorillaz (1000) (2001)

If that's valid, why not something like this? :

Gorillaz (2000) (2001)

Where the (2000) in the latter example fills the same conceptual role as the (1000) from the former (it's not a year). How will your regex know which is the year? If you know this won't happen, how do you know it won't happen?

Joel Coehoorn
He says the music is not older than 60 years.
Dead account
Thanks Joel. What do you mean when you said 2000 isn't a year?
Joan Venge
I see. The year will only start from 1950. If there are 2 valid years, which might happen, then only the first one will be used. I will document this so that if users have files like these, it's their responsibility to make sure the names contain only 1 year.
Joan Venge
I don't think you quite have it yet. I'm talking about a separate attribute that's not a year at all, but still happens on occasion to be a 4-digit number between 1950 and 2050.
Joel Coehoorn
I see. But you mean a value like that which is also inside parentheses? Isn't that unlikely? :)
Joan Venge
From your sample, it looks almost inevitable.
Joel Coehoorn
Yeah, I see what you mean. But I think that's still not very likely. You are right, I had values like that. It's just that sometimes you come across half-number, half-text values in parentheses, and sometimes all numbers but not exactly 4 characters. If it is 4 characters, then it is most likely not a valid year. If it's a valid year, it's mostly outside the parentheses, like Overkill 2000 or something, but if it's inside the parentheses, there is definitely a user problem with the file names. A music file shouldn't have 2 different years both inside parentheses and both valid. Makes sense?
Joan Venge
Also, I mentioned Gorillaz (1000) (2001) just to eliminate cases with very close but invalid years. If there is more than 1 valid year, then I will just choose the 1st one.
Joan Venge
+2  A: 

This regex will match your pattern:

@"\(([12]\d{3})\)"

You can then extract Group 1 to get the year. You can then use Convert.ToInt32 to get the year as an int, and check it is greater than 1950 (it's probably better to do this as a numeric comparison rather than overcomplicating the regex).
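A minimal sketch of that approach, assuming the 1950 cutoff from the question's edit. The class name `YearParser`, the helper `TryGetYear`, and the `minYear` parameter are illustrative names, not part of the answer, and the sketch takes the first qualifying value, as the asker says they want.

```csharp
using System;
using System.Text.RegularExpressions;

static class YearParser
{
    // Same pattern as above: four digits starting with 1 or 2, inside parentheses.
    static readonly Regex Candidate = new Regex(@"\(([12]\d{3})\)");

    // Hypothetical helper: finds the first parenthesized candidate that is >= minYear.
    public static bool TryGetYear(string name, out int year, int minYear = 1950)
    {
        foreach (Match m in Candidate.Matches(name))
        {
            // Group 1 is the digits without the surrounding parentheses.
            int value = Convert.ToInt32(m.Groups[1].Value);
            if (value >= minYear)
            {
                year = value;
                return true;
            }
        }
        year = 0;
        return false;
    }
}
```

With this, `YearParser.TryGetYear("Gorillaz (1000) (2001)", out int y)` skips the (1000) in the numeric check and returns true with `y` set to 2001.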

Ben Lings
+1  A: 

You should be able to match this using regex. Here is a pattern you might try to use:

\([12][0-9]{3}\)

Don't forget to enable greedy. This will match the (1000) on the last line, as well. Is this wanted, too?

Edit:

 \((19|20)[0-9]{2}\)

will do the job if you don't want the (1000) as a match

regards

Atmocreations
I think that it's not supposed to match the `(1000)` in the last line since the year won't be any earlier than 1950.
Velociraptors
It does not match 1000, right. Well... 1000 _is_ actually earlier than 1950 ;) but yes, that's not the reason it doesn't get matched. Jon's pattern in his code looks better anyway :P
Atmocreations
+8  A: 

I think this does what you're after:

using System;
using System.Text.RegularExpressions;

class Test
{
    static void Main()
    {
        string[] samples = new[] { "Gorillaz (2001)",
                "Gorillaz (7th State Mix) (2002)",
                "Gorillaz (2001) (Featuring Travis)",
                "Two matches: (2002) (1950)",
                "Gorillaz (1Mix) (1952)",
                "Gorillaz (1Mix) (2003)",
                "Gorillaz (1000) (2001)" };

        foreach (string name in samples)
        {
            ShowMatches(name);
        }
    }

    static readonly Regex YearRegex = new Regex(@"\((19[5-9]\d|200\d)\)");

    static void ShowMatches(string name)
    {
        Console.WriteLine("Matches for: {0}", name);
        foreach (Match match in YearRegex.Matches(name))
        {
            Console.WriteLine(match.Value);
        }
    }
}

That will work as far as 2009. To make it work beyond that, use @"\((19[5-9]\d|20[01]\d)\)" etc.

Note that that still prints out the brackets - you could get rid of them with a group construct, but personally I'd just use Substring :)

Jon Skeet
Might want to make that 2\d{3} -- then you won't need to rewrite it for the next thousand years.
tvanfosson
No, it's deliberately like that - so that it won't miscategorize 2500 (which clearly *isn't* a year).
Jon Skeet
(Admittedly when you make it 20[01]\d it'll still assume that 2019 is a year when it's only 2010, but hey...)
Jon Skeet
@Jon, but it will be.
tvanfosson
Thanks Jon, do you know what I need to cover, say, 149 years? So from 1950 to 2099? Can I come up with a single regex for that?
Joan Venge
I just tried @"\((19[5-9]\d|20[01]\d)\)", but that didn't match 2055. I thought it's better to find a regex that's good for 149 years.
Joan Venge
I used @"\((19[5-9]\d|20[0-9]\d)\)", it worked. Is this an ok regex to use?
Joan Venge
Well you could use that, or just \d\d at the end to match two digits (or indeed \d{2}).
Jon Skeet
Thanks Jon.
Joan Venge
Btw I tried replacing the end like you said, but I messed it up. Is this how it should be? @"\((19[5-9]\d|20[0-9]\d{2})\)"? When I do that, it doesn't find any match.
Joan Venge
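On that last attempt: `20[0-9]\d{2}` asks for `20` followed by three more digits, so it can never match a four-digit year. Replacing the whole tail after `20` with `\d{2}` seems to be what was meant. A sketch, assuming the wanted range is 1950 through 2099; the class name `YearRange` and the helper `FirstYear` are illustrative only.

```csharp
using System;
using System.Text.RegularExpressions;

static class YearRange
{
    // 19[5-9]\d covers 1950-1999 and 20\d{2} covers 2000-2099.
    // Group 1 holds the year without the parentheses.
    static readonly Regex YearRegex = new Regex(@"\((19[5-9]\d|20\d{2})\)");

    // Hypothetical helper: returns the first year found, or null when there is none.
    public static string FirstYear(string name)
    {
        Match m = YearRegex.Match(name);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

Because the closing `\)` must follow the fourth digit directly, a value such as (20555) is not treated as a year.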
+1  A: 

Considering you have things that look like years but aren't e.g. (1000) I would look for 19**, 20**, and maybe 21** if you think your program is going to be around for a while :)

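/\((19\d\d|20\d\d|21\d\d)\)/

The alternation is wrapped in a group so that both parentheses are required for every alternative. For your inputs, group 1 gives: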

2001
2002
2001
2003
2001
2001