I'm writing a regular expression to match data from the IMDb soundtracks data file. My regexes are mostly working, although they are in places slurping too much text into my named groups. Take the following regex for example:

"^  Performed by '?(?<performer>.*)('? \(qv\))?$"

The performer group includes the string ' (qv) as well as the performer's name. Unfortunately, because the records are not consistently formatted, some performers' names are surrounded by single quotation marks whilst others are not. This means they are optional as far as the regex is concerned.

I've tried marking the last group as a greedy (atomic, non-backtracking) group using the (?>...) group specifier, but this had no apparent effect on the results.

I can improve the results by restricting the performer group to a small range of characters, but this reduces my chances of parsing the name out correctly. Furthermore, if I were to just exclude the apostrophe character, I would then be unable to parse, e.g., band names containing apostrophes, such as Elia's Lonely Friends Band, who performed Run For Your Life, featured in Resident Evil: Apocalypse.

Update: Here's an example input line that the regex should match, as requested. Other formats are also presented which my existing regex won't handle.

"  Performed by 'Carmen Silvera' (qv)"
+2  A: 

Here is a solution to your immediate problem, although, having looked through the IMDb soundtracks data file, I can say it will not handle everything in there.

var exp = new Regex(@"^  Performed by '?(?<performer>.*?)('? \(qv\))?$");

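Basically, you need to make the performer match non-greedy (lazy). With .* the performer group runs to the end of the line, and since the trailing ('? \(qv\))? group is optional, the regex engine never has to backtrack and give those characters up. With .*? the performer group takes as few characters as possible, so the trailing group gets the first chance to match the closing quote and (qv).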
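To make the difference concrete, here is a minimal sketch comparing the greedy pattern from the question with the lazy one above, using the sample line from the question. The class name PerformerPatterns and the helper Extract are made up for illustration; only the two patterns come from this thread.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class; the names here are illustrative only.
public static class PerformerPatterns
{
    // Greedy pattern from the question: .* consumes the rest of the line,
    // so the optional trailing group matches nothing.
    public static readonly Regex Greedy =
        new Regex(@"^  Performed by '?(?<performer>.*)('? \(qv\))?$");

    // Lazy pattern from the answer: .*? grows one character at a time,
    // letting the trailing group claim the closing quote and (qv).
    public static readonly Regex Lazy =
        new Regex(@"^  Performed by '?(?<performer>.*?)('? \(qv\))?$");

    // Returns the performer group, or an empty string if the line does not match.
    public static string Extract(Regex pattern, string line)
    {
        Match m = pattern.Match(line);
        return m.Success ? m.Groups["performer"].Value : string.Empty;
    }
}

// Usage, with the sample line from the question:
//   PerformerPatterns.Extract(PerformerPatterns.Greedy, "  Performed by 'Carmen Silvera' (qv)")
//     returns "Carmen Silvera' (qv)"
//   PerformerPatterns.Extract(PerformerPatterns.Lazy, "  Performed by 'Carmen Silvera' (qv)")
//     returns "Carmen Silvera"
```

Because the lazy group only stops where the rest of the pattern can match, an apostrophe inside a name (as in Elia's Lonely Friends Band) does not end the capture early.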

I'll add a comment to explain why this isn't going to be good enough for your project long term.

Andrew Anderson
The "*fun*" issue that you will run into is going to be multi-performer input like this:

Performed by 'José Carreras (IV)' (qv), 'Fina Brunet' (qv), 'Susanna Griso' (qv) and 'Gemma Nierga' (qv)

Combined with the fact that the name parsing is shared between a number of different tags (and not just "Performed by"), this suggests to me that you want to find a good way to extract a list of all names from a string in the general case.
Andrew Anderson
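One possible way to pull every name out of a multi-performer line is to match each quoted name individually with Regex.Matches instead of anchoring a single pattern to the whole line. The sketch below assumes every name is wrapped in single quotes and followed by (qv), which, as noted in the question, is not always true of the data file. NameExtractor and ExtractNames are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; assumes each name looks like 'Name' (qv).
public static class NameExtractor
{
    // Lazy .+? stops at the first quote that is followed by " (qv)",
    // so apostrophes inside a name do not end the capture early.
    private static readonly Regex QuotedName =
        new Regex(@"'(?<name>.+?)' \(qv\)");

    // Returns every quoted name on the line, in order of appearance.
    public static List<string> ExtractNames(string line)
    {
        var names = new List<string>();
        foreach (Match m in QuotedName.Matches(line))
        {
            names.Add(m.Groups["name"].Value);
        }
        return names;
    }
}

// Usage, with the multi-performer line quoted in the comment above:
//   NameExtractor.ExtractNames("Performed by 'José Carreras (IV)' (qv), 'Fina Brunet' (qv), " +
//                              "'Susanna Griso' (qv) and 'Gemma Nierga' (qv)")
//   yields four names: José Carreras (IV), Fina Brunet, Susanna Griso, Gemma Nierga
```

Unquoted names would be missed by this pattern, so it would need a fallback for records that omit the quotation marks.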
Yes, I currently have separate regexes for publisher, performer, lyricist and composer, with some fudging of the input to bring it into line (e.g., "Written by" sets the composer and lyricist properties of my parsed object to the same value). I am aware that there are many scenarios my regexes don't capture at the moment; I wanted to start off with something simple and build it up a little at a time.
alastairs
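As a rough illustration of the approach described in the comment above, the tag can be captured alongside the name and then mapped onto the parsed object, with "Written by" filling both the composer and the lyricist. Everything named here (SoundtrackCredits, CreditParser, TryApply) is hypothetical, and only the two tags mentioned in this thread are handled.

```csharp
using System.Text.RegularExpressions;

// Hypothetical parsed object; the real one is not shown in this thread.
public sealed class SoundtrackCredits
{
    public string Performer { get; set; } = "";
    public string Composer { get; set; } = "";
    public string Lyricist { get; set; } = "";
}

public static class CreditParser
{
    // Same lazy name pattern as in the answer, with the tag captured as well.
    private static readonly Regex TagLine =
        new Regex(@"^  (?<tag>Performed by|Written by) '?(?<name>.*?)('? \(qv\))?$");

    // Applies one line to the credits object; returns false if the line is not recognised.
    public static bool TryApply(string line, SoundtrackCredits credits)
    {
        Match m = TagLine.Match(line);
        if (!m.Success) return false;

        string name = m.Groups["name"].Value;
        switch (m.Groups["tag"].Value)
        {
            case "Performed by":
                credits.Performer = name;
                break;
            case "Written by":
                // "Written by" sets composer and lyricist to the same value.
                credits.Composer = name;
                credits.Lyricist = name;
                break;
        }
        return true;
    }
}
```

Keeping the name pattern in one place means a fix for, say, multi-name lines only has to be made once rather than in each per-tag regex.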
Fair enough - good luck with your project - if nothing else, the non-standardized formatting should provide for some fun mental puzzles.
Andrew Anderson
For some definitions of fun, yes :-)
alastairs