Hi,

I'm trying to extract all matches from an EBML definition, which looks something like this:

| + A track
|  + Track number: 3
|  + Track UID: 724222477
|  + Track type: subtitles
...
|  + Language: eng
...
| + A track
|  + Track number: 4
|  + Track UID: 745646561
|  + Track type: subtitles
...
|  + Language: jpn
...

I want all occurrences of "Language: ???" when preceded by "Track type: subtitles". I tried several variations of this:

Track type: subtitles.*Language: (\w\w\w)

I'm using the multi-line modifier in Ruby so that . matches newlines (like the 's' modifier in other languages).

This only gets the last occurrence, which in the example above is 'jpn':

string.scan(/Track type: subtitles.*Language: (\w\w\w)/m)
=> [["jpn"]]

The result I'd like:

=> [["eng"], ["jpn"]]

What would be a correct regex to accomplish this?

+3  A: 

You need to use a lazy quantifier instead of .*. Try this:

/Track type: subtitles.*?Language: (\w\w\w)/m

This should get you the first occurrence of "Language: ???" after each "Track type: subtitles". But it would get confused if some track of type subtitles were missing the Language field: the match would run on into a later track and pick up that track's language instead.
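A small sketch of both behaviours, assuming input shaped like the sample in the question. The constant names and the second input (a subtitle track with no Language line, followed by an audio track) are made up for illustration:

```ruby
LAZY_PATTERN = /Track type: subtitles.*?Language: (\w\w\w)/m

# Every subtitle track has a Language line.
COMPLETE_TRACKS = <<~EBML
  | + A track
  |  + Track number: 3
  |  + Track type: subtitles
  |  + Language: eng
  | + A track
  |  + Track number: 4
  |  + Track type: subtitles
  |  + Language: jpn
EBML

# Hypothetical input: the subtitle track has no Language line,
# and the next track is audio.
INCOMPLETE_TRACKS = <<~EBML
  | + A track
  |  + Track number: 3
  |  + Track type: subtitles
  | + A track
  |  + Track number: 4
  |  + Track type: audio
  |  + Language: jpn
EBML

p COMPLETE_TRACKS.scan(LAZY_PATTERN)    # => [["eng"], ["jpn"]]
p INCOMPLETE_TRACKS.scan(LAZY_PATTERN)  # => [["jpn"]], taken from the audio track
```

The second result is the "confused" case: the lazy match still has to reach some Language line, so it crosses the track boundary to find one.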


Another way to do this would be:

/^\| \+ (?:(?!^\| \+).)*?^\|  \+ Track type: subtitles$(?:(?!^\| \+).)*?^\|  \+ Language: (\w+)$/m

Looks somewhat messy, but should take care of the problem with the previous one.
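As a rough check of that idea, here is a sketch that applies the pattern with the track-type line written using the same `^\|  \+ ` prefix as the Language line (pipe, two spaces, plus, one space, as in the sample data). The method name and the input are illustrative only; the second track deliberately lacks a Language line:

```ruby
# Returns the languages of subtitle tracks. The (?!^\| \+) lookahead stops
# a match from crossing into the next "| + ..." top-level element.
def subtitle_languages_strict(text)
  pattern = /^\| \+ (?:(?!^\| \+).)*?^\|  \+ Track type: subtitles$(?:(?!^\| \+).)*?^\|  \+ Language: (\w+)$/m
  text.scan(pattern).flatten
end

sample = <<~EBML
  | + A track
  |  + Track type: subtitles
  |  + Language: eng
  | + A track
  |  + Track type: subtitles
  | + A track
  |  + Track type: audio
  |  + Language: jpn
  | + A track
  |  + Track type: subtitles
  |  + Language: fre
EBML

p subtitle_languages_strict(sample)  # => ["eng", "fre"]
```

The subtitle track without a Language line simply produces no match, and the audio track's language is not attributed to it.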


A cleaner way would be to tokenize the string:

/^\| \+ ([^\r\n]+)|^\|  \+ Track type: (subtitles)|^\|  \+ Language: (\w+)/m

(Take note of the number of spaces)

For each match, check which of the capture groups is defined. Only one group will have a value for any single match.

  • If it is the first group, a new track has started. Discard any stored information about the previous track.
  • If it is the second group, the current track is of type subtitles.
  • If it is the third group, the language of this track is found.
  • Whenever you know the language of a track, and that it is of type subtitles, report it.
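The steps above could be sketched in Ruby roughly as follows. The method and variable names are my own, and resetting the state after reporting (so a track is reported at most once) is an assumption rather than something the steps spell out:

```ruby
TOKEN = /^\| \+ ([^\r\n]+)|^\|  \+ Track type: (subtitles)|^\|  \+ Language: (\w+)/m

def subtitle_languages_tokenized(text)
  languages = []
  is_subtitles = false
  language = nil

  # scan yields the three capture groups; exactly one is non-nil per match.
  text.scan(TOKEN) do |track_start, type, lang|
    if track_start
      # A new top-level element started: forget the previous track.
      is_subtitles = false
      language = nil
    elsif type
      is_subtitles = true
    elsif lang
      language = lang
    end

    # Report once both facts about the current track are known.
    if is_subtitles && language
      languages << language
      is_subtitles = false
      language = nil
    end
  end

  languages
end

sample = <<~EBML
  | + A track
  |  + Track type: subtitles
  |  + Language: eng
  | + A track
  |  + Track type: audio
  |  + Language: jpn
EBML

p subtitle_languages_tokenized(sample)  # => ["eng"]
```

Because the state is tracked per track, this also works if the Language line happens to come before the Track type line.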
MizardX
+7  A: 

You need to make your regex non-greedy by changing this:

.*

To this:

.*?

Your regex is matching from the first occurrence of Track type: subtitles to the last occurrence of Language: (\w\w\w). Making it non-greedy will work because it then matches as few characters as possible, so each match stops at the nearest Language line.
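To see the difference in how much text each version consumes, here is a minimal comparison on a cut-down input (the constant names and the shortened sample are just for illustration):

```ruby
SAMPLE = "Track type: subtitles\nLanguage: eng\n" \
         "Track type: subtitles\nLanguage: jpn\n"

GREEDY = /Track type: subtitles.*Language: (\w\w\w)/m
LAZY   = /Track type: subtitles.*?Language: (\w\w\w)/m

p SAMPLE.scan(GREEDY)  # => [["jpn"]]
p SAMPLE.scan(LAZY)    # => [["eng"], ["jpn"]]

# String#[] with a regex returns the text of the first match.
p SAMPLE[GREEDY].lines.size  # => 4, one match swallows both tracks
p SAMPLE[LAZY].lines.size    # => 2, the match ends at the first Language line
```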

yjerem
To Jeremy: wait... you're 16 and understand 'greediness'?! And 8 Nice Answer badges?! Dang! Whatever job you're doing, you're not getting paid enough. Start offshore/nearshore coding, like yesterday! You'll make a tonne of cash before you're even out of school.
Keng
Oh thank you, I have a few less grey hairs coming.
DaveShaw