One of these days I'll get good at regex but for now...

I'm parsing an HTML page looking for MP3 files using the following expression (which works):

"<A HREF=\"([^\"]+)\"[^>]*>([^<]+?)\\.mp3</A>"

I now want to search for both MP3 and OGG files. This seems like a simple OR modification (.mp3 || .ogg), but I'm not quite sure how to put that in there. See http://stackoverflow.com/questions/2542559 for more info.

+4  A: 
Replace 
    \.mp3
with
    \.((mp3)|(ogg))

And beware of parsing HTML with regex.

Thomas L Holaday
+1. `And beware of parsing HTML with regex`
gnucom
Why double parentheses? `\.(mp3|ogg)` would be sufficient
seanizer
I put in extra parentheses because my preconscious parses (this|that) as matching thishat or thithat, and I have to override consciously by applying precedence rules.
Thomas L Holaday
+4  A: 

Understanding the pattern

You have the following Java string literal:

// Java string literal
"<A HREF=\"([^\"]+)\"[^>]*>([^<]+?)\\.mp3</A>"

The pattern represented by this string, when all escape sequences are processed, is this:

// the regex pattern
<A HREF="([^"]+)"[^>]*>([^<]+?)\.mp3</A>
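To see the difference between the string literal and the pattern for yourself, a quick sketch like the following may help. The class name LiteralDemo is made up for illustration; the literal itself is the one from the question.

```java
class LiteralDemo {
    // The Java string literal from the question. \" and \\ are escape
    // sequences that the compiler turns into " and \ respectively.
    static final String LITERAL = "<A HREF=\"([^\"]+)\"[^>]*>([^<]+?)\\.mp3</A>";

    public static void main(String[] args) {
        // Prints the pattern as the regex engine will see it:
        // <A HREF="([^"]+)"[^>]*>([^<]+?)\.mp3</A>
        System.out.println(LITERAL);
    }
}
```

Printing the string is a cheap way to check what the regex engine actually receives after Java has processed the escapes.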

Now let's break this pattern apart:

_________       _     _        _________
<A HREF="([^"]+)"[^>]*>([^<]+?)\.mp3</A>
         \_____/       \______/
            1              2

So the parts of this regex are:

  1. <A HREF=" matched literally
  2. ([^"]+), i.e. everything but doublequotes, captured in group 1
  3. " matched literally
  4. [^>]*, i.e. everything but >
  5. > matched literally
  6. ([^<]+?), i.e. everything but <, as few as possible, captured in group 2
  7. \.mp3</A>, i.e. .mp3</A> matched literally (the . is escaped by a backslash)

So looking at this, we can observe that the regex makes the following assumptions:

  • The href attribute value is matched by part 2; it must be enclosed in doublequotes, and it cannot itself contain any escaped doublequotes. This match is captured into group 1.
  • Any remaining attributes are matched by part 4. The href must be the first attribute, or else the regex won't match.
  • Part 6 matches the filename, capturing into group 2.
  • Part 7 matches the extension, and immediately after, the closing element. The reluctance of part 6 is probably not necessary.

Parsing HTML with regex is a tricky business, but given numerous assumptions, the above regex seems capable of doing the job most of the time.
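As a rough illustration of those assumptions, here is a minimal sketch that runs the original pattern against two made-up anchor tags. The class name, the helper firstMatch, and the sample markup are all hypothetical and not from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Mp3LinkDemo {
    // The original pattern from the question, as a Java string literal.
    static final Pattern MP3_LINK =
        Pattern.compile("<A HREF=\"([^\"]+)\"[^>]*>([^<]+?)\\.mp3</A>");

    // Returns "url -> filename" for the first MP3 link, or null if nothing matches.
    static String firstMatch(String html) {
        Matcher m = MP3_LINK.matcher(html);
        return m.find() ? m.group(1) + " -> " + m.group(2) : null;
    }

    public static void main(String[] args) {
        // HREF comes first, so the pattern matches.
        System.out.println(firstMatch(
            "<A HREF=\"/music/song.mp3\" class=\"dl\">song.mp3</A>"));
        // prints: /music/song.mp3 -> song

        // HREF is not the first attribute, so the pattern does not match.
        System.out.println(firstMatch(
            "<A class=\"dl\" HREF=\"/music/song.mp3\">song.mp3</A>"));
        // prints: null
    }
}
```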


Modifying the pattern

Alternation in regex is done using the vertical bar. It's important to understand its precedence, and how grouping can be useful.

  • this|that matches one of these two strings:
    • "this"
    • "that"
  • this|that thing matches one of these two strings:
    • "this"
    • "that thing"
  • (this|that) thing matches one of these two strings:
    • "this thing"
    • "that thing"
  • (this|that) (thing|stuff) matches one of these four strings:
    • "this thing"
    • "that thing"
    • "this stuff"
    • "that stuff"
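If you'd rather check the precedence rules than take them on faith, something like this sketch should do. The class and the helper full are made up for illustration; Pattern.matches requires the whole input to match the regex.

```java
import java.util.regex.Pattern;

class AlternationDemo {
    // True only if the ENTIRE input matches the regex.
    static boolean full(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Without grouping, | splits the whole pattern in two.
        System.out.println(full("this|that thing", "this"));         // true
        System.out.println(full("this|that thing", "that thing"));   // true
        System.out.println(full("this|that thing", "this thing"));   // false

        // With grouping, the alternation is limited to the parentheses.
        System.out.println(full("(this|that) thing", "this thing")); // true
        System.out.println(full("(this|that) thing", "this"));       // false
    }
}
```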

So to allow both the mp3 and ogg extensions, we can change the mp3 in the pattern to (mp3|ogg). Note that this group also captures the matched extension, as group 3.

The final pattern, therefore, is:

<A HREF="([^"]+)"[^>]*>([^<]+)\.(mp3|ogg)</A>
         \_____/       \_____/  \_______/
          1:url      2:filename   3:ext

As a Java string literal, this is:

"<A HREF=\"([^\"]+)\"[^>]*>([^<]+)\\.(mp3|ogg)</A>"

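Here is one way the final pattern might be used to collect every match in a page. This is only a sketch: the class name, the helper findAudio, the output format, and the sample HTML are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AudioLinkDemo {
    // The final pattern: group 1 = url, group 2 = filename, group 3 = extension.
    static final Pattern AUDIO_LINK =
        Pattern.compile("<A HREF=\"([^\"]+)\"[^>]*>([^<]+)\\.(mp3|ogg)</A>");

    // Collects "ext: filename @ url" for every MP3 or OGG link found.
    static List<String> findAudio(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = AUDIO_LINK.matcher(html);
        while (m.find()) {
            out.add(m.group(3) + ": " + m.group(2) + " @ " + m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sample page with one MP3, one OGG, and one WAV link.
        String html = "<A HREF=\"a.mp3\">a.mp3</A> "
                    + "<A HREF=\"b.ogg\">b.ogg</A> "
                    + "<A HREF=\"c.wav\">c.wav</A>";
        System.out.println(findAudio(html));
        // prints: [mp3: a @ a.mp3, ogg: b @ b.ogg]
    }
}
```

The WAV link is skipped because wav is not one of the alternatives in group 3.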
Appendix

The […] is a character class. Something like [aeiou] matches any one of the lowercase vowels. [^…] is a negated character class. [^aeiou] matches any one character other than the lowercase vowels.

The (…) is a capturing group. It allows the string that it matched to be retrieved later.

The * and + are repetition specifiers. By default, repetition is greedy (i.e. match as much as possible). The ? in +? makes it reluctant (i.e. match as few as possible).

Note that ? may also serve as an optional repetition specifier in other contexts.
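To make the greedy versus reluctant difference concrete, here is a small sketch. The patterns <(.+)> and <(.+?)> and the input string are made up for illustration and are not part of the question's regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyDemo {
    // Returns what group 1 captured on the first match, or null if no match.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "<b>bold</b>";
        // Greedy: .+ grabs as much as it can, then backs off to the last >.
        System.out.println(firstGroup("<(.+)>", input));  // b>bold</b
        // Reluctant: .+? stops at the first > it can.
        System.out.println(firstGroup("<(.+?)>", input)); // b
    }
}
```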

The . is a metacharacter that matches (almost) any character. Since we want a literal period, we escape it by preceding it with a backslash (written as a double backslash, \\, in a Java string literal).

Note that a regex pattern is case sensitive by default. In Java, you may want to use the Pattern.CASE_INSENSITIVE flag (embeddable as (?i) in the pattern).
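Both ways of turning on case insensitivity can be sketched as below, using the final pattern from above. The class name, the helper found, and the lowercase sample tag are assumptions for illustration.

```java
import java.util.regex.Pattern;

class CaseInsensitiveDemo {
    static final String REGEX = "<A HREF=\"([^\"]+)\"[^>]*>([^<]+)\\.(mp3|ogg)</A>";

    // Default behavior: case sensitive.
    static final Pattern STRICT = Pattern.compile(REGEX);
    // Case insensitive via the flag argument.
    static final Pattern FLAGGED = Pattern.compile(REGEX, Pattern.CASE_INSENSITIVE);
    // Case insensitive via the embedded (?i) flag.
    static final Pattern EMBEDDED = Pattern.compile("(?i)" + REGEX);

    static boolean found(Pattern p, String html) {
        return p.matcher(html).find();
    }

    public static void main(String[] args) {
        // Hypothetical tag with lowercase markup and an uppercase extension.
        String html = "<a href=\"song.MP3\">song.MP3</a>";
        System.out.println(found(STRICT, html));   // false
        System.out.println(found(FLAGGED, html));  // true
        System.out.println(found(EMBEDDED, html)); // true
    }
}
```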

polygenelubricants
+1 instead of just giving an answer, you really went above and beyond, explaining how and why. You are the type of person that SO needs more of.
PiPeep