ansaurus

Question

Answer 1

+4 A:

Replace 
    \.mp3
with
    \.((mp3)|(ogg))

And beware of parsing HTML with regex.

Thomas L Holaday 2010-08-22 06:19:34

+1. `And beware of parsing HTML with regex`

gnucom 2010-08-22 06:36:12

Why double parentheses? `\.(mp3|ogg)` would be sufficient

seanizer 2010-08-22 07:07:37

I put in extra parentheses because my preconscious parses (this|that) as matching thishat or thithat, and I have to override consciously by applying precedence rules.

Thomas L Holaday 2010-08-22 15:26:40

Answer 2

+4 A:

Understanding the pattern

You have the following Java string literal:

// Java string literal
"<A HREF=\"([^\"]+)\"[^>]*>([^<]+?)\\.mp3</A>"

The pattern represented by this string, when all escape sequences are processed, is this:

// the regex pattern
<A HREF="([^"]+)"[^>]*>([^<]+?)\.mp3</A>
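To see that the string literal really denotes this pattern, here is a minimal sketch that compiles it and prints the pattern text back. The class name LiteralVsPattern and the helper patternText are made up for illustration.

```java
import java.util.regex.Pattern;

public class LiteralVsPattern {
    // The Java string literal from above; \" and \\ are Java-level escapes.
    static final String LITERAL = "<A HREF=\"([^\"]+)\"[^>]*>([^<]+?)\\.mp3</A>";

    // Returns the pattern text exactly as the regex engine receives it.
    static String patternText() {
        return Pattern.compile(LITERAL).pattern();
    }

    public static void main(String[] args) {
        // prints <A HREF="([^"]+)"[^>]*>([^<]+?)\.mp3</A>
        System.out.println(patternText());
    }
}
```

The Java compiler consumes one level of backslashes, so the regex engine only ever sees a single backslash before the period.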

Now let's break this pattern apart:

_________       _     _        E________
<A HREF="([^"]+)"[^>]*>([^<]+?)\.mp3</A>
         \_____/       \______/
            1              2

So the parts of this regex are:

1. <A HREF=" matched literally
2. ([^"]+), i.e. one or more characters other than doublequotes, captured in group 1
3. " matched literally
4. [^>]*, i.e. zero or more characters other than >
5. > matched literally
6. ([^<]+?), i.e. one or more characters other than <, as few as possible, captured in group 2
7. \.mp3</A> matched literally (the . is escaped by a backslash)

So looking at this, we can observe that the regex makes the following assumptions:

The href attribute value is matched by part 2; it must be enclosed in doublequotes, and cannot itself contain any escaped doublequotes. This match is captured into group 1.
Any remaining attributes are matched by part 4. The href must be the first attribute, or else the regex won't match.
Part 6 matches the filename, capturing into group 2.
Part 7 matches the extension, and immediately after, the closing element. The reluctance of part 6 is probably not necessary.

Parsing HTML with regex is a tricky business, but given numerous assumptions, the above regex seems capable of doing the job most of the time.
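A rough way to probe these assumptions is to run the pattern against a few hand-written links. This is only a sketch: the class AssumptionCheck, the helper firstMatch and the sample links are invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AssumptionCheck {
    static final Pattern MP3_LINK =
        Pattern.compile("<A HREF=\"([^\"]+)\"[^>]*>([^<]+?)\\.mp3</A>");

    // Returns "url|filename" for the first match, or null if nothing matches.
    static String firstMatch(String html) {
        Matcher m = MP3_LINK.matcher(html);
        return m.find() ? m.group(1) + "|" + m.group(2) : null;
    }

    public static void main(String[] args) {
        // HREF comes first, another attribute follows: prints /music/song.mp3|song
        System.out.println(firstMatch("<A HREF=\"/music/song.mp3\" TITLE=\"x\">song.mp3</A>"));
        // HREF is not the first attribute: prints null
        System.out.println(firstMatch("<A TITLE=\"x\" HREF=\"/music/song.mp3\">song.mp3</A>"));
        // Attribute value in single quotes: prints null
        System.out.println(firstMatch("<A HREF='/music/song.mp3'>song.mp3</A>"));
    }
}
```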

Modifying the pattern

Alternation in regex is done using the vertical bar. It's important to understand its precedence, and how grouping can be useful.

this|that matches one of these two strings:
- "this"
- "that"
this|that thing matches one of these two strings:
- "this"
- "that thing"
(this|that) thing matches one of these two strings:
- "this thing"
- "that thing"
(this|that) (thing|stuff) matches one of these four strings:
- "this thing"
- "that thing"
- "this stuff"
- "that stuff"
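The precedence rules above can be checked with a short sketch. AlternationDemo and its matches helper are names made up here; Pattern.matches requires the whole input to match.

```java
import java.util.regex.Pattern;

public class AlternationDemo {
    // True only if the entire input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Without grouping, the alternatives are "this" and "that thing".
        System.out.println(matches("this|that thing", "this"));          // true
        System.out.println(matches("this|that thing", "this thing"));    // false
        // Grouping limits the alternation to the first word.
        System.out.println(matches("(this|that) thing", "this thing"));  // true
        System.out.println(matches("(this|that) (thing|stuff)", "that stuff")); // true
    }
}
```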

So to allow both mp3 and ogg extensions, we can modify the mp3 in the pattern to (mp3|ogg). Note that this group will match and capture the extension into group 3.

The final pattern, therefore, is:

<A HREF="([^"]+)"[^>]*>([^<]+)\.(mp3|ogg)</A>
         \_____/       \_____/  \_______/
          1:url      2:filename   3:ext

As a Java string literal, this is:

"<A HREF=\"([^\"]+)\"[^>]*>([^<]+)\\.(mp3|ogg)</A>"

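As a usage sketch, the final literal can be plugged into a Matcher loop to pull out all three groups. The class AudioLinkExtractor, the method extract and the sample HTML are assumptions for illustration, not part of the original question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AudioLinkExtractor {
    static final Pattern AUDIO_LINK =
        Pattern.compile("<A HREF=\"([^\"]+)\"[^>]*>([^<]+)\\.(mp3|ogg)</A>");

    // Collects "url filename ext" for every matching link in the input.
    static List<String> extract(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = AUDIO_LINK.matcher(html);
        while (m.find()) {
            found.add(m.group(1) + " " + m.group(2) + " " + m.group(3));
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<A HREF=\"/a/one.mp3\">one.mp3</A> "
                    + "<A HREF=\"/a/two.ogg\">two.ogg</A> "
                    + "<A HREF=\"/a/three.wav\">three.wav</A>";
        // prints "/a/one.mp3 one mp3" and "/a/two.ogg two ogg"; the wav link is skipped
        for (String link : extract(html)) {
            System.out.println(link);
        }
    }
}
```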
Appendix

The […] is a character class. Something like [aeiou] matches any one of the lowercase vowels. [^…] is a negated character class. [^aeiou] matches any one character other than the lowercase vowels.

The (…) is a capturing group. It allows the string that it matched to be retrieved later.

The * and + are repetition specifiers (zero or more and one or more, respectively). By default, repetition is greedy (i.e. match as much as possible). The ? in +? makes it reluctant (i.e. match as few as possible).

Note that ? may also serve as the optional repetition specifier (zero or one occurrence) in other contexts.
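The difference between greedy and reluctant repetition shows up when more than one split of the input is possible. A small sketch, with the class GreedyVsReluctant and the sample input invented for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsReluctant {
    // Returns group 1 of the first match, or null if there is no match.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "a.mp3.mp3";
        // Greedy: .+ takes as much as it can, leaving only the last ".mp3".
        System.out.println(firstGroup("(.+)\\.mp3", input));   // a.mp3
        // Reluctant: .+? stops at the first point where ".mp3" can follow.
        System.out.println(firstGroup("(.+?)\\.mp3", input));  // a
    }
}
```

In the link pattern the repeated class is [^<] and the extension must be followed directly by </A>, so both variants should end up with the same match there, which fits the remark that the reluctance is probably not necessary.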

The . is a metacharacter that matches (almost) any character. Since we want a literal period, we escape it by preceding it with a backslash (written as a double backslash in the Java string literal).
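To illustrate why the escape matters, here is a quick sketch; DotEscapeDemo and the sample strings are made up for the example.

```java
import java.util.regex.Pattern;

public class DotEscapeDemo {
    // True only if the entire input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Unescaped . accepts any character in that position.
        System.out.println(matches("song.mp3", "song-mp3"));    // true
        // Escaped . accepts only a literal period.
        System.out.println(matches("song\\.mp3", "song-mp3"));  // false
        System.out.println(matches("song\\.mp3", "song.mp3"));  // true
    }
}
```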

Note that regex patterns are case sensitive by default. In Java, you may want to use the Pattern.CASE_INSENSITIVE flag (embeddable as (?i) in the pattern).
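Both ways of turning on case-insensitive matching can be compared side by side. This is a sketch only; the class CaseInsensitiveDemo, the helper found and the lowercase sample link are assumptions for illustration.

```java
import java.util.regex.Pattern;

public class CaseInsensitiveDemo {
    static final String REGEX = "<A HREF=\"([^\"]+)\"[^>]*>([^<]+)\\.(mp3|ogg)</A>";

    // True if the pattern matches anywhere in the input.
    static boolean found(Pattern p, String html) {
        return p.matcher(html).find();
    }

    public static void main(String[] args) {
        String html = "<a href=\"/a/one.MP3\">one.MP3</a>";
        Pattern strict = Pattern.compile(REGEX);
        Pattern flagged = Pattern.compile(REGEX, Pattern.CASE_INSENSITIVE);
        Pattern embedded = Pattern.compile("(?i)" + REGEX);
        System.out.println(found(strict, html));    // false
        System.out.println(found(flagged, html));   // true
        System.out.println(found(embedded, html));  // true
    }
}
```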

polygenelubricants 2010-08-22 08:09:59

+1 instead of just giving an answer, you really went above and beyond, explaining how and why. You are the type of person that SO needs more of.

PiPeep 2010-08-22 19:23:41

Need help modifying regular expression
