OK, I know everyone is going to tell me not to use regex for parsing HTML, but I'm programming on Android and don't have ready access to an HTML parser (that I'm aware of). Besides, this is server-generated HTML, which should be more consistent than user-generated HTML.

The regex looks like this:

Pattern patternMP3 = Pattern.compile(
        "<A HREF=\"[^\"]+.+\\.mp3</A>",
        Pattern.CASE_INSENSITIVE |
        Pattern.UNICODE_CASE);
Matcher matcherMP3 = patternMP3.matcher(HTML);
while (matcherMP3.find()) { ... }

The input HTML is all on one line, which is causing the problem. When the HTML is on separate lines this pattern works. Any suggestions?

A: 

You shouldn't be matching '.+' since you've already got [^\"]+ (which is better for your particular situation).

Try:

"<A HREF=\"[^\"]+\\.mp3\"</A>"

Also, don't forget the double-quote after the mp3.
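
To illustrate why the '.+' causes trouble on single-line input, here is a minimal sketch. The two-link sample string and the class and method names are made up for the demonstration and are not from the original post. By default '.' does not match line terminators, so on multi-line input the greedy '.+' is stopped at the end of each line; on one line it runs to the last ".mp3</A>" and swallows everything in between.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyDemo {
    // The pattern from the question, unchanged.
    static final String GREEDY = "<A HREF=\"[^\"]+.+\\.mp3</A>";

    // Returns every substring of the input that the regex matches.
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sample input: the same two links, with and without a line break.
        String oneLine = "<A HREF=\"a.mp3\">a.mp3</A><A HREF=\"b.mp3\">b.mp3</A>";
        String twoLines = "<A HREF=\"a.mp3\">a.mp3</A>\n<A HREF=\"b.mp3\">b.mp3</A>";

        // One match spanning both links, because '.+' is greedy.
        System.out.println(findAll(GREEDY, oneLine).size());  // prints 1
        // Two matches, because '.' stops at the line break.
        System.out.println(findAll(GREEDY, twoLines).size()); // prints 2
    }
}
```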

CWF
Not getting any matches with this. Note that I'm attempting to capture all text after <A HREF=" and before .mp3</A>. I need both the hyperlink and file name, which I parse later.
DiskCrasher
+1  A: 

The regex

"<A HREF=\"([^\"]+)\"[^>]*>([^<]+?)\\.mp3</A>"

should match your links, and have the link and the filename in its groups. Note, though, that the argument of href does not necessarily need to be enclosed in quotes in HTML. (Or, if it needs to be, neither browsers nor developers know that =). )
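
As a rough sketch of how the two groups could be read out in the question's find() loop: the class name, the extract helper and the example.com URLs below are hypothetical and only serve the illustration. Note that group 2 holds the link text without the ".mp3" suffix, since the suffix sits outside the parentheses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Mp3Links {
    // Group 1: the href value. Group 2: the link text up to (not including) ".mp3".
    static final Pattern MP3_LINK = Pattern.compile(
            "<A HREF=\"([^\"]+)\"[^>]*>([^<]+?)\\.mp3</A>",
            Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

    // Hypothetical helper: returns {href, name} pairs for every link found.
    static List<String[]> extract(String html) {
        List<String[]> links = new ArrayList<>();
        Matcher m = MP3_LINK.matcher(html);
        while (m.find()) {
            links.add(new String[] { m.group(1), m.group(2) });
        }
        return links;
    }

    public static void main(String[] args) {
        // Made-up single-line input with two links, one carrying an extra attribute.
        String html = "<A HREF=\"http://example.com/a.mp3\">a.mp3</A>"
                + "<A HREF=\"http://example.com/b.mp3\" class=\"x\">b.mp3</A>";
        for (String[] link : extract(html)) {
            System.out.println(link[0] + " -> " + link[1]);
        }
    }
}
```

Because neither [^\"]+ nor [^<]+? can cross a quote or a tag boundary, each match stays inside a single link even when the whole page is on one line.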

Jens
+1, although I'd make `[^<]+` greedy here
Qtax
Cool, this looks to be working. Thanks!
DiskCrasher
A: 

For your information, on Android you can parse HTML 'properly' with a combination of org.cyberneko.html.parsers.SAXParser, org.xml.sax.* and org.dom4j.*.

http://sourceforge.net/projects/nekohtml

http://www.saxproject.org

http://dom4j.sourceforge.net

Jim Blackler