ansaurus

Question

Trying to parse links in an HTML directory listing using Java regex

Answer 1

A:

You shouldn't be matching '.+' since you've already got [^\"]+ (which is better for your particular situation).

Try:

"<A HREF=\"[^\"]+\\.mp3\"</A>"

Also, don't forget the double-quote after the mp3.

CWF 2010-03-30 02:53:08

Not getting any matches with this. Note that I'm attempting to capture all text after <A HREF=" and before .mp3</A>. I need both the hyperlink and file name, which I parse later.

DiskCrasher 2010-03-30 04:33:56

Answer 2

+1 A:

The regex

"<A HREF=\"([^\"]+)\"[^>]*>([^<]+?)\\.mp3</A>"

should match your links, capturing the link in group 1 and the file name (without the .mp3 extension) in group 2. Note, though, that the value of href does not necessarily need to be enclosed in quotes in HTML. (Or, if it needs to be, neither browsers nor developers know that =). )
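As a rough illustration of how the pattern above might be applied, here is a minimal sketch. The class name Mp3LinkExtractor, the extract method and the sample listing are made up for this example and are not from the original question; it also assumes the listing uses uppercase <A HREF="..."> with quoted attribute values, as the pattern does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named here only for illustration.
public class Mp3LinkExtractor {

    // The pattern from the answer, unchanged:
    // group 1 = the href value, group 2 = the link text without ".mp3".
    private static final Pattern MP3_LINK =
            Pattern.compile("<A HREF=\"([^\"]+)\"[^>]*>([^<]+?)\\.mp3</A>");

    // Returns one {href, name} pair per matching link, in document order.
    public static List<String[]> extract(String html) {
        List<String[]> links = new ArrayList<>();
        Matcher m = MP3_LINK.matcher(html);
        while (m.find()) {
            links.add(new String[] { m.group(1), m.group(2) });
        }
        return links;
    }

    public static void main(String[] args) {
        // Assumed sample input; a real directory listing may be formatted differently.
        String listing =
                "<A HREF=\"/music/readme.txt\">readme.txt</A><br>\n"
              + "<A HREF=\"/music/song%20one.mp3\">song one.mp3</A><br>\n";

        for (String[] link : extract(listing)) {
            System.out.println(link[0] + " -> " + link[1]);
        }
    }
}
```

With the assumed sample listing, only the second link matches: group 1 holds the full href and group 2 holds the link text up to, but not including, the .mp3 extension. Links written in lowercase or with unquoted href values would be skipped by this pattern.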

Jens 2010-03-30 06:20:47

+1, although I'd make `[^<]+` greedy here

Qtax 2010-03-30 14:38:29

Cool, this looks to be working. Thanks!

DiskCrasher 2010-03-31 23:59:05

Answer 3

A:

For your information, on Android you can parse HTML 'properly' with a combination of org.cyberneko.html.parsers.SAXParser, org.xml.sax.* and org.dom4j.*.

http://sourceforge.net/projects/nekohtml

http://www.saxproject.org

http://dom4j.sourceforge.net

Jim Blackler 2010-03-30 11:55:48
