views: 259
answers: 3
Not every website exposes its data in a convenient form, such as XML feeds or APIs.

How could I go about extracting information from a website? For example:

...
<div>
  <div>
    <span id="important-data">information here</span>
  </div>
</div>
...

I come from a background of Java programming and coding with Apache XMLBeans. Is there anything similar for parsing HTML, when I know the structure and the data sits inside a known tag?

Thanks

+4  A: 

Here's an article that covers a couple of screen scraping tools written in Java.

In general, it sounds like you want to take a look at regular expressions, which do the pattern matching you're looking for.
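For the markup in the question, a regex approach might look like the sketch below (the class name is illustrative, and the HTML is inlined rather than fetched). Keep in mind that regexes are fragile against attribute reordering, extra whitespace, and nesting, so this only holds up when the markup is as predictable as shown:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexScrape {
    public static void main(String[] args) {
        String html = "<div><div>"
            + "<span id=\"important-data\">information here</span>"
            + "</div></div>";

        // Match the known tag and capture everything up to the closing tag.
        Pattern p = Pattern.compile(
            "<span id=\"important-data\">([^<]*)</span>");
        Matcher m = p.matcher(html);
        if (m.find()) {
            System.out.println(m.group(1)); // prints "information here"
        }
    }
}
```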

Hope that helps!

Zachary Yates
+3  A: 

There are several Open Source HTML Parsers out there for Java.

I have used JTidy in the past, and have had good luck with it. It will give you a DOM of the HTML page, and you should be able to grab the tags you need from there.
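A sketch of the DOM-walking step (the class name and inlined HTML are mine): since JTidy isn't on the default classpath, this uses the JDK's own XML parser on an already well-formed snippet, but the extraction code is identical once JTidy's parseDOM() hands you an org.w3c.dom.Document built from messy real-world HTML:

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class DomScrape {
    public static void main(String[] args) throws Exception {
        // Well-formed here; JTidy would produce an equivalent DOM
        // from tag soup.
        String html = "<div><div>"
            + "<span id=\"important-data\">information here</span>"
            + "</div></div>";

        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(html.getBytes("UTF-8")));

        // getElementById() needs a DTD to know which attribute is the ID,
        // so scan <span> elements and check the attribute by hand.
        NodeList spans = doc.getElementsByTagName("span");
        for (int i = 0; i < spans.getLength(); i++) {
            Element span = (Element) spans.item(i);
            if ("important-data".equals(span.getAttribute("id"))) {
                System.out.println(span.getTextContent());
            }
        }
    }
}
```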

James Van Huis
A: 

Java seems like a fairly difficult constraint for such a task. Is that a hard requirement? Scripting languages are ideal for building what is really lots of last-mile code.

If you're open to it, ruby + hpricot makes this completely trivial. You can use CSS or XPath selectors (or both) to find and manipulate content in HTML. Grabbing the document, parsing it, and extracting the text in your example is literally one line of code.

Dustin