views: 259
answers: 3
Not every website exposes its data in a convenient form, such as XML feeds or APIs.

How could I go about extracting information from a website? For example:

...
<div>
  <div>
    <span id="important-data">information here</span>
  </div>
</div>
...

I come from a background of Java programming and coding with Apache XMLBeans. Is there anything similar for parsing HTML, when I know the structure and the data sits inside a known tag?

Thanks

+4  A: 

Here's an article that covers a couple of screen scraping tools written in Java.

In general, it sounds like you want to take a look at regular expressions, which do the pattern matching you're looking for.
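For the markup in the question, a regex approach might look like the sketch below (the class name is illustrative, and the HTML is inlined rather than fetched). Keep in mind that regexes are fragile against attribute reordering, extra whitespace, and nesting, so this only holds up when the markup is as predictable as shown:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexScrape {
    public static void main(String[] args) {
        String html = "<div><div>"
            + "<span id=\"important-data\">information here</span>"
            + "</div></div>";

        // Match the known tag and capture everything up to the closing tag.
        Pattern p = Pattern.compile(
            "<span id=\"important-data\">([^<]*)</span>");
        Matcher m = p.matcher(html);
        if (m.find()) {
            System.out.println(m.group(1)); // prints "information here"
        }
    }
}
```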

Hope that helps!

Zachary Yates
+3  A: 

There are several Open Source HTML Parsers out there for Java.

I have used JTidy in the past, and have had good luck with it. It will give you a DOM of the HTML page, and you should be able to grab the tags you need from there.
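A sketch of the DOM-walking step (the class name and inlined HTML are mine): since JTidy isn't on the default classpath, this uses the JDK's own XML parser on an already well-formed snippet, but the extraction code is identical once JTidy's parseDOM() hands you an org.w3c.dom.Document built from messy real-world HTML:

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class DomScrape {
    public static void main(String[] args) throws Exception {
        // Well-formed here; JTidy would produce an equivalent DOM
        // from tag soup.
        String html = "<div><div>"
            + "<span id=\"important-data\">information here</span>"
            + "</div></div>";

        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(html.getBytes("UTF-8")));

        // getElementById() needs a DTD to know which attribute is the ID,
        // so scan <span> elements and check the attribute by hand.
        NodeList spans = doc.getElementsByTagName("span");
        for (int i = 0; i < spans.getLength(); i++) {
            Element span = (Element) spans.item(i);
            if ("important-data".equals(span.getAttribute("id"))) {
                System.out.println(span.getTextContent());
            }
        }
    }
}
```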

James Van Huis
A: 

Java seems like a fairly difficult constraint for such a task. Is that a hard requirement? Scripting languages are ideal for building what is really lots of last-mile code.

If you're open to it, ruby + hpricot makes this completely trivial. You can use CSS or XPath selectors (or both) to find and manipulate content in HTML. Grabbing the document, parsing it, and extracting the text in your example is literally one line of code.

Dustin