ansaurus

Question

How do you grab text from a web page (Java)?

Answer 1

A:

In short, you can either parse the whole page and pick out the things you need (for speed I recommend looking at SAXParser), or run the HTML through a regexp that strips out all of the markup. You can also convert it all into a DOM, but that is going to be expensive, especially if you are aiming for decent throughput.
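As a rough illustration of the regexp route, here is a minimal sketch using only the JDK. The class and method names (TagStripper, stripTags) are made up for this example, and it assumes reasonably simple markup: it does not decode entities such as &amp;amp; and can be fooled by a ">" inside an attribute value, so treat it as a quick-and-dirty approach rather than a real parser.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of any library.
final class TagStripper {
    // <script> and <style> elements together with their content
    // (case-insensitive, and '.' also matches line breaks).
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");
    // Any remaining tag: everything from '<' up to the next '>'.
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]*>");
    // Runs of whitespace, collapsed to a single space at the end.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String stripTags(String html) {
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<p>Hello <b>world</b></p>"));
    }
}
```

Each tag is replaced by a space rather than by nothing, so that words in adjacent elements do not run together.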

2008-09-16 11:51:56

Answer 2

A:

You seem to want to screen scrape. You would probably want to write a framework with an adapter/plugin per source site (as each site's format will differ) that parses the HTML source and extracts the text. You would probably use Java's I/O API to connect to the URL and stream the data via InputStreams.
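One possible shape for the adapter-per-site idea is sketched below. Every name in it (SiteAdapter, AdapterRegistry, TitleAdapter) is hypothetical, and the toy adapter only pulls out the page title with plain string searches; a real adapter would use a proper HTML parser.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical plugin contract: one implementation per source site. */
interface SiteAdapter {
    String extractText(String html);
}

/** Toy adapter that returns the text between <title> and </title>. */
final class TitleAdapter implements SiteAdapter {
    @Override
    public String extractText(String html) {
        String open = "<title>";
        int start = html.indexOf(open);
        if (start < 0) {
            return "";
        }
        start += open.length();
        int end = html.indexOf("</title>", start);
        return end < 0 ? "" : html.substring(start, end).trim();
    }
}

/** Looks up the adapter registered for a host name (case-insensitive). */
final class AdapterRegistry {
    private final Map<String, SiteAdapter> adaptersByHost = new HashMap<>();

    void register(String host, SiteAdapter adapter) {
        adaptersByHost.put(host.toLowerCase(Locale.ROOT), adapter);
    }

    Optional<SiteAdapter> forHost(String host) {
        return Optional.ofNullable(adaptersByHost.get(host.toLowerCase(Locale.ROOT)));
    }

    public static void main(String[] args) {
        AdapterRegistry registry = new AdapterRegistry();
        registry.register("example.com", new TitleAdapter());
        String html = "<html><head><title>Front page</title></head></html>";
        registry.forHost("example.com")
                .ifPresent(adapter -> System.out.println(adapter.extractText(html)));
    }
}
```

Keeping the site-specific logic behind one small interface means a layout change on one site only touches that site's adapter.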

2008-09-16 11:52:13

Answer 3

+2 A:

You could look at how HttpUnit does it. It uses a couple of decent HTML parsers; one is NekoHTML. As for getting the data, you can use what is built into the JDK (HttpURLConnection), or use Apache's HttpClient:

http://hc.apache.org/httpclient-3.x/
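For the JDK-only option, a minimal sketch with HttpURLConnection might look like this. PageFetcher, readAll and fetch are names invented for the example, the timeouts are arbitrary, and it assumes the page is UTF-8; a more careful client would take the charset from the Content-Type header.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration.
final class PageFetcher {
    /** Reads the stream to the end and decodes it with the given charset. */
    static String readAll(InputStream in, Charset charset) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), charset);
    }

    /** Fetches the body of the page at the given http(s) address. */
    static String fetch(String address) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(address).toURL().openConnection();
        conn.setConnectTimeout(10_000); // milliseconds, arbitrary choice
        conn.setReadTimeout(10_000);
        try (InputStream in = conn.getInputStream()) {
            // Assumption: the response body is UTF-8 encoded.
            return readAll(in, StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(fetch(args.length > 0 ? args[0] : "http://example.com/"));
    }
}
```

The returned string is the raw HTML, ready to be handed to whichever parser you pick.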

James Law 2008-09-16 11:54:49

Neko sounds good — I like cats :)

AnSGri 2008-09-16 11:59:27

Answer 4

+3 A:

You can use HTMLParser (http://htmlparser.sourceforge.net/) in combination with URL#openStream() (or URLConnection#getInputStream()) to parse the content of HTML pages hosted on the Internet.

jatanp 2008-09-16 11:57:03

Answer 5

A:

If you want to do it the old-fashioned way, you need to open a socket to the web server's port and then send the following data:

GET /file.html HTTP/1.0
Host: site.com
<ENTER>
<ENTER>

Then call Socket#getInputStream, read the data using a BufferedReader, and parse it using whatever you like.
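The steps above could be sketched roughly as follows. RawHttpGet, buildRequest and fetch are made-up names, and the host and path are the placeholders from the request shown above. Note that this handles only plain HTTP (no HTTPS, no redirects), and that the returned text includes the status line and response headers as well as the body.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical class for illustration.
final class RawHttpGet {
    /** Builds the request text; each <ENTER> is a CR LF pair on the wire. */
    static String buildRequest(String host, String path) {
        return "GET " + path + " HTTP/1.0\r\n"
                + "Host: " + host + "\r\n"
                + "\r\n"; // the empty line ends the request headers
    }

    /** Sends the request and returns everything the server sends back. */
    static String fetch(String host, int port, String path) throws IOException {
        try (Socket socket = new Socket(host, port)) {
            OutputStream out = socket.getOutputStream();
            out.write(buildRequest(host, path).getBytes(StandardCharsets.US_ASCII));
            out.flush();

            // Assumption: ISO-8859-1 is used so that any byte decodes without error;
            // the real body charset is announced in the response headers.
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(socket.getInputStream(), StandardCharsets.ISO_8859_1));
            StringBuilder response = new StringBuilder();
            String line;
            // With HTTP/1.0 the server closes the connection after the response,
            // so reading until end of stream collects all of it.
            while ((line = reader.readLine()) != null) {
                response.append(line).append('\n');
            }
            return response.toString();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(fetch("site.com", 80, "/file.html"));
    }
}
```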

Vhaerun 2008-09-16 12:06:24

Is it really the best way to get a page?

AnSGri 2008-09-16 12:15:48

Answer 6

+2 A:

If you want to take advantage of any structural or semantic markup, you might want to explore converting the HTML to XML and using XQuery to extract the information in a standard form. Take a look at this IBM developerWorks article for some typical code, excerpted below (they're outputting HTML, which is, of course, not required):

<table>
{
  for $d in //td[contains(a/small/text(), "New York, NY")]
  for $row in $d/parent::tr/parent::table/tr
  where contains($d/a/small/text()[1], "New York")
  return <tr><td>{data($row/td[1])}</td> 
           <td>{data($row/td[2])}</td>              
           <td>{$row/td[3]//img}</td> </tr>
}
</table>

Joe Liversedge 2008-09-16 12:25:35

Answer 7

A:

You can use NekoHTML to parse your HTML document. You will get a DOM document, and you can then use XPath to retrieve the data you need.
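To show the XPath half of this without pulling in a dependency, the sketch below builds the DOM with the JDK's own XML parser, which only works for well-formed markup; with NekoHTML you would get the Document from its parser instead and may need to adjust the expressions to the element names it produces. XPathExtractor, parse and select are names invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class for illustration.
final class XPathExtractor {
    /** Parses well-formed markup into a DOM (stand-in for an HTML parser). */
    static Document parse(String wellFormedMarkup) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(wellFormedMarkup)));
    }

    /** Returns the trimmed text content of every node matching the expression. */
    static List<String> select(Document doc, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getTextContent().trim());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(
                "<html><body><p class=\"intro\">Hello</p><p>World</p></body></html>");
        System.out.println(select(doc, "//p[@class='intro']"));
    }
}
```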

Alexandre Victoor 2008-09-16 12:31:41

Answer 8

A:

If your "web sources" are regular websites using HTML (as opposed to a structured XML format like RSS), I would suggest taking a look at HtmlUnit.

This library, while targeted at testing, is really a general-purpose "Java browser". It is built on Apache HttpClient, the NekoHTML parser, and Rhino for JavaScript support. It provides a really nice API to the web page and makes it easy to traverse a website.

Maxim 2008-09-16 13:05:42

Answer 9

A:

Have you considered taking advantage of RSS/Atom feeds? Why scrape the content when it's usually available for you in a consumable format? There are libraries available for consuming RSS in just about any language you can think of, and it'll be a lot less dependent on the markup of the page than attempting to scrape the content.
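To give an idea of how little code a feed needs compared with scraping, here is a minimal sketch that lists the item titles of an RSS 2.0 document using only the JDK's DOM parser. RssTitles and itemTitles are made-up names, the sample feed is invented, and a dedicated feed library would also handle Atom, namespaces and malformed feeds.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class for illustration.
final class RssTitles {
    /** Returns the <title> of every <item> in an RSS 2.0 document. */
    static List<String> itemTitles(String rssXml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(rssXml)));

        List<String> titles = new ArrayList<>();
        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            // Look for the title inside this item only, so the
            // channel-level <title> is not picked up.
            NodeList titleNodes = item.getElementsByTagName("title");
            if (titleNodes.getLength() > 0) {
                titles.add(titleNodes.item(0).getTextContent().trim());
            }
        }
        return titles;
    }

    public static void main(String[] args) throws Exception {
        String feed = "<rss version=\"2.0\"><channel><title>Feed</title>"
                + "<item><title>First</title></item>"
                + "<item><title>Second</title></item>"
                + "</channel></rss>";
        System.out.println(itemTitles(feed));
    }
}
```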

If you absolutely MUST scrape content, look for microformats in the markup; most blogs (especially WordPress-based blogs) have them by default. There are also libraries and parsers available for locating and extracting microformats from web pages.

Finally, aggregation services/applications such as Yahoo Pipes may be able to do this work for you without reinventing the wheel.

Eric DeLabar 2008-09-16 13:12:29

Answer 10

+1 A:

I agree with Maxim; HtmlUnit is very useful for this kind of work.

PaulF 2008-09-16 13:20:13
