I'm trying to figure out how to pull information from a web page and bring it into my Java program.

For example, if I know the exact page I want info from (for the sake of simplicity, a Best Buy item page), how would I get the info I need off that page, such as the title, price, and description?

What would this process even be called? I have no idea where to even begin researching this.

Edit: Okay, I'm running a test with Jsoup (the one posted by BalusC), but I keep getting this error:

Exception in thread "main" java.lang.NoSuchMethodError: java.util.LinkedList.peekFirst()Ljava/lang/Object;
at org.jsoup.parser.TokenQueue.consumeWord(TokenQueue.java:209)
at org.jsoup.parser.Parser.parseStartTag(Parser.java:117)
at org.jsoup.parser.Parser.parse(Parser.java:76)
at org.jsoup.parser.Parser.parse(Parser.java:51)
at org.jsoup.Jsoup.parse(Jsoup.java:28)
at org.jsoup.Jsoup.parse(Jsoup.java:56)
at test.main(test.java:12)

I do have Apache Commons.

+1  A: 

You may use an HTML parser (many useful links here: java html parser).

The process is called 'grabbing website content'. Search 'grab website content java' for further investigation.

Roman
A: 

Look into the cURL library. I've never used it in Java, but I'm sure there must be bindings for it. Basically, what you'll do is send a cURL request to whatever page you want to 'scrape'. The request will return a string with the source code to the page. From there, you will use regex to parse whatever data you want from the source code. That's generally how you are going to do it.
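
For the fetch step specifically, a cURL binding may not be needed: on Java 11 or later, the JDK's own `java.net.http.HttpClient` can return the page source as a string. A minimal sketch under that assumption; the class name `PageFetcher`, the 10-second timeout, and the `User-Agent` value are made up for illustration:

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical helper class; assumes Java 11+ for java.net.http.
class PageFetcher {

    // Builds a plain GET request with a timeout.
    // The User-Agent value is a placeholder, not a required or standard one.
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(10))
                .header("User-Agent", "example-scraper/0.1")
                .GET()
                .build();
    }

    // Sends the request and returns the response body (the page source) as a String.
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpResponse<String> response =
                client.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java PageFetcher.java <url>");
            return;
        }
        String html = fetch(args[0]);
        System.out.println(html.length() + " characters fetched");
    }
}
```

The returned string is only the raw source; how to pull data out of it is a separate choice.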

Nelson
[Don't use regex to parse HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454).
BalusC
+6  A: 

This is referred to as screen scraping; Wikipedia has an article on the more specific web scraping. It can be a major challenge because there's some ugly, messed-up, broken-if-not-for-browser-cleverness HTML out there, so good luck.

sblundy
A: 

You'd probably want to look at the HTML to see if you can find strings that are unique and near your text, then you can use line/char-offsets to get to the data.

This could be awkward in Java if there aren't any XML classes similar to the ones found in System.Xml.Linq in C#.

Kurru
+16  A: 

Use an HTML parser like Jsoup. I prefer it over the other HTML parsers available in Java because it supports jQuery-like CSS selectors. Also, its class representing a list of nodes, Elements, implements Iterable, so you can iterate over it in an enhanced for loop (there's no need to hassle with verbose Node and NodeList-like classes as in the average Java DOM parser).

Here's a basic kickoff example (just put the latest Jsoup JAR file in classpath):

package com.stackoverflow.q2835505;

import java.net.URL;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Test {

    public static void main(String[] args) throws Exception {
        // Fetch and parse the page, with a 3000 ms timeout.
        URL url = new URL("http://stackoverflow.com/questions/2835505");
        Document document = Jsoup.parse(url, 3000);

        // Select the question body with a CSS selector and print its text.
        String question = document.select("#question .post-text").text();
        System.out.println("Question: " + question);

        // Select each answerer's profile link and print the names.
        Elements answerers = document.select("#answers .user-details a");
        for (Element answerer : answerers) {
            System.out.println("Answerer: " + answerer.text());
        }
    }

}

As you might have guessed, this prints your own question and the names of all answerers.

BalusC
That's a pretty slick package.
JSBangs
JSoup looks awesome. Seriously.
Esko
Wow, this is nice! I have a question though: I copied and pasted this for a test run, but I keep getting this error (see the edited OP).
James
@James: This requires at least Java 1.6 (which has already been out for over 3 years). The [`LinkedList#peekFirst()`](http://java.sun.com/javase/6/docs/api/java/util/LinkedList.html#peekFirst%28%29) method mentioned in the error was introduced in Java 1.6. Upgrade your JVM (JDK) or configure your IDE (Eclipse?) to Java 6 compliance mode.
BalusC
Thanks, BalusC!!
James
You're welcome.
BalusC
If any .NET programmers are interested, I've ported jsoup to .NET: http://nsoup.codeplex.com/. Hope this helps anyone.
GeReV
+2  A: 

The process itself is normally called "scraping." You can use a parser like TagSoup to process the page once you've fetched it.

Bill the Lizard
+3  A: 

I would use JTidy. It is similar to JSoup, but I don't know JSoup well. JTidy handles broken HTML and returns a W3C Document, so you can use this as a source for XSLT to extract the content you are really interested in. If you don't know XSLT, then you might as well go with JSoup, as its Document model is nicer to work with than the W3C one.

EDIT: A quick look at the JSoup website shows that JSoup may indeed be the better choice. It seems to support CSS selectors out of the box for extracting stuff from the document. This may be a lot easier to work with than getting into XSLT.
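
To give a feel for the W3C Document route, here is a minimal sketch that uses only JDK classes. It makes two assumptions: the markup is a hypothetical, already well-formed snippet standing in for JTidy's cleaned-up output (the JDK parser would reject real-world broken HTML), and it queries the Document with plain XPath rather than a full XSLT stylesheet. The class name `W3cExtract` and the element ids are invented for illustration:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical example class; not part of JTidy or Jsoup.
class W3cExtract {

    // Made-up, well-formed markup standing in for a tidied product page.
    static final String PAGE =
            "<html><body>"
            + "<h1 id='title'>Example Widget</h1>"
            + "<span class='price'>$19.99</span>"
            + "</body></html>";

    // Parses well-formed markup into a W3C Document.
    // For broken HTML, a cleaning parser such as JTidy would have to do this step instead.
    static Document parse(String xhtml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xhtml)));
    }

    // Evaluates an XPath expression and returns the text of the first matching node.
    static String extract(Document doc, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, doc);
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(PAGE);
        System.out.println("Title: " + extract(doc, "//h1[@id='title']"));
        System.out.println("Price: " + extract(doc, "//span[@class='price']"));
    }
}
```

The XPath expressions play the role that CSS selectors play in JSoup; the W3C API is more verbose but ships with the JDK.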

mdma
A: 

The JSoup solution is great, but if you need to extract just something really simple, it may be easier to use regex or String.indexOf.

As others have already mentioned, the process is called scraping.

Anton
Why would it be easier to use regex? I have tried regex and it really can't handle real-life HTML, and it's possibly dangerous to use it to parse HTML. Jsoup is an out-of-the-box solution: just a few lines of code and you can do whatever you need to do with your HTML.
newbie
Oversimplified example - Imagine all you want is to extract the date the page was generated. So you check the html and see something like `<span id='date'>07/07/07</span>`. Well, then I would use String.indexOf or some of my own utilities like textBetween("<span id='date'>", "</span>"). An added benefit is that you don't have to parse the whole html. I've had success extracting data from html with a home-grown StringScanner class with methods like moveBefore(String what), moveAfter(String what), getTextUpTo(String what), ... It all depends on how complicated your problem is.
Anton
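
For reference, a helper along the lines of the `textBetween` idea in the comment above could be sketched with `String.indexOf` as follows. This is not Anton's actual utility: the three-argument signature, the class name `TextBetween`, and the choice to return null when a marker is missing are assumptions made for illustration.

```java
// Hypothetical sketch of a textBetween-style helper built on String.indexOf.
class TextBetween {

    // Returns the text between the first occurrence of start and the next
    // occurrence of end after it, or null if either marker is missing.
    static String textBetween(String source, String start, String end) {
        int from = source.indexOf(start);
        if (from < 0) {
            return null;
        }
        from += start.length();
        int to = source.indexOf(end, from);
        if (to < 0) {
            return null;
        }
        return source.substring(from, to);
    }

    public static void main(String[] args) {
        String html = "<p>Generated on <span id='date'>07/07/07</span></p>";
        System.out.println(textBetween(html, "<span id='date'>", "</span>"));
    }
}
```

This only works while the surrounding markup stays byte-for-byte the same; a change in quoting or attribute order on the page breaks the match, which is the trade-off against a real parser.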