ansaurus

Question

Java - Read a website and NOT the source

Answer 1

+5 A:

When you request a page you get the source. This is what's expected and normal. You'll have to parse this source to extract the content.

kobrien 2010-07-08 00:11:22

Answer 2

+1 A:

When you retrieve a web page, what the server sends you is everything between the HTML tags, and more.

I think what you are looking for is an HTML parser, which will let you extract content from the web page. First you retrieve the web page as you are currently doing, then you run the output through the parser, instructing it to extract the part that you want.

Here are some HTML parsers:

Swing HTML Parser - article shows how to use Java's Swing library to do some HTML parsing
HTML Parser
Java Mozilla HTML Parser
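
As a rough illustration of the retrieve-then-parse approach described above, here is a minimal sketch using the Swing parser that ships with the JDK (the first option in the list). The class name SwingTextExtractor and the method extractText are made up for this example, and it assumes you already have the page source as a String:

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of any library.
final class SwingTextExtractor {

    /** Returns the text found in the given HTML source, with the tags dropped. */
    static String extractText(String html) {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called once per run of text between tags; join runs with a space.
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(data);
            }
        };
        try {
            // The third argument (true) tells the parser to ignore charset directives.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString().trim();
    }
}
```

You would call SwingTextExtractor.extractText(pageSource) on the String you already download. The Swing parser is lenient but dated, so one of the dedicated libraries is probably a better fit for anything beyond simple pages.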

Jeff 2010-07-08 00:15:37

Answer 3

+7 A:

Unless you have control over post.php and are able to make it return just what you need without the HTML tags (a la web services), you will have to parse the HTML document returned by it.

Use an HTML parser; regular expressions are not very reliable for this.

Rough Snippet to parse the <body> tag with HTMLParser:

(Make sure to include htmlparser.jar)

import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.tags.BodyTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class HTMLParserTest {

    // Returns the plain text inside the page's <body> tag.
    public static String grabBodyTag(String url) {
        // Prepend a scheme only when the caller did not supply one.
        if (!url.startsWith("http://") && !url.startsWith("https://")) {
            url = "http://" + url;
        }
        Parser parser = new Parser();
        TagNameFilter filter = new TagNameFilter("body");
        try {
            parser.setResource(url);
            NodeList list = parser.parse(filter); // keep only <body> nodes
            Node node = list.elementAt(0);
            if (node instanceof BodyTag) {
                BodyTag tag = (BodyTag) node;
                return tag.toPlainTextString(); // other formats are available
            }
        } catch (ParserException e) {
            e.printStackTrace();
        }
        return "found no body tag...";
    }

    public static void main(String... args) {
        System.out.println(grabBodyTag("google.com"));
    }
}

This gives a String with "Web Images Videos Maps News Books Gmail more..." [omitted]. In your case it will return a String containing "1", possibly surrounded by whitespace (as your pastebin shows), so you have to trim it before converting it to a number.
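
The trim-then-convert step could look roughly like this. It is only a sketch: the class name PlayerCount, the method parseCount and the choice of -1 as the "not a number" result are all assumptions made for illustration.

```java
// Hypothetical helper, not part of HTMLParser.
final class PlayerCount {

    /**
     * Converts body text such as "  1 \n" to an int.
     * Returns -1 (an arbitrary sentinel chosen for this sketch) when the
     * text is not a valid integer, e.g. the "found no body tag..." fallback.
     */
    static int parseCount(String bodyText) {
        try {
            // trim() removes leading and trailing whitespace, including newlines.
            return Integer.parseInt(bodyText.trim());
        } catch (NumberFormatException e) {
            return -1;
        }
    }
}
```

For example, you would pass it the result of grabBodyTag(...) from the snippet above.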

Closing note: if you don't need that script for anything other than returning this result, a post.php containing only the following code will make your life much easier.

<?php
$number = 1; // or whatever logic you use to get it.
echo $number;
?>
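
With a script like that, the Java side no longer needs a parser at all. Here is a minimal sketch of reading such a plain-text response using only the JDK; the names PlainTextFetch, readAll and fetchCount are hypothetical, and it assumes the response is UTF-8 text holding a single integer:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for a script that echoes just a number.
final class PlainTextFetch {

    /** Reads the whole stream as UTF-8 text and trims surrounding whitespace. */
    static String readAll(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString().trim();
    }

    /** Performs a real HTTP request and parses the response body as an int. */
    static int fetchCount(String url) {
        try (InputStream in = URI.create(url).toURL().openStream()) {
            return Integer.parseInt(readAll(in));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

You would call PlainTextFetch.fetchCount(...) with the address of your own post.php. Note that Integer.parseInt throws a NumberFormatException if the script ever outputs anything other than a number.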

Bakkal 2010-07-08 00:17:11

+1 for mentioning having control over post.php. The OP can certainly make life a lot easier for himself if he just dumps text as the result of the request.

Tim Bender 2010-07-08 00:41:47

Yes I have control over post.php but uh... ok. This example you gave me.. did not work. I get things like Parser parser not found.

Dan 2010-07-08 00:54:05

@Dan do you understand that to use a library (HTMLParser) you have to include it in your project to be "found"? That snippet is what I use and it most certainly **works** to grab `body` tags. If you have control over `post.php` just do `echo "result in text format";` without the rest of the HTML document, and you will get it as a `String` with your URL connection snippet.

Bakkal 2010-07-08 00:57:07

Yeah I included it in the JAR file but I'll just go ahead and not include any html tags and just PHP. Thanks.

Dan 2010-07-08 02:08:45

Answer 4

+2 A:

Scraping stuff out of an HTML-formatted response is unpleasant, and can make your code fragile.

Maybe the webapp / website you are trying to talk to has other ways to deliver the responses, e.g. in XML or JSON format.

Getting responses in an alternative format might entail setting an appropriate Accept header on the HTTP request, adding some extra parameter to the query, or changing the path.

Check the web API documentation for the webapp / website to see if there is any mention of this.
Or check the webapp source code ... if you have it.
Or if this is your code, consider changing it to support XML, JSON or even ad hoc text responses. (If you take this route, it would be a good idea to read up on media types and set the appropriate one in the "Content-Type" header of your responses.)
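
To make the Accept header option concrete, here is a minimal sketch using the JDK's HttpURLConnection. The class name AcceptHeaderRequest and the method prepare are invented for this example, and whether a given server actually honours the header is an assumption you would need to check against its documentation:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper showing how to ask for a specific media type.
final class AcceptHeaderRequest {

    /** Prepares (but does not send) a GET request asking for the given media type. */
    static HttpURLConnection prepare(String url, String mediaType) {
        try {
            // openConnection() only creates the object; no network traffic happens yet.
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestMethod("GET");
            conn.setRequestProperty("Accept", mediaType); // e.g. "application/json"
            conn.setConnectTimeout(3000); // milliseconds
            conn.setReadTimeout(3000);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The request is only sent once you call something like getInputStream() or getResponseCode() on the returned connection. At that point you can also inspect getContentType() to see which format the server chose to send back.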

Stephen C 2010-07-08 01:17:58

+1 for suggesting using a more appropriate format

Pascal Thivent 2010-07-08 02:39:30

Answer 5

+10 A:

The problem? When I run it... I get the WHOLE page... EVEN THE CODE SOURCE such as the beginning of the html tag all the way to the end of the body and html tag.

Well, that's basically what an HTML page is, so that's what you get. Now, if you don't want to parse the content manually, use an HTML parser. There are many of them, but I would recommend Jsoup, one of the most elegant libraries available (clean and nice API, jQuery-like CSS selectors, non-verbose element iteration, etc). Demo:

import java.net.URL;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupDemo {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost/mystikrpg/post.php?players");
        Document doc = Jsoup.parse(url, 3*1000); // fetch and parse, 3 second timeout

        String text = doc.body().text();

        System.out.println(text); // outputs 1
    }
}

Look Ma, no hands!

PS: As a side note, I must say that I agree with some other answers here: you should maybe consider producing something other than HTML, like XML, JSON or even raw text (at least as an alternative to the HTML version if you really need it).

Pascal Thivent 2010-07-08 01:33:11

Jsoup looks like a good library. Thanks for sharing.

James P. 2010-07-08 09:24:36
