OK so I redefined my last program... here it is:

import java.io.BufferedReader; 
import java.io.InputStreamReader;
import java.net.URL; 
import java.net.URLConnection;


public class asp {
    public static void main(String[] args) {
        try {
            URL game = new URL("http://localhost/mystikrpg/post.php?players");
            URLConnection connection = game.openConnection();
            BufferedReader in = new BufferedReader(new
            InputStreamReader(connection.getInputStream()));
            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                System.out.println(inputLine);
            }
            in.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

The problem? When I run it... I get the WHOLE page... EVEN THE CODE SOURCE such as the beginning of the html tag all the way to the end of the body and html tag.

What I really want it to output is just the 1... The only way I can see to do that is to split the string between <body> and </body>...

Meh. Help?

+5  A: 

When you request a page you get the source. This is what's expected and normal. You'll have to parse this source to extract the content.

kobrien
+1  A: 

When you retrieve a web page, what the server sends you is everything between the HTML tags, and more.

I think what you are looking for is an HTML parser, which will let you extract content from the web page. First you retrieve the web page as you are currently doing, then you run the output through the parser and tell it which part to extract.

Here are some HTML parsers:

Jeff
+7  A: 

Unless you have control over post.php and are able to make it return just what you need without the HTML tags (a la web services), you will have to parse the HTML document returned by it.

Use an HTML parser; regular expressions are not very reliable for this.


Rough Snippet to parse the <body> tag with HTMLParser:

(Make sure to include htmlparser.jar)

import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;    
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.tags.BodyTag;    

public class HTMLParserTest {   
    public static String grabBodyTag (String url) {
        if(!url.startsWith("http://")){url = "http://" + url;}      
        Parser parser = new Parser();               
        TagNameFilter filter = new TagNameFilter("body");       
        try {
            parser.setResource(url);
            NodeList list = parser.parse(filter);
            Node node = list.elementAt(0);          
            if (node instanceof BodyTag) {
                BodyTag tag = (BodyTag) node;
                return   tag.toPlainTextString(); //other formats are available
            }
        } catch (ParserException e) {
            e.printStackTrace();
        }       
        return "found no body tag...";
    }   
    public static void main(String... args){
        System.out.println(grabBodyTag("google.com"));
    }

}

For google.com this gives a String with "Web Images Videos Maps News Books Gmail more..." [omitted]. In your case it will return a String containing "1", possibly surrounded by whitespace (as your pastebin shows), so you have to trim it before converting it to a number.
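The trim-then-convert step could look roughly like the sketch below. The class and method names (BodyTextToNumber, toInt) are made up for illustration and are not part of HTMLParser; the sketch also assumes the value is a plain integer that fits in an int.

```java
import java.util.OptionalInt;

final class BodyTextToNumber {
    /** Trims the extracted text and parses it; returns empty if it is not a whole number. */
    static OptionalInt toInt(String bodyText) {
        try {
            // trim() drops leading/trailing spaces, tabs and newlines before parsing
            return OptionalInt.of(Integer.parseInt(bodyText.trim()));
        } catch (NumberFormatException e) {
            // e.g. the "found no body tag..." fallback string from grabBodyTag
            return OptionalInt.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(toInt(" 1 \n"));
        System.out.println(toInt("found no body tag..."));
    }
}
```

Returning an OptionalInt instead of throwing is only one possible choice; letting the NumberFormatException propagate would work just as well if a non-numeric body should be treated as an error.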

Closing note: if you don't need that script for anything other than returning this result, a post.php containing only the following code will make your life much easier.

<?php
$number = 1; // or whatever logic is needed to get it.
echo $number;
?>
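With a post.php like that, the Java side needs no HTML parsing at all. A minimal sketch, assuming the script echoes only the number and that the response is UTF-8 text (the class and method names here are hypothetical):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

final class PlainTextFetch {
    /** Reads the whole response body as UTF-8 text and strips surrounding whitespace. */
    static String fetchText(URL url) throws IOException {
        StringBuilder body = new StringBuilder();
        // try-with-resources closes the reader (and its stream) even if reading fails
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        }
        return body.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        // URL taken from the question; post.php is assumed to echo only the number
        URL game = new URL("http://localhost/mystikrpg/post.php?players");
        System.out.println(fetchText(game));
    }
}
```

The returned String can then be handed straight to Integer.parseInt.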
Bakkal
+1 for mentioning having control over post.php. The OP certainly can make life a lot easier on himself if he just dumps text as a result of the request.
Tim Bender
Yes I have control over post.php but uh... ok. This example you gave me.. did not work. I get things like Parser parser not found.
Dan
@Dan do you understand that to use a library (HTMLParser) you have to include it in your project to be "found"? That snippet is what I use and it most certainly **works** to grab `body` tags. If you have control over `post.php` just do `echo "result in text format";` without the rest of the HTML document, and you will get it as a `String` with your URL connection snippet.
Bakkal
Yeah I included it in the JAR file but I'll just go ahead and not include any html tags and just PHP. Thanks.
Dan
+2  A: 

Scraping stuff out of an HTML-formatted response is unpleasant, and can make your code fragile.

Maybe the webapp / website you are trying to talk to has other ways to deliver the responses, e.g. in XML or JSON format.

Getting responses in an alternative format might entail setting an appropriate Accept header on the HTTP request, adding some extra parameter to the query, or changing the path.

  • Check the web API documentation for the webapp / website to see if there is any mention of this.
  • Or check the webapp source code ... if you have it.
  • Or if this is your code, consider changing it to support XML, JSON or even ad hoc text responses. (If you take this route, it would be a good idea to read up on media types and set the appropriate one in the "Content-Type" header of your responses.)
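On the client side, asking for an alternative format via the Accept header might look like the sketch below. Whether post.php (or any other server) actually honours the header is an assumption; the class and method names are hypothetical, and "text/plain" is just an example media type.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

final class AcceptHeaderSketch {
    /** Opens (but does not yet send) a request that asks for the given media type. */
    static HttpURLConnection openWithAccept(URL url, String mediaType) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        // Request headers must be set before the connection is actually made
        connection.setRequestProperty("Accept", mediaType);
        return connection;
    }

    public static void main(String[] args) throws IOException {
        // URL taken from the question; the server is assumed to support text/plain
        URL game = new URL("http://localhost/mystikrpg/post.php?players");
        HttpURLConnection connection = openWithAccept(game, "text/plain");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        } finally {
            connection.disconnect();
        }
    }
}
```

A server that ignores the header will simply keep sending HTML, so it is worth checking the Content-Type of the response before parsing it.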
Stephen C
+1 for suggesting using a more appropriate format
Pascal Thivent
+10  A: 

The problem? When I run it... I get the WHOLE page... EVEN THE CODE SOURCE such as the beginning of the html tag all the way to the end of the body and html tag.

Well, that's basically what an HTML page is, so that's what you get. Now, if you don't want to parse the content manually, use an HTML parser. There are many of them, but I would recommend Jsoup, one of the most elegant libraries available (clean and nice API, jQuery-like CSS selectors, non-verbose element iteration, etc). Demo:

import java.net.URL;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupDemo {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost/mystikrpg/post.php?players");
        Document doc = Jsoup.parse(url, 3*1000);

        String text = doc.body().text();

        System.out.println(text); // outputs 1
    }
}

Look Ma, no hands!

PS: As a side note, I must say that I agree with some other answers here: you should maybe consider producing something other than HTML, such as XML, JSON or even raw text (at least as an alternative to the HTML version if you really need it).

Pascal Thivent
Jsoup looks like a good library. Thanks for sharing.
James P.