ansaurus

Question

How to get web content before visiting that web page

Answer 1

+4 A:

Idea: Open the URL as a stream, then use an HTML parser to extract the content of its description meta tag.

Grab URL content:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

URL url = new URL("http://www.url-to-be-parsed.com/page.html");
BufferedReader in = new BufferedReader(
        new InputStreamReader(url.openStream()));

You will need to adapt the code above depending on what your HTML parser library expects (a stream, a string, etc.).
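If your parser wants the whole page as a String, the reader opened above still has to be drained. Here is a minimal sketch using only the JDK. The class and method names (PageFetcher, readAll, fetch) are made up for illustration, and the UTF-8 charset is an assumption, since a real page may declare a different encoding.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
class PageFetcher {

    // Reads everything from the given reader into one String.
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[4096];
        int n;
        while ((n = reader.read(buffer)) != -1) {
            sb.append(buffer, 0, n);
        }
        return sb.toString();
    }

    // Downloads the page at the given address and returns its HTML.
    // Assumption: the page is UTF-8 encoded.
    static String fetch(String address) throws IOException {
        URL url = new URL(address);
        // try-with-resources closes the stream even if reading fails
        try (Reader in = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            return readAll(in);
        }
    }
}
```

Calling PageFetcher.fetch("http://www.url-to-be-parsed.com/page.html") would return the page source, which can then be handed to whatever HTML parser you use.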

HTML-Parse the tags:

<meta name="description" content="This is a place where webmasters can put a description about this web page" />

You might also be interested in grabbing the title of that page:

<title>This is the title of the page!</title>

Caution: Regular expressions are not reliable on HTML documents, so an HTML parser is the better choice.

An example with HTML Parser:

Use HasAttributeFilter to select the tags that have a name="description" attribute.
Cast the resulting Node to MetaTag.
Get the content using MetaTag.getAttribute("content").

Code:

import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.tags.MetaTag;

public class HTMLParserTest {
    public static void main(String... args) {
        Parser parser = new Parser();
        // <meta name="description" content="Some text about the site." />
        HasAttributeFilter filter = new HasAttributeFilter("name", "description");
        try {
            parser.setResource("http://www.youtube.com");
            NodeList list = parser.parse(filter);
            Node node = list.elementAt(0);

            if (node instanceof MetaTag) {
                MetaTag meta = (MetaTag) node;
                String description = meta.getAttribute("content");

                System.out.println(description);
                // Prints: "YouTube is a place to discover, watch, upload and share videos."
            }

        } catch (ParserException e) {
            e.printStackTrace();
        }
    }

}

Considerations:

If this is done in a JSP each time the page is loaded, the network I/O to the URL may slow the page down. It gets worse if you do this on the fly for a page of yours that contains many links: the n URLs are fetched sequentially, so the slowdown could be massive. Consider storing this information in a database and refreshing it as needed instead of fetching it on the fly in the JSPs.
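To illustrate the "store and refresh as needed" idea, here is a rough sketch that uses an in-memory map as a stand-in for the database. DescriptionCache, its fetcher parameter, and maxAgeMillis are all hypothetical names. The fetcher would be whatever code actually downloads and parses the page, for example the HTML Parser code above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical cache: keeps each description for a limited time so that
// the remote page is not fetched on every page load.
class DescriptionCache {

    // A cached description plus the time it was fetched.
    private static final class Entry {
        final String description;
        final long fetchedAtMillis;

        Entry(String description, long fetchedAtMillis) {
            this.description = description;
            this.fetchedAtMillis = fetchedAtMillis;
        }
    }

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();
    private final Function<String, String> fetcher; // does the slow network work
    private final long maxAgeMillis;                // how long an entry stays fresh

    DescriptionCache(Function<String, String> fetcher, long maxAgeMillis) {
        this.fetcher = fetcher;
        this.maxAgeMillis = maxAgeMillis;
    }

    // Returns the cached description if it is still fresh, otherwise
    // fetches it again and stores the new value.
    String get(String url) {
        long now = System.currentTimeMillis();
        Entry cached = entries.get(url);
        if (cached != null && now - cached.fetchedAtMillis < maxAgeMillis) {
            return cached.description;
        }
        String fresh = fetcher.apply(url);
        entries.put(url, new Entry(fresh, now));
        return fresh;
    }
}
```

With something like this, a JSP would call cache.get(url) for each link and only pay the network cost when an entry is missing or older than maxAgeMillis. A database-backed version would follow the same pattern, with the map replaced by a table holding the URL, the description, and a timestamp.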

Bakkal 2010-06-30 05:42:16

::Thank you very much for your reply. I want to extract the content of the meta tag. I'm using HTML Parser (http://htmlparser.sourceforge.net/samples.html). Could you please help me?

udayalkonline 2010-06-30 15:29:02

There you go. Took me a while to make my way around their API. Seems to work fine as it is. As I will be using this too, I'll update if I find more efficient ways.

Bakkal 2010-06-30 17:13:20

::What a nice answer, bro! Thank you very much. Have a nice day!

udayalkonline 2010-06-30 17:47:48

::One more question, bro. Is it possible to get the value of the title as well, based on your answer? I tried it based on your answer but still couldn't get the result. Any idea? Thanks in advance!

udayalkonline 2010-07-09 09:16:05
