Hi, how can I get the description/content of a web page for a given URL (something like the short description Google shows for each result link)? I want to do this in my JSP page.

Thanks in advance!

+4  A: 

Idea: Open the URL as a stream, then use an HTML parser to extract the content of its description meta tag.

Grab URL content:

URL url = new URL("http://www.url-to-be-parsed.com/page.html");
BufferedReader in = new BufferedReader(
        new InputStreamReader(url.openStream()));

You will need to tweak the above code depending on what your HTML parser library requires (a stream, a string, etc.).

Parse the HTML and look for this tag:
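If your parser wants the whole page as a String, the stream can be drained first. This is only a minimal sketch: the class name PageFetcher and its two helper methods are made up for illustration, and UTF-8 is an assumption (a real page may declare a different charset).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
class PageFetcher {

    // Drains any Reader into a single String.
    static String readAll(Reader reader) throws IOException {
        StringBuilder html = new StringBuilder();
        char[] buffer = new char[4096];
        int read;
        while ((read = reader.read(buffer)) != -1) {
            html.append(buffer, 0, read);
        }
        return html.toString();
    }

    // Fetches the page at the given address and returns its markup.
    // Assumption: the page is encoded as UTF-8.
    static String fetch(String address) throws IOException {
        URL url = new URL(address);
        try (Reader in = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            return readAll(in);
        }
    }
}
```

Usage would be something like String html = PageFetcher.fetch("http://www.url-to-be-parsed.com/page.html"); the try-with-resources block closes the stream even if reading fails.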

<meta name="description" content="This is a place where webmasters can put a description about this web page" />

You might also be interested in grabbing the title of that page:

<title>This is the title of the page!</title>

Caution: Regular expressions are not reliable on HTML documents, so an HTML parser is the better choice.
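If you would rather not add a dependency, one parser-based option is the HTML parser that ships with the JDK (javax.swing.text.html). The sketch below collects both the title and the description meta tag in one pass. Note that this swaps in a different parser from the HTML Parser library used elsewhere in this answer, and the class name PageSummary and its methods are made up for illustration. The Swing parser is fairly old, so treat this as a starting point rather than something that copes with every page.

```java
import java.io.IOException;
import java.io.Reader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical callback that remembers the <title> text and the
// content of the first <meta name="description"> tag it sees.
class PageSummary extends HTMLEditorKit.ParserCallback {
    private final StringBuilder title = new StringBuilder();
    private String description;
    private boolean inTitle;

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.TITLE) {
            inTitle = true;
        }
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        if (tag == HTML.Tag.TITLE) {
            inTitle = false;
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        // Only text between <title> and </title> is collected.
        if (inTitle) {
            title.append(data);
        }
    }

    @Override
    public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        // <meta> has no closing tag, so it is reported as a simple tag.
        if (tag != HTML.Tag.META || description != null) {
            return;
        }
        Object name = attrs.getAttribute(HTML.Attribute.NAME);
        if (name != null && "description".equalsIgnoreCase(name.toString())) {
            Object content = attrs.getAttribute(HTML.Attribute.CONTENT);
            description = (content == null) ? null : content.toString();
        }
    }

    String getTitle() {
        return title.toString().trim();
    }

    // Returns null when the page has no description meta tag.
    String getDescription() {
        return description;
    }

    // Parses the markup; 'true' tells the parser to ignore charset directives.
    static PageSummary parse(Reader html) throws IOException {
        PageSummary summary = new PageSummary();
        new ParserDelegator().parse(html, summary, true);
        return summary;
    }
}
```

You could feed it the BufferedReader from the snippet above, e.g. PageSummary.parse(in), and then read getTitle() and getDescription().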

An example with HTML Parser:

  1. Use HasAttributeFilter to select tags that have a name="description" attribute
  2. Cast the resulting Node to MetaTag
  3. Get the content using MetaTag.getAttribute()

Code:

import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.tags.MetaTag;

public class HTMLParserTest {
    public static void main(String... args) {
        Parser parser = new Parser();
        // Matches e.g. <meta name="description" content="Some text about the site." />
        HasAttributeFilter filter = new HasAttributeFilter("name", "description");
        try {
            parser.setResource("http://www.youtube.com");
            NodeList list = parser.parse(filter);
            Node node = list.elementAt(0);

            if (node instanceof MetaTag) {
                MetaTag meta = (MetaTag) node;
                String description = meta.getAttribute("content");

                System.out.println(description);
                // Prints: "YouTube is a place to discover, watch, upload and share videos."
            }

        } catch (ParserException e) {
            e.printStackTrace();
        }
    }

}

Considerations:

If this is done in a JSP each time the page is loaded, you may see a slowdown due to the network I/O for the URL. It gets worse if you do this on the fly for a page of yours that contains many links: the n URLs are fetched one after another, so the slowdown could be massive. Consider storing this information in a database and refreshing it as needed instead of fetching it on the fly in the JSPs.
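To illustrate the "store and refresh as needed" idea, here is a rough sketch that uses an in-memory map as a stand-in for the database. Everything in it is an assumption made for illustration: the class name DescriptionCache, the loader function (which would wrap whatever fetch-and-parse code you end up with) and the maximum age. A real setup would persist the entries and probably refresh them in a background job.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical cache: remembers the description per URL and only
// calls the (slow) loader again once an entry is older than maxAgeMillis.
class DescriptionCache {

    private static final class Entry {
        final String description;
        final long fetchedAtMillis;

        Entry(String description, long fetchedAtMillis) {
            this.description = description;
            this.fetchedAtMillis = fetchedAtMillis;
        }
    }

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();
    private final Function<String, String> loader;
    private final long maxAgeMillis;

    DescriptionCache(Function<String, String> loader, long maxAgeMillis) {
        this.loader = loader;
        this.maxAgeMillis = maxAgeMillis;
    }

    String get(String url) {
        return get(url, System.currentTimeMillis());
    }

    // The clock is passed in so the expiry logic can be exercised without waiting.
    String get(String url, long nowMillis) {
        Entry cached = entries.get(url);
        if (cached == null || nowMillis - cached.fetchedAtMillis >= maxAgeMillis) {
            // Missing or stale: hit the network once and remember the result.
            cached = new Entry(loader.apply(url), nowMillis);
            entries.put(url, cached);
        }
        return cached.description;
    }
}
```

The JSP would then call cache.get(url) and pay the network cost only for URLs it has not seen recently. Two requests that arrive at the same moment may both trigger a fetch; for a sketch that seemed an acceptable trade-off.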

Bakkal
::Thank you very much for your reply. I want to extract the content information of the meta tag. I'm using HTML Parser (http://htmlparser.sourceforge.net/samples.html). Could you please help me?
udayalkonline
There you go. Took me a while to make my way around their API. Seems to work fine as it is. As I will be using this too, I'll update if I find more efficient ways.
Bakkal
::What a nice answer, bro! Thank you very much. Have a nice day!
udayalkonline
::One more question, bro. Is it possible to get the value of the title as well, based on your answer? I tried it based on your answer but still couldn't get the result. Any idea? Thanks in advance!
udayalkonline