Hi, how can I get the title of a web page for a given URL using an HTML parser? It is possible with a regular expression, but I want to do it with an HTML parser. I'm working in the Eclipse IDE in a Java environment. I have tried the following code segment, but I still couldn't get the result.

Any ideas?

Thanks in advance!

import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
import org.htmlparser.tags.TitleTag;

public class TestHtml {

    public static void main(String... args) {
        Parser parser = new Parser();
        try {
            parser.setResource("http://www.yahoo.com/");
            NodeList list = parser.parse(null);
            Node node = list.elementAt(0);

            if (node instanceof TitleTag) {
                TitleTag title = (TitleTag) node;
                System.out.println(title.getText());
            }
        } catch (ParserException e) {
            e.printStackTrace();
        }
    }

}

A: 

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

Smart of you not to want to use a regex.

To use an HTML parser, we need to know which language you're using. Since you say you're "on eclipse", I'm going to assume Java.

Take a look at http://www.ibm.com/developerworks/xml/library/x-domjava/ for a description, overview, and various viewpoints.

Borealid
A: 

Well, assuming you're using Java (though most languages have an equivalent), you can use a SAX parser such as TagSoup, which transforms any HTML into XHTML, and in your handler you can do:

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;

public class MyHandler extends org.xml.sax.helpers.DefaultHandler {
    // true while the parser is between <title> and </title>
    boolean readTitle = false;
    StringBuilder title = new StringBuilder();

    public void startElement(String uri, String localName, String name,
                Attributes attributes) throws SAXException {
        if (localName.equals("title")) {
            readTitle = true;
        }
    }

    public void endElement(String uri, String localName, String name)
            throws SAXException {
        if (localName.equals("title")) {
            readTitle = false;
        }
    }

    public void characters(char[] ch, int start, int length)
            throws SAXException {
        // text may arrive in several chunks, so append rather than assign
        if (readTitle) title.append(ch, start, length);
    }
}

Then use it with your parser (example with TagSoup):

org.ccil.cowan.tagsoup.Parser parser = new org.ccil.cowan.tagsoup.Parser();
MyHandler handler = new MyHandler();
parser.setContentHandler(handler);
// inputStream: an InputStream over your HTML file
parser.parse(new org.xml.sax.InputSource(inputStream));
return handler.title.toString();
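
For readers who want to try the handler idea without adding a library, here is a minimal sketch that uses the SAX parser bundled with the JDK instead of TagSoup. The class name TitleSaxSketch and the method extractTitle are made up for this illustration. It assumes the input is already well-formed XHTML, because the JDK parser rejects ordinary tag-soup HTML, which is exactly the case TagSoup is meant to handle. Note that the JDK factory is not namespace-aware by default, so localName would be empty unless setNamespaceAware(true) is called.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class name, used only for this sketch.
final class TitleSaxSketch {

    /** Collects the text found between <title> and </title>. */
    static final class TitleHandler extends DefaultHandler {
        private boolean inTitle = false;
        final StringBuilder title = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes attributes) {
            if ("title".equalsIgnoreCase(localName)) {
                inTitle = true;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("title".equalsIgnoreCase(localName)) {
                inTitle = false;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // The parser may deliver the text in several chunks.
            if (inTitle) {
                title.append(ch, start, length);
            }
        }
    }

    /** Assumption: xhtml is well-formed XML; real-world HTML needs TagSoup or similar. */
    static String extractTitle(String xhtml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // otherwise localName is empty
        SAXParser parser = factory.newSAXParser();
        TitleHandler handler = new TitleHandler();
        parser.parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.title.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><head><title>Example page</title></head>"
                + "<body><p>Hello</p></body></html>";
        System.out.println(extractTitle(page)); // prints "Example page"
    }
}
```

Swapping the JDK parser for TagSoup should only change how the parser object is created; the handler itself stays the same.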
Vinze
I tried the following code segment, but I still couldn't get the result: public class TestParser { public static void main(String... args) { try { Parser parser = new Parser(); parser.setResource("http://www.youtube.com"); NodeList list = parser.parse(null); Node node = list.elementAt(0); if (node instanceof TitleTag) { TitleTag title = (TitleTag) node; System.out.println(title.getText()); } } catch (ParserException e) { e.printStackTrace(); } } }
udayalkonline
You should put this in your question and state which language and which libraries you use (and maybe add the corresponding tags). It is easier to get an answer when the question is less vague.
Vinze
I have edited my question. If you can give any ideas or corrections, that would help me a lot. Thanks!
udayalkonline
Added another answer that corresponds to the newly defined question.
Vinze
+1  A: 

According to your (redefined) question, the problem is that you only check the first node (Node node = list.elementAt(0);), while you should iterate over the list to find the title, which is not the first node. You could also pass a NodeFilter to parse() so that it returns only the TitleTag; the title would then be the first element and you wouldn't have to iterate.
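
As a rough sketch of the filter approach, assuming the HTMLParser jar (org.htmlparser) is on the classpath: the class name TitleFilterSketch and the helper extractTitle are invented for this example, and NodeClassFilter is used here as a ready-made filter in place of a hand-written NodeFilter. The call extractAllNodesThatMatch searches nested nodes too, so the title is found even though it is not a top-level node.

```java
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.tags.TitleTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

// Hypothetical class name, used only for this sketch.
final class TitleFilterSketch {

    /** Returns the text of the first <title> tag, or null if the page has none. */
    static String extractTitle(Parser parser) throws ParserException {
        // Keep only TitleTag nodes, wherever they sit in the tree.
        NodeList titles = parser.extractAllNodesThatMatch(
                new NodeClassFilter(TitleTag.class));
        if (titles.size() == 0) {
            return null;
        }
        return ((TitleTag) titles.elementAt(0)).getTitle();
    }

    public static void main(String[] args) throws ParserException {
        Parser parser = new Parser();
        parser.setResource("http://www.yahoo.com/"); // URL taken from the question
        System.out.println(extractTitle(parser));
    }
}
```

This uses getTitle() rather than getText(), on the assumption that the goal is the text between the title tags and not the tag's own markup.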

Vinze
Yep, I know, but I still couldn't work out how to filter for the TitleTag. Any ideas? Thanks!
udayalkonline
I've never used that library, but it should follow the usual pattern, something like: new NodeFilter() { public boolean accept(Node node) { return node instanceof TitleTag; } }
Vinze
Thanks a lot, bro. I got the result based on your answer. Have a nice day!
udayalkonline
A: 

BTW, there is already a very simple title extractor that ships with HTMLParser. You can use that: http://htmlparser.sourceforge.net/samples.html

To run it from within the HTMLParser code base, run:

bin/parser http://website_url TITLE

or run

java -jar <path to htmlparser.jar> http://website_url TITLE

or call this method from your code:

org.htmlparser.Parser.main(String[] args)

with the parameters new String[] {"<website url>", "TITLE"}.

madhurtanwani