Hi all, I've been using the sample program at the URL below:

http://jericho.htmlparser.net/samples/console/src/ExtractText.java

My goal is to extract the main body text of a web page so that I can summarize it and present the summary to the user.

My problem is that I'm not sure how to modify the above program so that it returns only the required text from the web page, without the links or any other information.

I'd really appreciate any help I could get.

Thanks in advance.

I modified part of the code as follows:

    System.out.println("\nThis time extend the TextExtractor class to only include text from P elements");
    TextExtractor textExtractor1=new TextExtractor(source) {
        public boolean excludeElement(StartTag startTag) {
            // The negation is meant to exclude every element except those enclosed in <p> tags.
            return !(startTag.getName()==HTMLElementName.P);
        }
    };
    System.out.println(textExtractor1.setIncludeAttributes(true).toString());

but it didn't really do anything. Could someone help me out here?