Hey friends,
How can we find the number of hyperlinks in a page, and how can we find out what they all are? I need to build this in plain Java, not in any framework, which means using only the java.net.* classes and the like. Is that possible? How can I do it?
Can you give me a proper example?

I need to get all the links in the page and save them in the database, each link with its domain name.

+2  A: 

The best option is to use an HTML parser library. If you don't want to use a third-party library, you can try matching with a regular expression, using Java's Pattern and Matcher classes from the java.util.regex package.

Edit Example:

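import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Lookbehind for href=" and lookahead for the closing quote, so the match is the bare URL.
// (No leading \b: it would reject URLs that start with '/' or '#'.)
String regex = "(?<=href=\")[^\"]*(?=\")";
Pattern pattern = Pattern.compile(regex);

Matcher m = pattern.matcher(str_YourHtmlHere);
while (m.find()) {
  System.out.println("FOUND: " + m.group());
}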

The example above uses a simple regex that finds every link given by a double-quoted href attribute. You may have to enhance the regex to handle all scenarios correctly, such as an href whose URL is in single quotes.
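
As a rough sketch of the enhancement mentioned above, the pattern below accepts either quote style, ignores case, and allows spaces around the equals sign. The class name HrefRegexDemo, the method extractLinks, and the sample HTML are made up for illustration. It still matches any href attribute (for example on a link tag), so treat it as a starting point rather than a full HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefRegexDemo {
    // Matches href="..." or href='...', case-insensitively, with optional
    // whitespace around '='. Group 1 is the opening quote; \1 requires the
    // same quote to close the value. Group 2 is the attribute value.
    private static final Pattern HREF = Pattern.compile(
            "href\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns every href value found in the given HTML text.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        // Sample input (an assumption); in practice pass the downloaded page source.
        String html = "<a href=\"http://example.com/a\">A</a>"
                + "<a HREF='/b.html'>B</a>";
        List<String> links = extractLinks(html);
        System.out.println("number of links: " + links.size());
        for (String link : links) {
            System.out.println("FOUND: " + link);
        }
    }
}
```

The size of the returned list gives the number of links, which covers the counting part of the question.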

Gopi
@Gopi: Thanks for the info. Can you give me an example?
Alex Mathew
Edited to add example
Gopi
Can you give me a full example, including the imports and so on?
Alex Mathew
+4  A: 

You can use the javax.swing.text.html and javax.swing.text.html.parser packages to achieve this:

import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class Test {
   public static void main(String[] args) throws Exception  {
      URL u = new URL(args[0]);

      // try-with-resources closes the reader and the underlying stream
      try (InputStream in = u.openStream();
           Reader r = new InputStreamReader(in)) {
         ParserDelegator hp = new ParserDelegator();
         hp.parse(r, new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
               // Only <a> tags are of interest here
               if(t == HTML.Tag.A)  {
                  // Known attributes are keyed by HTML.Attribute constants
                  Object href = a.getAttribute(HTML.Attribute.HREF);
                  if(href != null) {
                     System.out.println(href);
                  }
               }
            }
         }, true); // true: ignore the charset declared in the document
      }
   }
}

Compile and call it this way:

java Test http://www.oracle.com/technetwork/java/index.html
naikus
@Naikus: It's not working. It prints "Found the A Tag!!! a" but does not show the HTML.
Alex Mathew
@Naikus: Sorry, not the HTML. It's not showing the link.
Alex Mathew
@Alex Mathew I've updated the code in my answer to show the href of the "a" tag
naikus
@Naikus: It just prints something like "product.html", not "http://www.aaaa.com/products.html". How can we make the output look like that? Any help?
Alex Mathew
You will have to build the full URL from the href attribute and the document's URL.
naikus
@Naikus: How can I do that? Can you rewrite the example?
Alex Mathew
@Alex Mathew I think you should put some effort into finding that out. I've shown you a way to do it, and now it's up to you to make modifications to it.
naikus
Can you please tell me where I should make the change? I'm not getting it.
Alex Mathew
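
For the point raised in the comments above about building the full URL from the href and the document's URL, one possible approach with only the JDK is java.net.URI.resolve. This is a minimal sketch, not naikus's code: the class name ResolveLinkDemo and the helper toAbsolute are hypothetical, and the page address is the placeholder from the comments. URI.create throws IllegalArgumentException for hrefs containing illegal characters such as spaces, so real pages may need extra handling.

```java
import java.net.URI;

public class ResolveLinkDemo {
    // Hypothetical helper: resolves an href against the URL of the page it
    // was found on. An href that is already absolute is returned unchanged.
    static URI toAbsolute(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href);
    }

    public static void main(String[] args) {
        // Placeholder page address taken from the comment thread (an assumption).
        String page = "http://www.aaaa.com/index.html";

        URI full = toAbsolute(page, "products.html");
        System.out.println(full);           // prints http://www.aaaa.com/products.html
        System.out.println(full.getHost()); // prints www.aaaa.com
    }
}
```

In the ParserCallback example, the page URL would be args[0] and the href would be the attribute value printed inside handleStartTag. getHost() gives the domain name that the question wants to store alongside each link.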
+2  A: 

Getting Links in an HTML Document

camickr
+3  A: 

Try using the jsoup library.

Download the project jar and compile this code snippet:

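    import java.net.URL;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    // Fetch and parse the page; the second argument is the timeout in milliseconds
    Document doc = Jsoup.parse(new URL("http://www.bits4beats.it/"), 2000);

    // Select only anchors that actually carry an href attribute
    Elements resultLinks = doc.select("a[href]");
    System.out.println("number of links: " + resultLinks.size());
    for (Element link : resultLinks) {
        System.out.println();
        // The "abs:" prefix resolves the href against the page URL
        String href = link.attr("abs:href");
        System.out.println("Title: " + link.text());
        System.out.println("Url: " + href);
    }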

The code prints the number of hyperlink elements in an HTML page, followed by the text and URL of each one.

Impiastro
This is definitely the way to go. Use a real HTML parser/extractor.
BalusC