ansaurus

Question

How do I find the hyperlinks in a web page using Java?

Answer 1

+2 A:

The best option is to use an HTML parser library. If you don't want to use a third-party library, you can try matching with a regular expression using Java's Pattern and Matcher classes from the java.util.regex package.

Edit Example:

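// Capture the value of every double-quoted href attribute in group 1.
// (The earlier pattern started with \b, which skipped hrefs that begin
// with a non-word character such as "/" or "#".)
String regex = "href=\"([^\"]*)\"";
Pattern pattern = Pattern.compile(regex);

Matcher m = pattern.matcher(str_YourHtmlHere);
while(m.find()) {
  System.out.println("FOUND: " + m.group(1));
}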

The example above uses a simple regex that finds every link given by a double-quoted href attribute. You may have to enhance the regex to handle other cases correctly, such as an href whose URL is in single quotes.

Gopi 2010-08-01 18:12:57

@Gopi: thanks for the info. Can you give me any examples?

Alex Mathew 2010-08-01 18:15:09

Edited to add example

Gopi 2010-08-01 18:44:31

Can you give me a full example, including the imports and so on?

Alex Mathew 2010-08-02 12:35:52
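
For readers who want the regex approach as a complete program, here is a minimal sketch with the imports included. The class name HrefRegexDemo, the method extractHrefs, and the sample HTML are made up for illustration and are not from the answer. The pattern is one possible enhancement that accepts single or double quotes; it is still not a real HTML parser, so it will also match things like a data-href attribute or an href inside a comment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name is an assumption, not part of the answer.
class HrefRegexDemo {

    // Matches href="..." or href='...', ignoring case and allowing spaces
    // around '='. Group 1 is the opening quote, group 2 is the attribute value.
    private static final Pattern HREF = Pattern.compile(
            "\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE);

    // Returns the raw href values in the order they appear in the HTML.
    static List<String> extractHrefs(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        // Sample input; in practice this would be the downloaded page source.
        String html = "<a href=\"http://example.com/a.html\">A</a>"
                + "<a href='/b.html'>B</a>";
        for (String link : extractHrefs(html)) {
            System.out.println("FOUND: " + link);
        }
    }
}
```

The values come back exactly as written in the page, so relative links such as /b.html stay relative.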

Answer 2

+4 A:

You can use the javax.swing.text.html and javax.swing.text.html.parser packages to achieve this:

import java.io.*;
import java.net.URL;
import java.util.Enumeration;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;

public class Test {
   public static void main(String[] args) throws Exception  {
      Reader r = null;

      try   {
         URL u = new URL(args[0]);
         InputStream in = u.openStream();
         r = new InputStreamReader(in);

         ParserDelegator hp = new ParserDelegator();
         hp.parse(r, new HTMLEditorKit.ParserCallback() {
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
               // Only anchor tags carry the links we are looking for.
               if(t == HTML.Tag.A)  {
                  Enumeration<?> attrNames = a.getAttributeNames();
                  while(attrNames.hasMoreElements())    {
                      Object key = attrNames.nextElement();
                      if("href".equals(key.toString())) {
                          System.out.println(a.getAttribute(key));
                      }
                  }
               }
            }
         }, true);
      }finally {
         if(r != null)  {
            r.close();
         }
      }
   }
}

Compile and call it this way:

java Test http://www.oracle.com/technetwork/java/index.html

naikus 2010-08-01 18:49:35

@Naikus: it's not working. It prints "Found the A Tag!!! a" but does not show the HTML.

Alex Mathew 2010-08-02 12:27:30

@Naikus: sorry, not the HTML. It does not show the link.

Alex Mathew 2010-08-02 12:35:16

@Alex Mathew I've updated the code in my answer to show the href of the "a" tag

naikus 2010-08-02 13:20:38

@Naikus: it just prints something like "product.html", not "http://www.aaaa.com/products.html". How can we make the output look like that? Any help?

Alex Mathew 2010-08-02 16:05:40

You will have to build the full URL from the href attribute and the document's URL.

naikus 2010-08-02 18:06:09

@Naikus: how can I do that? Can you rewrite the example?

Alex Mathew 2010-08-03 06:28:04

@Alex Mathew I think you should put some effort into finding that out yourself. I've shown you a way to do it, and now it's up to you to modify it.

naikus 2010-08-03 06:46:04

Can you please tell me where I should make the change? I am not getting it.

Alex Mathew 2010-08-03 06:46:41
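
The step discussed in the comments above, combining the href value with the document's URL, could be sketched roughly as follows. This is an illustration rather than naikus's code: the class name LinkResolver and the method toAbsolute are hypothetical, and the sketch assumes the href is a syntactically valid URI reference (URI.create throws IllegalArgumentException for values containing spaces, for example) and that the document URL includes a path.

```java
import java.net.URI;

// Hypothetical helper; the names are assumptions made for this sketch.
class LinkResolver {

    // Resolves an href against the URL of the document it was found in.
    // Relative values are combined with the document URL; values that are
    // already absolute are returned unchanged.
    static String toAbsolute(String documentUrl, String href) {
        return URI.create(documentUrl).resolve(href.trim()).toString();
    }

    public static void main(String[] args) {
        String page = "http://www.aaaa.com/index.html";
        // Relative to the page's directory.
        System.out.println(toAbsolute(page, "products.html"));
        // Relative to the site root.
        System.out.println(toAbsolute(page, "/about.html"));
        // Already absolute.
        System.out.println(toAbsolute(page, "http://other.example/x.html"));
    }
}
```

In the parser callback from the answer, the place to apply this would be where the href is printed: pass args[0] as the document URL and the attribute value (converted with toString()) as the href.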

Answer 3

+2 A:

Getting Links in an HTML Document

camickr 2010-08-01 19:02:24

Answer 4

+3 A:

Try using the jsoup library.

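Download the project jar, put it on the classpath, and compile this code snippet (the statements go inside a method that declares throws IOException):

    import java.net.URL;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    // Fetch and parse the page; 2000 is the timeout in milliseconds.
    Document doc = Jsoup.parse(new URL("http://www.bits4beats.it/"), 2000);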

    Elements resultLinks = doc.select("a");
    System.out.println("number of links: " + resultLinks.size());
    for (Element link : resultLinks) {
        System.out.println();
        String href = link.attr("href");
        System.out.println("Title: " + link.text());
        System.out.println("Url: " + href);
    }

The code prints the number of hyperlink elements in an HTML page, followed by the link text and href of each one.

Impiastro 2010-08-02 08:15:05

This is definitely the way to go. Use a real HTML parser/extractor.

BalusC 2010-08-03 19:27:07
