ansaurus

Question

Answer 1

+3 A:

Try HttpUnit or HttpClient. Although the former is ostensibly for writing integration tests, it has a convenient API for programmatically iterating through a web page's links, with something like the following use of WebResponse.getLinks():

WebConversation wc = new WebConversation();
WebResponse resp = wc.getResponse("http://stackoverflow.com/questions/422970/");
WebLink[] links = resp.getLinks();
// Loop over array of links...

Peter Hilton 2009-01-08 02:05:45

Good options and I would recommend adding HtmlUnit to the list.

bmatthews68 2009-01-08 02:16:26

Don't abuse the purpose. HtmlUnit is a specialized library to do unit tests.

Adeel Ansari 2009-01-08 02:39:25

@Adeel: I don't know what HtmlUnit is, but at the linked URL HtmlUnit describes itself as a "browser for Java programs" in the first paragraph, and the third paragraph lists typical usages as "testing purposes or to retrieve information from web sites". I don't see how what Adam wants contradicts this typical usage.

Hemal Pandya 2009-01-08 05:55:29

Answer 2

+1 A:

You can use javacurl to get the site's HTML, and the Java DOM API to analyze it.

Luca Matteis 2009-01-08 02:07:35

Answer 3

+3 A:

You can get low level and just request it with a socket. In Java it looks like this:

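import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.security.cert.X509Certificate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.net.ssl.SSLPeerUnverifiedException;
import javax.net.ssl.SSLSession;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

// args[0] = hostname, e.g. example.com
// args[1] = path starting with "/", e.g. /index.html
// The class name SecureLinks is only a placeholder.
public class SecureLinks {
    public static void main(String[] args) throws Exception {
        SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();

        // Port 443 is the default HTTPS port.
        SSLSocket sslsock = (SSLSocket) factory.createSocket(args[0], 443);

        // Fetching the peer certificate fails if the server could not be verified.
        SSLSession session = sslsock.getSession();
        X509Certificate cert;
        try {
            cert = (X509Certificate) session.getPeerCertificates()[0];
        } catch (SSLPeerUnverifiedException e) {
            System.err.println(session.getPeerHost() + " did not present a valid cert.");
            return;
        }

        // Now use the secure socket just like a regular socket to read pages.
        // The Host header is included because many servers need it to pick the right site.
        PrintWriter out = new PrintWriter(sslsock.getOutputStream());
        out.write("GET " + args[1] + " HTTP/1.0\r\nHost: " + args[0] + "\r\n\r\n");
        out.flush();

        BufferedReader in = new BufferedReader(new InputStreamReader(sslsock.getInputStream()));
        String line;
        String regExp = ".*<a href=\"(.*)\">.*";
        Pattern p = Pattern.compile( regExp, Pattern.CASE_INSENSITIVE );

        // The response headers are read too; they simply never match the pattern.
        while ((line = in.readLine()) != null) {
            // Using Oscar's RegEx.
            Matcher m = p.matcher( line );
            if( m.matches() ) {
                System.out.println( m.group(1) );
            }
        }

        sslsock.close();
    }
}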

Bernie Perez 2009-01-08 02:45:14

Nice way to learn HTTP also.

OscarRyz 2009-01-08 03:14:49

I like your answer the best and plan to try it out tomorrow. If it works I will accept it. My only question is: how do I send a username and password?

Adam Lerman 2009-01-08 05:28:08

Hey Adam. This code connects to an HTTPS (secure) site over SSL. Usernames and passwords are site-specific. It's almost like asking how to log in to Bank of America and expecting the same steps to work with WaMu's login, which is different. I hope you still accept my answer as correct, since it's what you asked for.

Bernie Perez 2009-01-08 17:53:24
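
As noted above, logins are site-specific, so there is no single answer. If, and only if, the site happens to use HTTP Basic authentication, the credentials travel in an Authorization header, and the socket request could be extended along these lines. This is a minimal sketch under that assumption; the class name BasicAuthRequest, its method names, and the host, path, user and password values are all placeholders, not part of the original answer. A site with a form-based login would need a different approach (posting the form and keeping the session cookie).

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper: assumes the site uses HTTP Basic authentication.
final class BasicAuthRequest {

    // Builds the Authorization header value: "Basic " + base64("user:password").
    static String basicAuthHeader(String user, String password) {
        String credentials = user + ":" + password;
        return "Basic " + Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
    }

    // Builds a complete HTTP/1.0 GET request that carries the credentials.
    static String buildRequest(String host, String path, String user, String password) {
        return "GET " + path + " HTTP/1.0\r\n"
                + "Host: " + host + "\r\n"
                + "Authorization: " + basicAuthHeader(user, password) + "\r\n"
                + "\r\n";
    }

    public static void main(String[] args) {
        // Placeholder values for illustration only.
        System.out.print(buildRequest("example.com", "/index.html", "user", "secret"));
    }
}
```

The string returned by buildRequest would be written to the socket's output stream in place of the plain GET line. Basic authentication only encodes the credentials, it does not encrypt them, so it should be sent over the SSL socket and never over a plain one.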

Answer 4

+2 A:

You could probably get better results from Pete's or sktrdie's options. Here's an additional way, if you would like to know how to do it "by hand".

I'm not very good at regex, so in this case it returns only the last link in a line (the leading .* is greedy and swallows any earlier links). Well, it's a start.

import java.io.*;
import java.net.*;
import java.util.regex.*;

public class Links { 
    public static void main( String [] args ) throws IOException  { 

        URL url = new URL( args[0] );
        InputStream is = url.openConnection().getInputStream();

        BufferedReader reader = new BufferedReader( new InputStreamReader( is )  );

        String line = null;
        String regExp = ".*<a href=\"(.*)\">.*";
        Pattern p = Pattern.compile( regExp, Pattern.CASE_INSENSITIVE );

        while( ( line = reader.readLine() ) != null )  {
            Matcher m = p.matcher( line );  
            if( m.matches() ) {
                System.out.println( m.group(1) );
            }
        }
        reader.close();
    }
}

EDIT

Ooops I totally missed the "secure" part. Anyway I couldn't help it, I had to write this sample :P
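
The regex in the sample above reports at most one link per line because matches() has to match the whole line and the leading .* is greedy. One possible way around that, sketched here rather than taken from the original answer, is to call find() repeatedly with a pattern that stops at the closing quote. The class name LinkFinder and the method findLinks are made up for this illustration, and the pattern only handles double-quoted href values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects every double-quoted href found in <a> tags on a line.
final class LinkFinder {

    // "<a" followed by whitespace, then any other attributes (lazily), then href="...".
    // [^"]* stops at the first closing quote, so each link is captured separately.
    private static final Pattern HREF = Pattern.compile(
            "<a\\s[^>]*?href=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    static List<String> findLinks(String line) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(line);
        // find() moves on to the next match each time, unlike matches().
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String line = "<p><a href=\"/one\">1</a> and <A class=\"x\" HREF=\"/two\">2</a></p>";
        for (String link : findLinks(line)) {
            System.out.println(link);
        }
    }
}
```

This is still a line-by-line regex, so it misses tags that are split across lines and href values in single quotes or without quotes. For anything beyond a quick script, a real HTML parser is the safer choice.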

OscarRyz 2009-01-08 03:12:04

I thought he said he needed Secure Access support. Does url.openConnection support SSL?

Bernie Perez 2009-01-08 03:21:59
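
On the question of SSL support: with a standard JDK, URL.openConnection() on an https:// URL returns an HttpsURLConnection, so the Links sample should work for secure pages without changes as long as the server's certificate is trusted. A small sketch to check this locally follows; the class name HttpsCheck and the method usesTls are invented for the illustration, and example.com is just a placeholder host. No network traffic happens here, because openConnection() only creates the connection object.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLConnection;
import javax.net.ssl.HttpsURLConnection;

// Hypothetical helper: reports whether a URL would be fetched over HTTPS.
final class HttpsCheck {

    static boolean usesTls(String address) {
        try {
            // openConnection() does not connect yet; it only picks the protocol handler.
            URLConnection conn = new URL(address).openConnection();
            return conn instanceof HttpsURLConnection;
        } catch (IOException e) {
            // Covers malformed URLs as well as handler failures.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(usesTls("https://example.com/"));
        System.out.println(usesTls("http://example.com/"));
    }
}
```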

Haha okay. Yeah I'll use your RegEx in my example if you don't mind.

Bernie Perez 2009-01-08 03:30:55

Not at all, go ahead. It doesn't work very well though.

OscarRyz 2009-01-08 03:55:33

Answer 5

+4 A:

Extremely similar questions:

matt b 2009-01-08 03:46:54

Answer 6

A:

There are two meanings of "source" in a web context:

The HTML source: If you request a webpage by URL, you always get the HTML source code. In fact, there is nothing else that you could get from the URL. Webpages are always transmitted in source form; there is no such thing as a compiled webpage. For what you are trying to do, this should be enough.

Script source: If the webpage is dynamically generated, then it is coded in some server-side scripting language (like PHP, Ruby, JSP...). Source code exists at this level too, but you cannot retrieve it over an HTTP connection. This is not a missing feature; it is entirely on purpose.

Parsing: Having said that, you will need to parse the HTML code somehow. If you just need the links, using a regex (as Oscar Reyes showed) will be the most practical approach, but you could also write a simple parser by hand. It would be slower and take more code, but it works.

If you want to access the code on a more logical level, parsing it into a DOM would be the way to go. If the code is valid XHTML, you can just parse it into an org.w3c.dom.Document and do anything with it. If it is at least valid HTML, you might apply some tricks to convert it to XHTML (in some rare cases, replacing <br> with <br/> and changing the doctype is enough) and use it as XML.
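
For the XHTML case, a minimal sketch using only the JDK's built-in XML parser could look like the following. The class name XhtmlLinks and the method hrefs are invented for this example, and it assumes the page is well-formed XML; anything else is rejected with an exception. Real pages with a DOCTYPE may also cause the parser to try to load the referenced DTD, which this sketch does not deal with.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: parses well-formed XHTML into a DOM and lists the link targets.
final class XhtmlLinks {

    static List<String> hrefs(String xhtml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

            // All <a> elements anywhere in the document, in document order.
            NodeList anchors = doc.getElementsByTagName("a");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < anchors.getLength(); i++) {
                Element a = (Element) anchors.item(i);
                // Skip anchors without an href, such as named anchors.
                if (a.hasAttribute("href")) {
                    result.add(a.getAttribute("href"));
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<p><a href=\"/a\">A</a><br/><a href=\"/b\">B</a></p></body></html>";
        for (String href : hrefs(page)) {
            System.out.println(href);
        }
    }
}
```

Compared with the regex approach, the DOM gives access to the whole document structure, at the cost of failing outright on markup that is not well-formed.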

If it's not valid XML, you would need an HTML DOM parser. I have no idea whether such a thing exists for Java or whether it performs well.

Brian Schimmel 2009-01-08 10:00:22

PS: Sorry I didn't go into the details of how you do the specific tasks, but I had the feeling some basic things should be pointed out first. Once you know exactly what to do, you will find out the details easily.

Brian Schimmel 2009-01-08 10:01:40


Get source of website in java
