I would like to use Java to get the source of a (secure) website and then parse it for the links it contains. I have found how to connect to the URL, but how can I easily get just the source, preferably as a DOM Document, so that I can easily get the info I want?

Or is there a better way to connect to an HTTPS site and get the source? I need the source to get a table of data (it's pretty simple); the links in that table are files I am going to download.

I wish it were FTP, but these are files stored on my TiVo (I want to programmatically download them to my computer).

+3  A: 

Try HttpUnit or HttpClient. Although the former is ostensibly for writing integration tests, it has a convenient API for programmatically iterating through a web page's links, with something like the following use of WebResponse.getLinks():

WebConversation wc = new WebConversation();
WebResponse resp = wc.getResponse("http://stackoverflow.com/questions/422970/");
WebLink[] links = resp.getLinks();
// Loop over array of links...
Peter Hilton
Good options and I would recommend adding HtmlUnit to the list.
bmatthews68
Don't abuse the purpose. HtmlUnit is a specialized library to do unit tests.
Adeel Ansari
@Adeel: I don't know what HtmlUnit is, but at the linked URL HtmlUnit says it is a "browser for Java programs" in the first paragraph, and the third paragraph lists typical usages as "testing purposes or to retrieve information from web sites". I don't see how what Adam wants contradicts this typical usage.
Hemal Pandya
+1  A: 

You can use javacurl to get the site's HTML, and the Java DOM API to analyze it.

Luca Matteis
+3  A: 

You can get low level and just request it with a socket. In Java it looks like this:

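import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.security.cert.X509Certificate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.net.ssl.SSLPeerUnverifiedException;
import javax.net.ssl.SSLSession;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

// The class name is arbitrary; it only makes the sample compile on its own.
public class SecureLinks {
    // args[0] = hostname
    // args[1] = path of the file to request, e.g. /index.html
    public static void main(String[] args) throws Exception {
        SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();

        SSLSocket sslsock = (SSLSocket) factory.createSocket(args[0], 443);

        // Asking for the session performs the SSL handshake if needed.
        SSLSession session = sslsock.getSession();
        X509Certificate cert;
        try {
            cert = (X509Certificate) session.getPeerCertificates()[0];
        } catch (SSLPeerUnverifiedException e) {
            System.err.println(session.getPeerHost() + " did not present a valid cert.");
            return;
        }

        // Now use the secure socket just like a regular socket to read pages.
        // The Host header is added because many servers need it to pick the right site.
        PrintWriter out = new PrintWriter(sslsock.getOutputStream());
        out.write("GET " + args[1] + " HTTP/1.0\r\nHost: " + args[0] + "\r\n\r\n");
        out.flush();

        BufferedReader in = new BufferedReader(new InputStreamReader(sslsock.getInputStream()));
        String line;
        String regExp = ".*<a href=\"(.*)\">.*";
        Pattern p = Pattern.compile( regExp, Pattern.CASE_INSENSITIVE );

        // The response headers are read too; they simply never match the pattern.
        while ((line = in.readLine()) != null) {
            // Using Oscar's RegEx.
            Matcher m = p.matcher( line );
            if( m.matches() ) {
                System.out.println( m.group(1) );
            }
        }

        sslsock.close();
    }
}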
Bernie Perez
Nice way to learn HTTP also.
OscarRyz
I like your answer the best and plan to try it out tomorrow. If it works I will accept it. My only question is: how do I send a username and password?
Adam Lerman
Hey Adam. This code connects to an HTTPS (secure) site with SSL. Usernames and passwords are site specific. It's almost like asking how to log in to Bank of America and expecting the same steps to work with WaMu's login, which is different. I hope you still accept my answer as correct since it's what you asked for.
Bernie Perez
+2  A: 

You could probably get better results from Pete's or sktrdie's options. Here's an additional way, if you would like to know how to do it "by hand".

I'm not very good at regex so in this case it returns the last link in a line. Well, it's a start.

import java.io.*;
import java.net.*;
import java.util.regex.*;

public class Links { 
    public static void main( String [] args ) throws IOException  { 

        URL url = new URL( args[0] );
        InputStream is = url.openConnection().getInputStream();

        BufferedReader reader = new BufferedReader( new InputStreamReader( is )  );

        String line = null;
        String regExp = ".*<a href=\"(.*)\">.*";
        Pattern p = Pattern.compile( regExp, Pattern.CASE_INSENSITIVE );

        while( ( line = reader.readLine() ) != null )  {
            Matcher m = p.matcher( line );  
            if( m.matches() ) {
                System.out.println( m.group(1) );
            }
        }
        reader.close();
    }
}
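For what it's worth, here is a minimal sketch of one way the pattern could be adjusted so that every link on a line is reported rather than only the last one. It swaps the greedy whole-line match for a narrower pattern used with Matcher.find(). The class name LinkFinder and the method findLinks are made up for this illustration, and the pattern assumes double-quoted href attributes; it is not a substitute for a real HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the sample above.
class LinkFinder {
    // Matches an opening <a ...> tag and captures the double-quoted href value.
    // [^\"]* stops at the first closing quote, so each link is captured on its own.
    private static final Pattern HREF =
            Pattern.compile("<a\\s+[^>]*?href=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Returns every href found in the given line, in order of appearance.
    static List<String> findLinks(String line) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(line);
        // find() scans forward for the next match instead of matching the whole line.
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String line = "<p><a href=\"one.html\">One</a> and "
                + "<a class=\"x\" href=\"two.html\">Two</a></p>";
        System.out.println(findLinks(line)); // prints [one.html, two.html]
    }
}
```

In the reading loop, the body of the while could then call findLinks(line) and print each entry.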

EDIT

Ooops I totally missed the "secure" part. Anyway I couldn't help it, I had to write this sample :P

OscarRyz
I thought he said he needed Secure Access support. Does url.openConnection support SSL?
Bernie Perez
Haha okay. Yeah I'll use your RegEx in my example if you don't mind.
Bernie Perez
Not at all, go ahead. It doesn't work very well though.
OscarRyz
A: 

There are two meanings of "source" in a web context:

The HTML source: If you request a webpage by URL, you always get the HTML source code. In fact, there is nothing else that you could get from the URL. Webpages are always transmitted in source form, there is no such thing as a compiled webpage. And for what you are trying, this should be enough to fulfill your task.

Script Source: If the webpage is dynamically generated, then it is coded in some server-side scripting language (like PHP, Ruby, JSP...). Source code also exists at this level, but over an HTTP connection you cannot get this kind of source code. This is not a missing feature; it is entirely by design.

Parsing: That said, you will need to parse the HTML code somehow. If you just need the links, using a RegEx (as Oscar Reyes showed) will be the most practical approach, but you could also write a simple parser "manually". It would be slower and take more code, but it works.

If you want to access the code on a more logical level, parsing it to a DOM would be the way to go. If the code is valid XHTML you can just parse it to an org.w3c.dom.Document and do anything with it. If it is at least valid HTML you might apply some tricks to convert it to XHTML (in some rare cases, replacing <br> by <br/> and changing the doctype is enough) and use it as XML.
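As a rough sketch of the XHTML case, the following uses the JDK's built-in XML parser to turn a well-formed page into an org.w3c.dom.Document and collect the href of every a element. The class name XhtmlLinks and the method hrefs are invented for this example. It assumes the markup is well-formed XML and carries no DOCTYPE, since with a DOCTYPE the default parser may try to download the referenced DTD.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration only.
class XhtmlLinks {
    // Parses well-formed XHTML and returns the href of every <a> element.
    static List<String> hrefs(String xhtml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));

            List<String> result = new ArrayList<>();
            NodeList anchors = doc.getElementsByTagName("a");
            for (int i = 0; i < anchors.getLength(); i++) {
                Element anchor = (Element) anchors.item(i);
                if (anchor.hasAttribute("href")) {
                    result.add(anchor.getAttribute("href"));
                }
            }
            return result;
        } catch (Exception e) {
            // Markup that is not well-formed XML ends up here.
            throw new IllegalArgumentException("Could not parse input as XML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><a href=\"a.mpg\">A</a><br/>"
                + "<a href=\"b.mpg\">B</a></body></html>";
        System.out.println(hrefs(page)); // prints [a.mpg, b.mpg]
    }
}
```

The same Document could also be walked to read the rows of a table, which is closer to what the question describes.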

If it's not valid XML, you would need an HTML DOM parser. I have no idea whether such a thing exists for Java or whether it performs well.

Brian Schimmel
PS: Sorry I didn't go into the details of how you do the specific tasks, but I had the feeling some basic things should be pointed out first. If you know exactly what to do, you will find out the details easily.
Brian Schimmel