views:

150

answers:

5

Hello everybody,

I'm trying to get an entire web page through a URLConnection.

What's the most efficient way to do this?

Here is what I'm doing already:

URL url = new URL("http://www.google.com/");
URLConnection connection = url.openConnection();
InputStream in = connection.getInputStream();
BufferedReader bf = new BufferedReader(new InputStreamReader(in));
StringBuffer html = new StringBuffer();
String line = bf.readLine();
while (line != null) {
    html.append(line);
    line = bf.readLine();
}
bf.close();

At the end, html holds the entire HTML page.

+8  A: 

I think this is the best way. The size of the page is fixed ("it is what it is"), so you can't improve on memory. Perhaps you can compress the contents once you have them, but they aren't very useful in that form. I would imagine that eventually you'll want to parse the HTML into a DOM tree.

Anything you do to parallelize the reading would overly complicate the solution.

I'd recommend using a StringBuilder with an initial capacity of 2048 or 4096.
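
For instance, a minimal tweak to the loop from the question (the initial capacity is just a starting point, not a measured value):

StringBuilder html = new StringBuilder(4096);
String line;
while ((line = bf.readLine()) != null) {
    html.append(line);
}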

What makes you think the code you posted isn't sufficient? It sounds like you might be guilty of premature optimization.

Run with what you have and sleep at night.

duffymo
Haha, thanks @duffymo. I was just wondering if there was a better way. Thanks.
santiagobasulto
+2  A: 

You can try using Commons IO from Apache (http://commons.apache.org/io/api-release/org/apache/commons/io/IOUtils.html):

new String(IOUtils.toCharArray(connection.getInputStream()))
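
A minimal sketch along those lines (assuming commons-io is on the classpath; the explicit charset is an assumption here, pick whatever the page actually uses):

import java.io.InputStream;
import java.net.URL;
import org.apache.commons.io.IOUtils;

InputStream in = new URL("http://www.google.com/").openConnection().getInputStream();
// Reads the whole stream into one String in a single call.
String html = IOUtils.toString(in, "UTF-8");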
krico
+1; you'll save a lot of code, and a lot of potential errors, just by using the very stable Apache Commons for this.
Dean J
String data = IOUtils.toString(connection.getInputStream()); seems easier
Steven
+1  A: 

Your approach looks pretty good; however, you can make it somewhat more efficient by avoiding the creation of an intermediate String object for each line.

The way to do this is to read directly into a temporary char[] buffer.

Here is a slightly modified version of your code that does this (minus all the error checking, exception handling etc. for clarity):

        URL url = new URL("http://www.google.com/");
        URLConnection connection = url.openConnection();
        InputStream in = connection.getInputStream();
        BufferedReader bf = new BufferedReader(new InputStreamReader(in));
        StringBuffer html = new StringBuffer();

        char[] charBuffer = new char[4096];
        int count = 0;

        do {
            // Read up to 4096 chars at a time and append them in one call,
            // so no per-line String objects are created.
            count = bf.read(charBuffer, 0, 4096);
            if (count >= 0) html.append(charBuffer, 0, count);
        } while (count > 0);
        bf.close();

For even more performance, you can of course do little extra things like pre-allocating the character array and StringBuffer if this code is going to be called frequently.
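
A rough sketch of that idea (a hypothetical helper, field names made up, and not thread-safe as written):

// Buffers allocated once and reused across calls.
// (Imports as in the code above, plus java.io.IOException.)
private final char[] charBuffer = new char[4096];
private final StringBuilder html = new StringBuilder(16384);

String fetch(URL url) throws IOException {
    html.setLength(0); // reset the builder instead of allocating a new one
    BufferedReader bf = new BufferedReader(
            new InputStreamReader(url.openConnection().getInputStream()));
    try {
        int count;
        while ((count = bf.read(charBuffer)) != -1) {
            html.append(charBuffer, 0, count);
        }
    } finally {
        bf.close();
    }
    return html.toString();
}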

mikera
P.S. You may also be able to read into the char[] buffer directly from the InputStreamReader. I haven't tested whether this works better or not, but it's worth considering, as you might eliminate an unnecessary layer of buffering.
mikera
Thanks! I'll try it. Just one question. Why 4096?
santiagobasulto
Hmmm 4096 was just an "educated guess". Too big would waste memory, too small would require too many separate iterations/reads for large files. You could always experiment to see if you can find a better value for your environment if you like.
mikera
html.append(charBuffer, 0, count);
EJP
@EJP good spot - I somehow missed that overload in the JavaDocs! Fixed answer to reflect.
mikera
It's still wrong. Consider what happens when count < 0.
EJP
+3  A: 

What do you want to do with the obtained HTML? Parse it? It may be good to know that any decent HTML parser already has a constructor or method argument that takes a URL or InputStream directly, so you don't need to worry about streaming performance like that.

Assuming that what you want to do is what you described in your previous question, then with Jsoup, for example, you could obtain all those news links extraordinarily easily, as follows:

Document document = Jsoup.connect("http://news.google.com.ar/nwshp?hl=es&tab=wn").get();
Elements newsLinks = document.select("h2.title a:eq(0)");
for (Element newsLink : newsLinks) {
    System.out.println(newsLink.attr("href"));
}

This yields the following after only a few seconds:

http://www.infobae.com/mundo/541259-100970-0-Pinera-confirmo-que-el-rescate-comenzara-las-20-y-durara-24-y-48-horas
http://www.lagaceta.com.ar/nota/403112/Argentina/Boudou-disculpo-con-DAIA-pero-volvio-cuestionar-medios.html
http://www.abc.es/agencias/noticia.asp?noticia=550415
http://www.google.com/hostednews/epa/article/ALeqM5i6x9rhP150KfqGJvwh56O-thi4VA?docId=1383133
http://www.abc.es/agencias/noticia.asp?noticia=550292
http://www.univision.com/contentroot/wirefeeds/noticias/8307387.shtml
http://noticias.terra.com.ar/internacionales/ecuador-apoya-reclamo-argentino-por-ejercicios-en-malvinas,3361af2a712ab210VgnVCM4000009bf154d0RCRD.html
http://www.infocielo.com/IC/Home/index.php?ver_nota=22642
http://www.larazon.com.ar/economia/Cristina-Fernandez-Censo-indispensable-pais_0_176100098.html
http://www.infobae.com/finanzas/541254-101275-0-Energeticas-llevaron-la-Bolsa-portena-ganancias
http://www.telam.com.ar/vernota.php?tipo=N&idPub=200661&id=381154&dis=1&sec=1
http://www.ambito.com/noticia.asp?id=547722
http://www.canal-ar.com.ar/noticias/noticiamuestra.asp?Id=9469
http://www.pagina12.com.ar/diario/cdigital/31-154760-2010-10-12.html
http://www.lanacion.com.ar/nota.asp?nota_id=1314014
http://www.rpp.com.pe/2010-10-12-ganador-del-pulitzer-destaca-nobel-de-mvll-noticia_302221.html
http://www.lanueva.com/hoy/nota/b44a7553a7/1/79481.html
http://www.larazon.com.ar/show/sdf_0_176100096.html
http://www.losandes.com.ar/notas/2010/10/12/batista-siento-comodo-dieron-respaldo-520595.asp
http://deportes.terra.com.ar/futbol/los-rumores-empiezan-a-complicar-la-vida-de-river-y-vuelve-a-sonar-gallego,a24483b8702ab210VgnVCM20000099f154d0RCRD.html
http://www.clarin.com/deportes/futbol/Exigieron-Roman-regreso-Huracan_0_352164993.html
http://www.el-litoral.com.ar/leer_noticia.asp?idnoticia=146622
http://www.nuevodiarioweb.com.ar/nota/181453/Locales/C%C3%A1ncer_mama:_200_casos_a%C3%B1o_Santiago.html
http://www.ultimahora.com/notas/367322-Funcionarios-sanitarios-capacitaran-sobre-cancer-de-mama
http://www.lanueva.com/hoy/nota/65092f2044/1/79477.html
http://www.infobae.com/policiales/541220-101275-0-Se-suspendio-la-declaracion-del-marido-Fernanda-Lemos
http://www.clarin.com/sociedad/educacion/titulo_0_352164863.html

Did someone already say that regex is absolutely the wrong tool for parsing HTML? ;)

BalusC
Thanks @BalusC, you were straight to the point. I didn't say that before because I didn't want to mess things up, and because I was interested in the "general" answer. But that really helps me out. I've been benchmarking different parsers. The Swing HTMLEditorKit seems to be the best one. Do you have any experience with that? Do you think Jsoup is better? Thanks for your answer!!
santiagobasulto
I find that Jsoup has a much slicker API. I like it: selecting HTML elements in Java using jQuery-like selectors. The performance difference is IMO negligible.
BalusC
@BalusC Thanks very much. You're right about the API. It's really neat (I was using HTMLParser too, and it's not that simple). Just one last question: is there any difference (in performance) between Jsoup.connect() and Jsoup.parse()? The parse() docs say that connect() should be used.
santiagobasulto
Ah right, that was new in the 1.3 API. Yes, use it instead. I'll update the answer accordingly.
BalusC
Great. I've read your blog. Are you in Venezuela? I'm in Argentina. Saludos amigo! Muchas gracias!
santiagobasulto
No, in [Curaçao](http://en.wikipedia.org/wiki/Curacao), an island in the Caribbean near Venezuela. Saludos!
BalusC
@BalusC I'll keep using the HTMLEditorKit. It's way faster (I've benchmarked several parsers). Jsoup takes 5000 millis to get all those links whereas HTMLEditorKit takes 1000. Thanks for your help!!
santiagobasulto
+1  A: 

There are some technical considerations. You may wish to use HttpURLConnection instead of URLConnection.

HttpURLConnection supports chunked transfer encoding, which allows you to process the data in chunks rather than buffering all of the content before you start doing work. This can lead to an improved user experience.

Also, HttpURLConnection supports persistent connections. Why close the connection if you're going to request another resource right away? Keeping the TCP connection open with the web server allows your application to quickly download multiple resources without paying the overhead (latency) of establishing a new TCP connection for each one.
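
A rough sketch of how that plays out (the paths are just placeholders; the JDK reuses the underlying socket transparently as long as each response body is read to the end and the stream, rather than the connection, is closed):

for (String path : new String[] { "/", "/some/other/page" }) {
    HttpURLConnection conn =
            (HttpURLConnection) new URL("http://www.google.com" + path).openConnection();
    InputStream in = conn.getInputStream();
    byte[] buffer = new byte[4096];
    while (in.read(buffer) != -1) {
        // consume the body here (append or parse it as in the other examples)
    }
    in.close(); // close the stream, not disconnect(), so the socket can be reused
}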

Tell the server that you support gzip, and wrap a BufferedReader around a GZIPInputStream if the response header says the content is compressed.
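
And a rough sketch of the gzip part (the header names are standard HTTP; the rest mirrors the setup from the question):

HttpURLConnection connection = (HttpURLConnection) new URL("http://www.google.com/").openConnection();
connection.setRequestProperty("Accept-Encoding", "gzip");

InputStream raw = connection.getInputStream();
// Only wrap in GZIPInputStream (java.util.zip) if the server actually compressed the response.
InputStream in = "gzip".equalsIgnoreCase(connection.getContentEncoding())
        ? new GZIPInputStream(raw)
        : raw;
BufferedReader bf = new BufferedReader(new InputStreamReader(in));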

Marcus Adams
Great! I'll take that into account. Really good hints.
santiagobasulto