I'm trying to download a web page in Java with the following:

// java.net.URL needs a scheme such as http://, otherwise it throws MalformedURLException
URL url = new URL("http://www.jksfljasdlfas.com");
File to = new File("/home/test/test.html");

Reader in = new InputStreamReader(url.openStream(), "UTF-8");
Writer out = new OutputStreamWriter(new FileOutputStream(to), "UTF-8");

int c;
while((c = in.read()) != -1){
    out.write(c);
}
in.close();
out.close();

When I download the page, some characters are replaced by entities.
This:
<a href="http://www.generation276.org/film/?m=200812&paged=2" >Pagina successiva &raquo;</a>
becomes this:
<a href="http://www.generation276.org/film/?m=200812&#038;paged=2" >Pagina successiva &raquo;</a>
When I download the same page with Chrome, the & remains &.
I'm new to charsets and encodings; can anybody see what the problem is?

+4  A: 

The Java part is working perfectly fine.
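To convince yourself of that, here is a minimal sketch of the same read/write loop, with in-memory streams standing in for the URL and the file. The class name CopyCheck and its copy method are made up for illustration. The loop copies characters one by one and never touches entities, so whatever the server sends is what ends up in the file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: runs the loop from the question on in-memory data.
final class CopyCheck {
    static String copy(String html) {
        ByteArrayInputStream source =
                new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8));
        ByteArrayOutputStream target = new ByteArrayOutputStream();
        // try-with-resources closes (and flushes) both streams, even on errors
        try (Reader in = new InputStreamReader(source, StandardCharsets.UTF_8);
             Writer out = new OutputStreamWriter(target, StandardCharsets.UTF_8)) {
            int c;
            while ((c = in.read()) != -1) {
                out.write(c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(target.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

Passing in a link that contains &#038; returns exactly the same text, so the reference was already in the page source.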

Chrome is tricking you there. In Firefox, when I select View -> Page Source, I see this:

<a href="http://www.generation276.org/film/?m=200812&#038;paged=3" >
Pagina successiva &raquo;</a>

while with Firebug / Inspect Element I see this:

<a href="http://www.generation276.org/film/?m=200812&paged=3" style="">
Pagina successiva »</a>

and it copies to the clipboard as this:

<a href="http://www.generation276.org/film/?m=200812&amp;paged=3" style="">
Pagina successiva »</a>

Browsers don't always show you what's really there.


The second part of your question is identical to this previous Question:

Java: How to decode HTML character entities in Java like HttpUtility.HtmlDecode?

And hence the answer is also the same:

Use StringEscapeUtils.unescapeHtml(String) from the Apache Commons Lang project (the method is named unescapeHtml4 in Commons Lang 3).

seanizer
So you're saying that the Java code works. OK, how can I "unescape" the URL in a general way? Thanks
s.susini
see my updated answer
seanizer
+2  A: 

The actual source of that page does say:

<a href="http://www.generation276.org/film/?m=200812&#038;paged=2" >Pagina successiva &raquo;</a>

and this is perfectly fine. &#038; is a valid character reference for a literal ampersand character in HTML, although the entity reference &amp; is generally more common.

<a href="http://www.generation276.org/film/?m=200812&paged=2" >Pagina successiva &raquo;</a>

This, on the other hand, is invalid HTML.

When you save ‘HTML only’, Chrome saves the original HTML source without change. When you save ‘Complete’, it has to re-write the page to change references to other resources.

Unfortunately the serialisation process involved in this appears to have a bug in failing to &-escape the ampersands in the URL. Whilst browsers typically let you get away with this, it will break (mangling your URL) if the word to the right of the ampersand happens to make a valid HTML entity name or character reference.

Other places where Chrome serialises attribute values, such as innerHTML, do not suffer from this rather poor bug.

ETA:

I have to "unescape" the &... how can I do?

If you try to scrape information out of the source using regex, you'd have to decode it manually using an HTML decoder. There isn't one built into Java, so you would need a third-party tool such as the one from Apache Commons linked by seanizer.
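To show what such a decoder does, here is a rough sketch that handles numeric character references plus a tiny, hand-picked table of named entities. The class name EntityDecoder and the table contents are assumptions for illustration only. Real HTML defines a couple of thousand named entities, so for anything serious a library is the better choice.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a complete HTML decoder.
final class EntityDecoder {
    // Deliberately tiny table of named entities (an assumption, far from complete).
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
        NAMED.put("raquo", "\u00BB");
    }

    // Matches &#38; (decimal), &#x26; (hex) or &name; and requires the semicolon.
    private static final Pattern REF = Pattern.compile(
            "&(?:#(\\d{1,7})|#[xX]([0-9a-fA-F]{1,6})|([A-Za-z][A-Za-z0-9]*));");

    static String decode(String s) {
        Matcher m = REF.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = m.group(); // unknown references are left untouched
            if (m.group(3) != null) {
                replacement = NAMED.getOrDefault(m.group(3), replacement);
            } else {
                int codePoint = (m.group(1) != null)
                        ? Integer.parseInt(m.group(1))
                        : Integer.parseInt(m.group(2), 16);
                if (Character.isValidCodePoint(codePoint)) {
                    replacement = new String(Character.toChars(codePoint));
                }
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

The input is decoded in a single pass, so a doubly escaped &amp;amp; comes out as &amp; rather than collapsing all the way to &.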

However, scraping with regex is crude and unreliable. I would strongly suggest using an HTML parser to load the file and pick out the data you want. It will deal with decoding attribute values and text content.
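As a sketch of the parser approach, the JDK itself ships an old HTML parser in the Swing packages. The class name LinkExtractor below is made up, and the Swing parser targets an old HTML dialect, so treat this as an illustration rather than a recommendation. A dedicated library such as jsoup is the more common choice for real pages. The point is that the parser hands you attribute values with character references already resolved.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the href of every <a> start tag.
final class LinkExtractor {
    static List<String> hrefs(String html) {
        List<String> result = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    // The parser has already resolved references inside the value
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        result.add(href.toString());
                    }
                }
            }
        };
        try {
            // true = ignore any charset declared inside the document
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }
}
```

Called on the link from the question, this should return the URL with a plain & between the query parameters, with no manual unescaping step.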

bobince
s.susini