ansaurus

Question

Download a web page without character replacement

Answer 1

+4 A:

The Java part is working perfectly fine.

Chrome is tricking you there. In Firefox, when I select View -> Page Source, I see this:

<a href="http://www.generation276.org/film/?m=200812&#038;paged=3" >
Pagina successiva &raquo;</a>

while with Firebug / Inspect Element I see this:

<a href="http://www.generation276.org/film/?m=200812&paged=3" style="">
Pagina successiva »</a>

and it copies to the clipboard as this:

<a href="http://www.generation276.org/film/?m=200812&amp;paged=3" style="">
Pagina successiva »</a>

Browsers don't always show you what's really there.

The second part of your question is identical to this previous Question:

Java: How to decode HTML character entities in Java like HttpUtility.HtmlDecode?

And hence the answer is also the same:

Use StringEscapeUtils.unescapeHtml(String) from the Apache Commons Lang project (the method is named unescapeHtml4 in Commons Lang 3).
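To show what such a decoder does to the link above, here is a minimal, dependency-free sketch. It is not the Commons Lang implementation: the class name EntityDecoder and the method decode are hypothetical, and it knows only numeric character references plus a handful of named entities, whereas HTML defines many more. For real code the library call remains the better choice.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of any library. */
final class EntityDecoder {
    // Matches decimal (&#38;), hexadecimal (&#x26;) and named (&amp;) references.
    private static final Pattern REFERENCE =
            Pattern.compile("&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);");

    // Assumption: only these few named entities are supported in this sketch.
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
        NAMED.put("raquo", "\u00BB");
    }

    /** Decodes each reference exactly once, left to right (no repeated passes). */
    static String decode(String html) {
        Matcher m = REFERENCE.matcher(html);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(html, last, m.start());
            out.append(resolve(m.group(1), m.group()));
            last = m.end();
        }
        out.append(html, last, html.length());
        return out.toString();
    }

    // Returns the decoded text, or the original reference if it is not recognised.
    private static String resolve(String body, String original) {
        if (body.charAt(0) == '#') {
            boolean hex = body.charAt(1) == 'x' || body.charAt(1) == 'X';
            try {
                int codePoint = hex
                        ? Integer.parseInt(body.substring(2), 16)
                        : Integer.parseInt(body.substring(1), 10);
                if (Character.isValidCodePoint(codePoint)) {
                    return new String(Character.toChars(codePoint));
                }
            } catch (NumberFormatException e) {
                // Number too large for an int: fall through and keep the text as-is.
            }
            return original;
        }
        String value = NAMED.get(body);
        return value != null ? value : original;
    }
}
```

Calling EntityDecoder.decode on the href as it appears in the page source turns the &#038; back into a plain ampersand. Because the text is decoded in a single pass, an already-escaped &amp;amp; becomes &amp; rather than collapsing all the way to &.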

seanizer 2010-09-15 23:49:41

So you say that the java code works. Ok, how can I "unescape" the url? I mean in a general way. Thanks

s.susini 2010-09-16 00:59:53

see my updated answer

seanizer 2010-09-16 06:40:07

Answer 2

+2 A:

The actual source of that page does say:

<a href="http://www.generation276.org/film/?m=200812&#038;paged=2" >Pagina successiva &raquo;</a>

and this is perfectly fine. &#038; is a valid character reference for a literal ampersand character in HTML, although the entity reference &amp; is generally more common.

<a href="http://www.generation276.org/film/?m=200812&paged=2" >Pagina successiva &raquo;</a>

This, with its unescaped ampersand, is invalid HTML.

When you save ‘HTML only’, Chrome saves the original HTML source without change. When you save ‘Complete’, it has to re-write the page to change references to other resources.

Unfortunately the serialisation process involved in this appears to have a bug: it fails to &-escape the ampersands in the URL. Whilst browsers typically let you get away with this, it will break (mangling your URL) if the word to the right of the ampersand happens to make a valid HTML entity name or character reference.
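The escaping step that a correct serialiser performs when writing an attribute value can be sketched as follows. This is an illustration only, not Chrome's code; AttributeEscaper and escapeAttribute are hypothetical names.

```java
/** Hypothetical helper for illustration; not taken from any browser or library. */
final class AttributeEscaper {
    /** Escapes a raw value so it can be written inside a double-quoted HTML attribute. */
    static String escapeAttribute(String value) {
        StringBuilder out = new StringBuilder(value.length() + 16);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;   // the step the buggy serialiser skips
                case '"': out.append("&quot;"); break;  // would otherwise end the attribute
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }
}
```

Passing the real URL, which contains a plain ampersand before paged=2, yields a value with &amp; in its place. That value is safe to put in an href whatever parameter name follows the ampersand.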

Other places where Chrome serialises attribute values, such as innerHTML, do not suffer from this rather poor bug.

ETA:

I have to "unescape" the &#038;... how can I do?

If you scrape information out of the source using regex, you have to decode it yourself with an HTML decoder. There isn't one built into Java, so you would need a third-party tool such as the one from Apache Commons linked by seanizer.

However, scraping with regex is crude and unreliable. I would strongly suggest using an HTML parser to load the file and pick out the data you want. It will deal with decoding attribute values and text content.
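As a rough sketch of the parser approach, the following uses the old HTML parser that ships with the JDK in javax.swing.text.html. It is chosen here only because it needs no extra dependency; it targets a dated HTML dialect, and a dedicated parser library would be the usual choice for real scraping. LinkExtractor and extractHrefs are hypothetical names. The parser is expected to hand back attribute values with character references already resolved, so no separate unescaping step should be needed.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

/** Hypothetical helper for illustration. */
final class LinkExtractor {
    /** Returns the href value of every <a> element found in the given markup. */
    static List<String> extractHrefs(String html) {
        final List<String> hrefs = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        hrefs.add(href.toString());
                    }
                }
            }
        };
        try (Reader reader = new StringReader(html)) {
            // The last argument tells the parser to ignore charset declarations in the markup.
            new ParserDelegator().parse(reader, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hrefs;
    }
}
```

Feeding it the anchor from the page source should give back the URL with a plain ampersand, ready to be requested as-is.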

bobince 2010-09-15 23:51:17

s.susini 2010-09-16 01:02:16
