I am using Apache HttpClient 4 for all of my web access. This means that every request I make has to pass the URI syntax checks. One of the sites that I am trying to access uses UNICODE as the encoding for its URL GET params, e.g.:

http://maya.tase.co.il/bursa/index.asp?http://maya.tase.co.il/bursa/index.asp?view=search&company_group=147&srh_txt=%u05E0%u05D9%u05D1&arg_comp=&srh_from=2009-06-01&srh_until=2010-02-16&srh_anaf=-1&srh_event=9999&is_urgent=0&srh_company_press=

(the param "srh_txt=%u05E0%u05D9%u05D1" encodes srh_txt=ניב in UNICODE)

The problem is that URI doesn't support UNICODE encoding (it only supports UTF-8). The really big issue here is that this site expects its params to be encoded in UNICODE, so any attempt to convert the URL using String.format("http://...srh_txt=%s&...", URLEncoder.encode("ניב", "UTF8")) results in a URL which is legal and can be used to construct a URI, but the site responds to it with an error message, since it's not the encoding that it expects.

By the way, a URL object can be created and even used to connect to the web site using the unconverted URL. Is there any way of creating a URI in a non-UTF-8 encoding? Is there any way of using Apache HttpClient 4 with a regular URL (and not a URI)?

thanks, Niv

A: 

(the param "srh_txt=%u05E0%u05D9%u05D1" encodes srh_txt=ניב in UNICODE)

It doesn't really. That's not URL-encoding and the sequence %u is invalid in a URL.
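
To see this from the Java side, here is a minimal sketch that checks whether java.net.URI will parse a string. The class name, the helper name and the shortened example.com URLs are illustrative assumptions, not taken from the question. java.net.URI expects every % to be followed by two hex digits, so the %u form is rejected, which is presumably the URISyntaxException you are seeing.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical demo class; names and URLs are illustrative only.
class PercentUCheck {
    // Returns true if java.net.URI can parse the string, false otherwise.
    static boolean parsesAsUri(String s) {
        try {
            new URI(s);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // '%' followed by 'u' is not a valid escape pair, so parsing fails.
        System.out.println(parsesAsUri("http://example.com/?srh_txt=%u05E0%u05D9%u05D1"));
        // Ordinary percent-encoded bytes (here the UTF-8 form) parse fine.
        System.out.println(parsesAsUri("http://example.com/?srh_txt=%D7%A0%D7%99%D7%91"));
    }
}
```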

%u05E0%u05D9%u05D1 encodes ניב only in JavaScript's oddball escape syntax. escape is the same as URL-encoding for all ASCII characters except for +, but the %u#### escapes it produces for Unicode characters are completely of its own invention.

(One should, in general, never use escape. Using encodeURIComponent instead produces the correct URL-encoded UTF-8, ניב=%D7%A0%D7%99%D7%91.)
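
For comparison, here is a rough Java sketch of the two forms. The class and both helpers are hypothetical names written for this illustration. percentEncodeUtf8 only approximates encodeURIComponent (it leaves the RFC 3986 unreserved characters alone, a slightly smaller set than JavaScript's), and jsEscape imitates escape as I understand its rules: a few ASCII characters pass through, other code units below 256 become %XX, and everything else becomes %uXXXX.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; both helpers are illustrative approximations.
class EscapeVsUtf8 {
    static boolean isAsciiAlnum(int c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9');
    }

    // Percent-encodes the UTF-8 bytes of s, keeping only RFC 3986 unreserved characters.
    static String percentEncodeUtf8(String s) {
        StringBuilder out = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF;
            if (isAsciiAlnum(v) || "-._~".indexOf(v) >= 0) {
                out.append((char) v);
            } else {
                out.append(String.format("%%%02X", v));
            }
        }
        return out.toString();
    }

    // Imitates JavaScript's escape(): works on UTF-16 code units, not on bytes.
    static String jsEscape(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (isAsciiAlnum(c) || "@*_+-./".indexOf(c) >= 0) {
                out.append(c);
            } else if (c < 256) {
                out.append(String.format("%%%02X", (int) c));
            } else {
                out.append(String.format("%%u%04X", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String niv = "\u05E0\u05D9\u05D1"; // the three Hebrew letters from the question
        System.out.println(percentEncodeUtf8(niv)); // %D7%A0%D7%99%D7%91
        System.out.println(jsEscape(niv));          // %u05E0%u05D9%u05D1
    }
}
```

If you really had to talk to such a site, something like jsEscape could build the query string by hand, but the result still would not be accepted by java.net.URI.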

If a site requires %u#### sequences in its query string, it is very badly broken.

Is there any way of creating URI in non UTF-8 encoding?

Yes, URIs may use any character encoding you like. It is conventionally UTF-8; that's what IRI requires and what browsers will usually submit if the user types non-ASCII characters into the address bar, but URI itself concerns itself only with bytes.

So you could convert ניב to %F0%E9%E1. There would be no way for the web app to tell that those bytes represented characters encoded in code page 1255 (Hebrew, similar to ISO-8859-8). But it does appear to work, on the link above, which the UTF-8 version does not. Oh dear!
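
A minimal sketch of that conversion in Java, assuming your JRE ships the windows-1255 charset (it is not one of the charsets every Java platform must support, hence the isSupported check). The class and method names are hypothetical. The helper percent-encodes whatever bytes the chosen charset produces, and the result is an ordinary %XX string that java.net.URI should accept.

```java
import java.nio.charset.Charset;

// Hypothetical demo class; percentEncode is an illustrative helper.
class LegacyCharsetEncode {
    // Percent-encodes s using the bytes of the named charset,
    // keeping only RFC 3986 unreserved ASCII characters as-is.
    static String percentEncode(String s, String charsetName) {
        StringBuilder out = new StringBuilder();
        for (byte b : s.getBytes(Charset.forName(charsetName))) {
            int v = b & 0xFF;
            boolean alnum = (v >= 'A' && v <= 'Z') || (v >= 'a' && v <= 'z') || (v >= '0' && v <= '9');
            if (alnum || "-._~".indexOf(v) >= 0) {
                out.append((char) v);
            } else {
                out.append(String.format("%%%02X", v));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String niv = "\u05E0\u05D9\u05D1"; // the three Hebrew letters from the question
        // Assumption: windows-1255 is available in this JRE.
        if (Charset.isSupported("windows-1255")) {
            System.out.println(percentEncode(niv, "windows-1255")); // %F0%E9%E1
        }
        System.out.println(percentEncode(niv, "UTF-8")); // %D7%A0%D7%99%D7%91
    }
}
```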

bobince
Thanks a lot for the answer. I am running my code on a list of URLs and sometimes get URISyntaxExceptions, so I am trying to find ways to work around some of them (the exception in the last post was just one of the URLs in the list). I think this issue will be considered "not workaroundable" in code (without putting lots of time into it), so I'll just go on to the next one. Thanks again, Niv
Niv