views:

163

answers:

3

Hi there,

My ColdFusion (MX7 on IIS 6) site has search functionality which appends the search term to the URL e.g. http://www.example.com/search.cfm/searchterm.

The problem I'm running into is this is a multilingual site, so the search term may be in another language e.g. القاهرة leading to a search URL such as http://www.example.com/search.cfm/القاهرة

The problem is when I come to retrieve the search term from the URL. I'm using cgi.PATH_INFO to retrieve the path of the search page and the search term and extracting the search term from this e.g. /search.cfm/searchterm however, when unicode characters are used in the search they are converted to question marks e.g. /search.cfm/??????.

These appear actual question marks, rather than the browser not being able to format unicode characters, or them being mangled on output.

I can't find any information about whether ColdFusion supports unicode in the URL, or how I can go about resolving this and getting hold of the complete URL in some way - does anyone have any ideas?

Cheers,

Tom

Edit: Further research has lead me to believe the issue may related to IIS rather than ColdFusion, but my original query still stands.

Further edit

The result of GetPageContext().GetRequest().GetRequestUrl().ToString() is http://www.example.com/search.cfm/searchterm/????? so it appears the issue goes fairly deep.

A: 

You can set the character encoding of the URL and FORM scope using the setEncoding() function:

http://www.adobe.com/livedocs/coldfusion/7/htmldocs/wwhelp/wwhimpl/common/html/wwhelp.htm?context=ColdFusion_Documentation&file=00000623.htm

You need to do this before you access any of the variables in this scope.

But, the default encoding of those scopes is already UTF-8, so this may not help. Also, this would probably not affect the CGI scope.

Is the IIS Server logging the correct characters into the request log?

Edward M Smith
+2  A: 

Here's what you could do:

<cfset url.searchTerm = URLEncodedFormat("القاهر", "utf-8") >

<cfset myVar = URLDecode(url.searchTerm , "utf-8") >

Ofcourse, I'd recommend that you work with something like this in that case:

yourtemplate.cfm?searchTerm=%C3%98%C2%A7%C3%99%E2%80%9E

And then you do URL rewriting in IIS (if not already done by framework/rest of the app) http://learn.iis.net/page.aspx/461/creating-rewrite-rules-for-the-url-rewrite-module/ to match your pattern.

zarko.susnjar
+2  A: 

Yeah, it's not really ColdFusion's fault. It's a common problem.

It's mostly the fault of the original CGI specification, which specifies that PATH_INFO has to be %-decoded, thus losing the original %xx byte sequences that would have allowed you to work out which real characters were meant.

And it's partly IIS's fault, because it always tries to read submitted %xx bytes in the path part as UTF-8-encoded Unicode (unless the path isn't a valid UTF-8 byte sequence in which case it plumps for the Windows default code page, but gives you no way to find out this has happened). Having done so, it puts it in environment variables as a Unicode string (as envvars are Unicode under Windows).

However most byte-based tools using the C stdio (and I'm assuming this applies to ColdFusion, as it does under Perl, Python 2, PHP etc.) then try to read the environment variables as bytes, and the MS C runtime encodes the Unicode contents again using the Windows default code page. So any characters that don't fit in the default code page are lost for good. This would include your Arabic characters when running on a Western Windows install.

A clever script that has direct access to the Win32 GetEnvironmentVariableW API could call that to retrieve a native-Unicode environment variable which they could then encode to UTF-8 or whatever else they wanted, assuming that the input was also UTF-8 (which is what you'd generally want today). However, I don't think CodeFusion gives you this access, and in any case it only works from IIS6 onwards; IIS5.x will throw away any non-default-codepage characters before they even reach the environment variables.

Otherwise, your best bet is URL-rewriting. If a layer above CF can convert that search.cfm/القاهرة to search.cfm/?q=القاهرة then you don't face the same problem, as the QUERY_STRING variable, unlike PATH_INFO, is not specified to be %-decoded, so the %xx bytes remain where a tool at CF's level can see them.

bobince