I have a number of websites that are rendering invalid characters. The pages' meta tags specify UTF-8 encoding. However, a number of pages contain bytes that aren't valid UTF-8, probably because the files were saved with another encoding (such as ANSI). The one I'm concerned about right now is a fancy apostrophe (as in "Bob’s"... sorry if that doesn't show up correctly). The W3C validator reports the character as "\x92", but it won't validate the file because that byte doesn't map to Unicode. And, of course, if I open the file in Notepad++ and change the encoding to UTF-8, the character is replaced by a 92 in a black box.

Here's my question: what's the easiest way to fix this? Do I have to open all the pages and replace that character with a conventional apostrophe? Or is there a quick fix I could add (say, to IIS) that might override or fix the encoding issue? Or do I have to brute-force find/replace? I have hundreds of pages on these websites and I have no idea how many of them I'd have to change, so if anyone knows a way I could either circumvent this problem or fix it quickly I would appreciate it.

+1  A: 

I'm not sure about the encoding part of it myself, but if you wind up having to do it by brute force, you could always write a short program that iterates through all of your web pages, loads each file into memory, runs a regex.replace to fix the problem character, and saves the file back to disk. Obviously not ideal but better than opening each file on your own.
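A rough sketch of what such a program could look like, in Python (my choice of language, not something from the question). The file patterns and the names `fix_file` and `fix_tree` are made up for illustration. It assumes the only problem byte is 0x92, and it skips any file that already decodes as valid UTF-8, since 0x92 can legitimately appear inside a multi-byte UTF-8 sequence. Run it on a copy or a source-control checkout first.

```python
from pathlib import Path

BAD = b"\x92"   # the stray Windows "fancy apostrophe" byte from the question
GOOD = b"'"     # plain ASCII apostrophe


def fix_file(path: Path) -> bool:
    """Replace BAD with GOOD in one file. Return True if the file was changed."""
    data = path.read_bytes()
    try:
        data.decode("utf-8")
        return False  # already valid UTF-8, so leave it untouched
    except UnicodeDecodeError:
        pass
    fixed = data.replace(BAD, GOOD)
    if fixed == data:
        return False  # invalid for some other reason; not handled here
    path.write_bytes(fixed)
    return True


def fix_tree(root, patterns=("*.htm", "*.html", "*.asp")):
    """Walk root recursively and fix every file matching the (assumed) patterns."""
    changed = []
    for pattern in patterns:
        for path in sorted(Path(root).rglob(pattern)):
            if fix_file(path):
                changed.append(path)
    return changed
```

Calling `fix_tree("path/to/site")` returns the list of files it rewrote, which makes it easy to review the changes before checking them in.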

Good Luck

DJ Quimby
Good idea. The pages themselves are currently in source control (besides being on the live server), but a program or script that automates this fix may be the easiest solution.
Andy
A: 

All special characters should be HTML encoded, e.g. a copyright symbol should be in your HTML as

&copy;

HTML entity list:

http://www.w3schools.com/HTML/html_entities.asp

How you implement this depends largely on how you are creating the code in the first place, but something like ASP.Net will have server-side functions like:

Server.HTMLEncode("string with special chars")
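For content that already exists as text, the same idea can be sketched in Python (not ASP.Net; the function name `to_entities` is hypothetical). It swaps each non-ASCII character for a named entity where Python's `html.entities` table has one, and for a numeric reference otherwise. Unlike a general HTML-encode function, it deliberately leaves `<`, `>` and `&` alone, on the assumption that the input is existing markup rather than raw text.

```python
from html.entities import codepoint2name


def to_entities(text: str) -> str:
    """Replace non-ASCII characters with HTML entities, leaving markup intact."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)                          # plain ASCII passes through
        elif cp in codepoint2name:
            out.append(f"&{codepoint2name[cp]};")   # named entity, e.g. &copy;
        else:
            out.append(f"&#{cp};")                  # numeric fallback
    return "".join(out)
```

The input has to be a correctly decoded string for this to work, so bytes saved in a Windows encoding would need to be decoded with that encoding first.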
TimS
I know they SHOULD be, but they aren't. I need to fix this for some existing content.
Andy
+1  A: 

Are you serving the pages as straight HTML, or do you have another script serving the content? If you have a script which is serving the content, that script could just look for any instance of \x92 and replace it with an apostrophe. In PHP this would be a simple str_replace().
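Here is a variation on that idea, sketched in Python rather than PHP. Instead of a blind byte replacement (which could also hit 0x92 bytes that are part of a valid multi-byte UTF-8 character), it decodes the content as UTF-8 and falls back to Windows-1252 only for the bytes that fail. That the stray bytes are Windows-1252 is an assumption, and the handler name `cp1252_fallback` and the function `to_utf8` are made up for this example. Note that it keeps the curly apostrophe as a proper UTF-8 character rather than turning it into a plain one.

```python
import codecs


def _cp1252_fallback(err):
    """Decode only the bytes that are invalid UTF-8 as Windows-1252 (assumed)."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # errors="replace" covers the few byte values Windows-1252 leaves undefined
    return bad.decode("cp1252", errors="replace"), err.end


codecs.register_error("cp1252_fallback", _cp1252_fallback)


def to_utf8(body: bytes) -> bytes:
    """Return body as valid UTF-8, repairing stray Windows-1252 bytes."""
    return body.decode("utf-8", errors="cp1252_fallback").encode("utf-8")
```

Content that is already valid UTF-8 passes through unchanged, so it should be safe to run over every response, at the cost of a decode and encode per page.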

If you're serving straight HTML then you'll have to actually modify the files themselves. This can be automated, however (and probably should be if you have hundreds of files), depending on what tools you have available and which operating system you're on. Since you said you're using Notepad++, I suppose it's safe to assume you're on MS Windows (therefore no fun Unix commands to speed things up).

It may be possible to create a batch script which can do this, however. There are very simple ASCII text editing tools built into Command Prompt. If that's not possible, then it's certainly possible to write a C or C++ program to do this, if you have a compiler on your system and moderate knowledge of C. If you have the former and not the latter, ask and I'll whip up some source for you.

steven_desu
Yes, this is a Windows environment. Most of them are static HTML in ASP pages, unfortunately. I'll see if that's a possibility.
Andy