I have several documents I need to convert from ISO-8859-1 to UTF-8 (without the BOM of course). This is the issue though. I have so many of these documents (it is actually a mix of documents, some UTF-8 and some ISO-8859-1) that I need an automated way of converting them. Unfortunately I only have ActivePerl installed and don't know much about encoding in that language. I may be able to install PHP, but I am not sure as this is not my personal computer.

Just so you know, I use SciTE or Notepad++, but neither converts correctly. For example, if I open a document in Czech that contains the character "ž" and use the "Convert to UTF-8" option in Notepad++, it turns it into an unreadable character.

There is a way I CAN convert them, but it is tedious. If I open a document with the special characters, copy its contents to the Windows clipboard, then paste it into a UTF-8 document and save it, it comes out okay. But opening every file and copying/pasting into a new document is too tedious for the number of documents I have.

Any ideas? Thanks!!!

+1  A: 

I'm not sure if this is a valid answer to your particular question, but have you looked at the GNU iconv tool? It's fairly generally available.
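For a single file the invocation is a one-liner. A minimal sketch (the file names here are placeholders; the printf line just fabricates some ISO-8859-1 input to demonstrate with):

```shell
# Fabricate a small ISO-8859-1 file: "café" with é as the single byte 0xE9.
printf 'caf\351\n' > input.txt

# Convert it to UTF-8. iconv writes no BOM, which is what you want.
iconv -f ISO-8859-1 -t UTF-8 input.txt > output.txt

# output.txt now holds "caf" followed by the two-byte UTF-8 sequence
# 0xC3 0xA9 for "é".
```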

AKX
+1  A: 

If you have access to Cygwin, or can download a couple of common *nix tools (you'll need bash, grep, iconv and file, all of which are available for Windows via, say, GnuWin32), you might be able to write a rather simple shell script that does the job.

The script would approximately look as follows:

for f in *; do
   if file "$f" | grep -q 'ISO-8859'; then
      iconv -f iso-8859-1 -t utf-8 "$f" > "$f.converted"
   else
      echo "Not converting $f"
   fi
done

You'll need to test the steps, though; for example, I'm not sure exactly what "file" would report for an ISO-8859 document.
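Since file's wording varies between versions, one alternative check worth sketching is to use iconv itself: a UTF-8-to-UTF-8 round-trip fails on bytes that are not valid UTF-8, such as a bare 0xE9 from ISO-8859-1. (The file name below is just for the demonstration.)

```shell
# Fabricate a file containing the single byte 0xE9, which is "é" in
# ISO-8859-1 but is not a valid UTF-8 sequence on its own.
printf 'caf\351\n' > sample.txt

# If a UTF-8-to-UTF-8 round-trip succeeds, the file is already UTF-8;
# if iconv reports an illegal input sequence, the file needs converting.
if iconv -f UTF-8 -t UTF-8 sample.txt > /dev/null 2>&1; then
    echo "sample.txt is already UTF-8"
else
    echo "sample.txt needs converting"
fi
```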

KT
+3  A: 

If the character 'ž' is included then the encoding is definitely not ISO-8859-1 ("Latin 1") but is probably CP1252 ("Win Latin 1"). Dealing with a mix of UTF-8, ISO-8859-1 and CP1252 (possibly even in the same file) is exactly what the Encoding::FixLatin Perl module is designed for.

You can install the module from CPAN by running this command:

perl -MCPAN -e "install 'Encoding::FixLatin'"

You could then write a short Perl script that uses the Encoding::FixLatin module, but there's an even easier way. The module comes with a command called fix_latin which takes mixed encoding on standard input and writes UTF-8 on standard output. So you could use a command line like this to convert one file:

fix_latin <input-file.txt >output-file.txt

If you're running Windows then the fix_latin command might not be in your path, and it might not have been run through pl2bat, in which case you'd need to do something like:

perl C:\perl\bin\fix_latin.pl <input-file.txt >output-file.txt

The exact paths and filenames would need to be adjusted for your system.

To run fix_latin across a whole batch of files would be trivial on a Linux system, but on Windows you'd probably need to use PowerShell or similar.
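Under a Cygwin/GnuWin32 bash shell the batch run is the same loop pattern as the script in the earlier answer. The sketch below substitutes iconv as the per-file filter so it is self-contained; `fix_latin < "$f" > "converted/$f"` drops in the same way once the command is on your path.

```shell
# Write converted copies into a separate directory rather than
# overwriting the originals.
mkdir -p converted

# Fabricate one demo input file: "café" in ISO-8859-1 (é = 0xE9).
printf 'caf\351\n' > demo.txt

# Convert every .txt file; swap iconv for fix_latin to handle
# mixed-encoding input.
for f in *.txt; do
    iconv -f ISO-8859-1 -t UTF-8 "$f" > "converted/$f"
done
```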

Grant McLean
Thanks a lot, Grant! Your script worked very well with the fix_latin command. I'll figure out a way to run it through multiple files.
tau