Hi,

Is there any tool available that can: 1) detect the encoding of a file (e.g. UTF-8, Big5, etc.), and 2) convert the file to another encoding (e.g. Big5 -> UTF-8)?

Thanks in advance

+1  A: 

You cannot reliably determine a file's encoding, because many byte sequences are valid in more than one encoding. As an extreme example, nearly every byte sequence is valid in every fixed-width 8-bit encoding, such as the ISO-8859 family, so unless you can understand the text you cannot tell those encodings apart. That said, UTF-8 and UTF-16 are usually easy to identify, and the heuristic built into the file tool works impressively well in practice. Once you have identified the encoding, converting is easy. The standard conversion tool on Unix-like systems is iconv, e.g. iconv -f BIG5 -t UTF-8 in.txt > out.txt.
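To illustrate the ambiguity described above, here is a minimal Java sketch (Java because that is the language used elsewhere in this thread). It is not how file or iconv work internally. It only checks whether a byte sequence decodes without errors in each candidate charset, then re-encodes it. The class and method names (EncodingSketch, decodesCleanly, plausibleCharsets, convert) are made up for this example. The sample uses ISO-8859-1 because every JVM must support it. For Big5 you would pass Charset.forName("Big5"), assuming your JVM ships that charset.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class EncodingSketch {

    /** True if the bytes decode without any error in the given charset. */
    static boolean decodesCleanly(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Returns every candidate that decodes cleanly; often more than one. */
    static List<String> plausibleCharsets(byte[] data, String... candidates) {
        List<String> plausible = new ArrayList<>();
        for (String name : candidates) {
            if (Charset.isSupported(name)
                    && decodesCleanly(data, Charset.forName(name))) {
                plausible.add(name);
            }
        }
        return plausible;
    }

    /** Strictly decodes with 'from' and re-encodes with 'to'. */
    static byte[] convert(byte[] data, Charset from, Charset to) {
        try {
            String text = from.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data))
                    .toString();
            return text.getBytes(to);
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("Input is not valid " + from, e);
        }
    }

    public static void main(String[] args) {
        // "cafe" with an e-acute, stored as ISO-8859-1 (one byte per character).
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);

        // Invalid as UTF-8, but valid in several 8-bit encodings at once.
        System.out.println(
                plausibleCharsets(latin1, "UTF-8", "ISO-8859-1", "ISO-8859-15"));

        byte[] utf8 = convert(latin1, StandardCharsets.ISO_8859_1,
                StandardCharsets.UTF_8);
        System.out.println(utf8.length + " bytes after conversion to UTF-8");
    }
}
```

A clean decode only shows that an encoding is possible, not that it is the right one, which is exactly why the 8-bit encodings cannot be told apart this way.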

I think this belongs on SuperUser.

Philipp
A: 

I'm not sure what you are aiming at with your second question, but in my experience FFmpeg is a great tool for recognizing all kinds of media formats. Just feed it the file you would like to identify, with ffmpeg -i file, and it reports back what it knows. For example:

% ffmpeg -i image.png 
FFmpeg version SVN-r21627, Copyright (c) 2000-2010 Fabrice Bellard, et al.
  built on Feb  3 2010 21:28:15 with gcc 4.2.1 (Apple Inc. build 5646) (dot 1)
  configuration: --prefix=/usr/local --enable-gpl --enable-nonfree --enable-shared --enable-postproc --enable-avfilter --enable-avfilter-lavf --enable-pthreads --enable-x11grab --enable-bzlib --enable-libmp3lame --enable-libtheora --enable-libvorbis --enable-libx264 --enable-zlib --enable-libfaac --enable-libfaad
  libavutil     50. 8. 0 / 50. 8. 0
  libavcodec    52.52. 0 / 52.52. 0
  libavformat   52.50. 0 / 52.50. 0
  libavdevice   52. 2. 0 / 52. 2. 0
  libavfilter    1.17. 0 /  1.17. 0
  libswscale     0. 9. 0 /  0. 9. 0
  libpostproc   51. 2. 0 / 51. 2. 0
Input #0, image2, from 'Firefox003.png':
  Duration: 00:00:00.04, start: 0.000000, bitrate: N/A
    Stream #0.0: Video: png, rgb24, 386x319, 25 tbr, 25 tbn, 25 tbc
At least one output file must be specified

As you can see, the file is recognized as an image, and more specifically as a PNG. With some regular expressions you can pick out both. For example, in Java (quoting):

Pattern IMAGE_PATTERN = Pattern.compile("^Input #\\d+?, (image\\d*), from.*?");

And:

Pattern VIDEO_PATTERN = Pattern.compile(".*?\\sVideo: .*?, .*?, ([0-9]+)x([0-9]+).*");
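As a rough sketch of how the two patterns might be applied, the following class runs them line by line over the relevant lines of the output shown above. The class name FfmpegOutputParser and the parse method are made up for this example, and the output is supplied as a string rather than captured from a running ffmpeg process. Newer FFmpeg versions may format these lines differently, so treat the patterns as tied to the output shown here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FfmpegOutputParser {

    // The two patterns quoted above, unchanged.
    static final Pattern IMAGE_PATTERN =
            Pattern.compile("^Input #\\d+?, (image\\d*), from.*?");
    static final Pattern VIDEO_PATTERN =
            Pattern.compile(".*?\\sVideo: .*?, .*?, ([0-9]+)x([0-9]+).*");

    /**
     * Returns {format, width, height} taken from ffmpeg's output.
     * An entry stays null when the corresponding line is not found.
     */
    static String[] parse(String ffmpegOutput) {
        String[] result = new String[3];
        // Match each line on its own, since both patterns describe whole lines.
        for (String line : ffmpegOutput.split("\\R")) {
            Matcher image = IMAGE_PATTERN.matcher(line);
            if (image.matches()) {
                result[0] = image.group(1);
            }
            Matcher video = VIDEO_PATTERN.matcher(line);
            if (video.matches()) {
                result[1] = video.group(1);
                result[2] = video.group(2);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Two lines copied from the ffmpeg output above.
        String sample =
                "Input #0, image2, from 'Firefox003.png':\n"
              + "    Stream #0.0: Video: png, rgb24, 386x319, 25 tbr, 25 tbn, 25 tbc\n";

        String[] info = parse(sample);
        System.out.println("format=" + info[0]
                + ", size=" + info[1] + "x" + info[2]);
    }
}
```

Note that ffmpeg writes this information to standard error, so if you launch it from Java you would read the process's error stream rather than its standard output.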

But of course this is only useful for binary files. I hope it helps.

André van Toly