I'm on OS X 10.5.5 (though it does not matter much I guess)

I have a set of text files with fancy characters such as double backquotes, ellipses ("…" as a single character), etc.

I need to convert these files to good old plain 7-bit ASCII, preferably without losing character meaning (that is, convert those ellipses to three periods, backquotes to usual "s etc.).

Can you recommend a smart command-line (bash) tool or script that does this?

+1  A: 

iconv should do it, as far as I know. Not 100% certain about how it handles conversions where one input character should/could become several output characters, such as with the ellipsis example ... Something to try!

Update: I did try it, and it seems it doesn't work. It fails, possibly since it doesn't know how to express ellipsis (the test character I used) in a "smaller" encoding. Converting from UTF-8 to UTF-16 went fine. :/ Still, iconv might be worth investigating further.

unwind
I have not found the proper set of options to force iconv to do it. Can you suggest one?
Alexander Gladysh
+1  A: 

Have a look at transliteration tools; I like Unidecode (in Perl), and it's not too hard to port to other languages.
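To make the idea of transliteration concrete, here is a minimal Python 3 sketch. It is not Unidecode and covers far fewer characters: the `PUNCT` table and the `to_ascii` name are made up for illustration, and the fallback simply drops anything that has no ASCII decomposition.

```python
import unicodedata

# Hypothetical hand-written table for punctuation that should expand to
# ASCII sequences; extend it with whatever characters your files contain.
PUNCT = {
    "\u2026": "...",  # horizontal ellipsis
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u2014": "--",   # em dash
    "\u2013": "-",    # en dash
}


def to_ascii(text):
    """Return a 7-bit ASCII approximation of text."""
    # 1. Apply the explicit punctuation mappings.
    text = text.translate(str.maketrans(PUNCT))
    # 2. Decompose accented letters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # 3. Drop whatever still is not ASCII (combining marks, unmapped symbols).
    return decomposed.encode("ascii", "ignore").decode("ascii")
```

Wrapping `to_ascii` in a loop over `sys.stdin` would turn it into a shell filter. Note that step 3 silently loses characters the table does not cover, so a real tool such as Unidecode is the safer choice for arbitrary input.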

+1  A: 

The ELinks web browser will convert Unicode characters to their ASCII equivalents, giving things like "--" for "—" and "..." for "…", etc. There is a Python module, python-elinks, which uses the same conversion table, and it would be trivial to turn it into a shell filter, like this:

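#!/usr/bin/env python
# Python 2 filter: reads UTF-8 on stdin, writes ASCII on stdout.
import elinks  # presumably registers the 'elinks' codec error handler on import
import sys
for line in sys.stdin:
    line = line.decode('utf-8')  # byte string -> unicode
    # characters with no ASCII form go through the 'elinks' error handler
    sys.stdout.write(line.encode('ASCII', 'elinks'))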
jleedev
A: 

There was a question yesterday or the day before about file renaming, and I showed a Perl script rename.pl that would be usable for the task. The problem area is knowing how the odd characters are encoded, and devising the correct sequence of transliterations. I'd probably do it with an adaptation of that script that did all the mappings sequentially. Doing it one character at a time would be unduly fiddly.

Question was: How to rename with prefix/suffix

Jonathan Leffler
A: 

I have used iconv to convert a file from UTF-16LE (little-endian as I found out by trial and error) that was created by TextPad in Windows into ASCII on OSX like this:

cat utf16file.txt | iconv -f UTF-16LE -t ASCII > asciifile.txt

You can also pipe the output through hexdump to view the characters and make sure you're getting the right result. The terminal knows how to interpret UTF-16 and displays it properly, so you can't tell just by doing 'cat' on the file:

cat utf16file.txt | iconv -f UTF-16LE -t ASCII | hexdump -C

This shows the layout with the hex char codes and the ASCII characters to the right-hand side, and you can try different encodings in the -f "from" parameter to figure out what you're dealing with.

Use 'iconv -l' to list the character sets iconv can use on your system.
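If trial and error on the -f value gets tedious, one shortcut is to look for a byte-order mark at the start of the file. The Python 3 sketch below does that; `sniff_bom` is a made-up helper name, and it only helps when the file actually starts with a BOM, which is an assumption about how the editor saved it.

```python
import codecs

# Checked in this order because the UTF-32 LE mark begins with the same
# two bytes as the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Return the encoding implied by a leading BOM in data, or None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

For example, `sniff_bom(open("utf16file.txt", "rb").read(4))` would give a candidate for the "from" encoding; a result of `None` means there is no BOM and you are back to inspecting the hexdump.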

lennyk