I'm trying to view a UTF-8 text file/stream in less, and even if I invoke it like this:

cat file | LESSCHARSET=utf-8 less

the non-ASCII characters don't display correctly. Instead, their byte values appear in hex, highlighted in angle brackets, e.g. <F4>.

Reading the same text in vim with UTF-8 encoding poses no problems, so I think something is wrong with the way I'm invoking less.

My locale output is the following:

LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=

My less version is the one installed by Xcode on Mac OS X Leopard:

$ less --version | sed 's/^/    /'
less 394
Copyright (C) 1984-2005 Mark Nudelman

less comes with NO WARRANTY, to the extent permitted by law.
For information about the terms of redistribution, 
see the file named README in the less distribution.
Homepage: http://www.greenwoodsoftware.com/less

locale -a | grep US | sed 's/^/ /' outputs the following:

en_AU.US-ASCII
en_CA.US-ASCII
en_GB.US-ASCII
en_NZ.US-ASCII
en_US
en_US.ISO8859-1
en_US.ISO8859-15
en_US.US-ASCII
en_US.UTF-8
A:
  1. What does the locale command output? Is it a UTF-8 locale?

  2. Are you sure your terminal is set to display UTF-8? Does echo -e '\xe2\x82\xac' produce the € (euro) sign?

  3. Is the locale that you have set even installed on the system? Is it present in the list that locale -a outputs?

  4. What version of less are you using? (Run less --version to find out.) Very old versions did not support LESSCHARSET at all. That is unlikely to be the problem here, though: I have a Debian "sarge" system with less version 382, and it does not even need LESSCHARSET if the locale is set correctly.
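
The four checks above can be run in one go. This is only a sketch: the function names `is_utf8_locale` and `effective_ctype` are made up for illustration, and `printf` with octal escapes stands in for `echo -e`, which not every shell supports.

```shell
#!/bin/sh
# Sketch of the four checks above. Function names are hypothetical.

# Succeeds if a locale name such as "en_US.UTF-8" names a UTF-8 codeset.
is_utf8_locale() {
    case "$1" in
        *.UTF-8|*.utf8|*.UTF-8@*|*.utf8@*) return 0 ;;
        *) return 1 ;;
    esac
}

# Effective character-type locale: LC_ALL overrides LC_CTYPE,
# which overrides LANG. Empty values count as unset.
effective_ctype() {
    printf '%s\n' "${LC_ALL:-${LC_CTYPE:-${LANG:-}}}"
}

ctype=$(effective_ctype)

# 1. Is the active locale a UTF-8 locale?
if is_utf8_locale "$ctype"; then
    echo "locale: $ctype (UTF-8)"
else
    echo "locale: ${ctype:-unset} (not UTF-8)"
fi

# 2. Can the terminal render UTF-8? This line should end in a euro sign.
printf 'terminal test: \342\202\254\n'

# 3. Is that locale installed? List the candidates and compare by eye,
#    since some systems spell the codeset "utf8" rather than "UTF-8".
if command -v locale >/dev/null 2>&1 && [ -n "$ctype" ]; then
    echo "installed locales matching ${ctype%%.*}:"
    locale -a 2>/dev/null | grep -i -- "^${ctype%%.*}" || echo "  (none found)"
fi

# 4. Which version of less is this?
if command -v less >/dev/null 2>&1; then
    less --version 2>/dev/null | sed -n '1p'
fi
```

If all four look right and `less` still shows hex codes, the file itself is probably not UTF-8.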

Teddy
LANG="en_US.UTF-8" LC_COLLATE="en_US.UTF-8" LC_CTYPE="en_US.UTF-8" LC_MESSAGES="en_US.UTF-8" LC_MONETARY="en_US.UTF-8" LC_NUMERIC="en_US.UTF-8" LC_TIME="en_US.UTF-8" LC_ALL=
dan
Yes, `echo -e '\xe2\x82\xac'` does produce the euro sign.
dan
Thanks for trying to figure this out for me. I answered your questions above.
dan
@dan Just to check, `echo -e '\xe2\x82\xac'` prints the euro sign, but `echo -e '\xe2\x82\xac' | less` prints a box?
Brian Campbell
Actually, `echo -e '\xe2\x82\xac' | less` works correctly and displays a euro sign. This helped me figure out part of the problem: the file I'm testing is actually encoded in latin-1, and I was incorrectly looking at Vim's `encoding` value rather than its `fileencoding` value to determine the file's encoding. Running `LESSCHARSET=latin1 less file` now shows `?` diamonds where the ü character should be. I guess that's as it should be?
dan
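
The `?` diamonds are consistent with a Latin-1 file on a UTF-8 terminal: with `LESSCHARSET=latin1`, less passes the raw byte for ü (0xFC) straight through, and that byte on its own is not valid UTF-8. One way around it, sketched here on the assumption that `iconv` is installed, is to re-encode the file before paging it. The helper name `latin1_to_utf8` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: re-encode Latin-1 input (files or stdin) as UTF-8.
latin1_to_utf8() {
    iconv -f ISO-8859-1 -t UTF-8 "$@"
}

# Demonstration on a sample string. In Latin-1, "ü" is the single byte
# 0xFC (octal 374); after conversion it becomes the two bytes 0xC3 0xBC.
printf 'M\374nchen\n' | latin1_to_utf8 | od -An -tx1

# With a real file, and with the locale left at en_US.UTF-8:
#   latin1_to_utf8 file | less
```

In Vim, `:set fileencoding?` reports the encoding of the file in the buffer, which is the value that matters here; `encoding` only describes how Vim represents text internally.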