Hi all,

I am trying to read a .txt file, with Hebrew column names, but without success.

I uploaded an example file to: http://www.talgalili.com/files/aa.txt

And am trying the command:

read.table("http://www.talgalili.com/files/aa.txt", header = T, sep = "\t")

This returns:

  X.....ª X...ª...... X...œ....
1      12          97         6
2     123         354        44
3       6           1         3

Instead of:

אחת שתיים   שלוש
12  97  6
123 354 44
6   1   3

My output for:

l10n_info()

Is:

$MBCS
[1] FALSE

$`UTF-8`
[1] FALSE

$`Latin-1`
[1] TRUE

$codepage
[1] 1252

And for:

Sys.getlocale()

Is:

[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"

Can you suggest what I should change so that the file loads correctly?

Update: Trying to use:

read.table("http://www.talgalili.com/files/aa.txt",fileEncoding ="iso8859-8")

Has resulted in:

 V1
1  ?
Warning messages:
1: In read.table("http://www.talgalili.com/files/aa.txt", fileEncoding = "iso8859-8") :
  invalid input found on input connection 'http://www.talgalili.com/files/aa.txt'
2: In read.table("http://www.talgalili.com/files/aa.txt", fileEncoding = "iso8859-8") :
  incomplete final line found by readTableHeader on 'http://www.talgalili.com/files/aa.txt'

While also trying this:

Sys.setlocale("LC_ALL", "en_US.UTF-8")

Or this:

Sys.setlocale("LC_ALL", "en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8")

Gets me this:

[1] ""
Warning message:
In Sys.setlocale("LC_ALL", "en_US.UTF-8") :
  OS reports request to set locale to "en_US.UTF-8" cannot be honored

Finally, here is the output of sessionInfo():

R version 2.10.1 (2009-12-14) 
i386-pc-mingw32 

locale:
[1] LC_COLLATE=English_United States.1255  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] tools_2.10.1

Any suggestion or clarification will be appreciated.

Best, Tal

+3  A: 

I would try passing the fileEncoding parameter to read.table with a value of iso8859-8.

Use iconvlist() to get an alphabetical list of the supported encodings. From what I have read, Hebrew is part 8 of ISO 8859.
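
A minimal sketch of how one might narrow down the value to pass to fileEncoding: list the Hebrew-related names that iconvlist() reports on your machine, then compare how the same Hebrew word looks as bytes in UTF-8 and in ISO-8859-8. The sample word and the variable names are illustrative assumptions, not taken from the file in the question, and the names returned by iconvlist() vary by platform.

```r
# Encoding names known to this R build that look Hebrew-related.
# The result differs between platforms, so no output is shown here.
hebrew_like <- grep("8859-8|1255", iconvlist(), value = TRUE, ignore.case = TRUE)
print(hebrew_like)

# A sample Hebrew word (illustrative), written with \u escapes so that
# the script itself stays plain ASCII.
word <- "\u05d0\u05d7\u05ea"

# Bytes of the word in UTF-8: two bytes per Hebrew letter.
utf8_bytes <- charToRaw(enc2utf8(word))

# Bytes of the same word in ISO-8859-8: one byte per Hebrew letter.
iso_bytes <- charToRaw(iconv(word, from = "UTF-8", to = "ISO-8859-8"))

print(utf8_bytes)
print(iso_bytes)

# Decoding the ISO-8859-8 bytes with the matching encoding recovers the word.
round_trip <- iconv(rawToChar(iso_bytes), from = "ISO-8859-8", to = "UTF-8")
```

The two byte sequences are different, so the value given to fileEncoding has to match how the file was actually saved. If a file is really UTF-8, asking R to decode it as ISO-8859-8 will not give the right text.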

gd047
The file also reads fine for me in UTF-8, so that might be an option as well. File encodings in R have always been trial and error for me. My Sys.getlocale(): [1] "en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8"
Kevin
Same here, it works. I have Sys.getlocale() en_US.UTF-8 --- $MBCS [1] TRUE --- $`UTF-8` [1] TRUE --- $`Latin-1` [1] FALSE
Thrawn
Dear gd047, Kevin and Thrawn: I tried gd047's solution and also tried changing to your configuration, but neither worked. I updated the main question to reflect that. Any suggestions would be most welcome. Thanks!
Tal Galili
Two more thoughts: (1) What does sessionInfo() report? What platform? (2) Can you read without the header line and then add the labels afterwards? E.g., read.table(...)[-1, ]; then add column names with names(). I'm wondering if the problem is not with reading the table but handling Hebrew characters in R. Is there a Hebrew locale?
Kevin
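
A rough sketch of the "read the data without the header, then add the labels" idea from the comment above. It uses skip = 1 instead of [-1, ] so that the numeric columns are not read as text. The temporary file, its contents, and the variable names are made up for illustration, and the sketch assumes the file is saved as UTF-8.

```r
# Build a small tab-separated sample file with a Hebrew header line.
# The strings use \u escapes, so their bytes are UTF-8 in any locale.
lines <- c(
  "\u05d0\u05d7\u05ea\t\u05e9\u05ea\u05d9\u05d9\u05dd\t\u05e9\u05dc\u05d5\u05e9",
  "12\t97\t6",
  "123\t354\t44",
  "6\t1\t3"
)
tmp <- tempfile(fileext = ".txt")
con <- file(tmp, open = "wb")
writeLines(lines, con, useBytes = TRUE)  # write the UTF-8 bytes unchanged
close(con)

# Step 1: read only the data rows, skipping the header line.
dat <- read.table(tmp, header = FALSE, sep = "\t", skip = 1)

# Step 2: read the header line on its own and declare it as UTF-8.
hdr <- readLines(tmp, n = 1, encoding = "UTF-8")
labels <- strsplit(hdr, "\t", fixed = TRUE)[[1]]

# Step 3: attach the Hebrew labels as column names.
names(dat) <- labels
print(dim(dat))
```

Whether the names then print as Hebrew letters or as escapes such as <U+05D0> depends on the locale of the R session; the data themselves are read the same way in both cases.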
Hi Kevin. I added the sessionInfo() output to the question. The problem is indeed caused by the Hebrew, but that is exactly the problem I wish to resolve :)
Tal Galili