views:

6261

answers:

7

I have a Java app which reads CSV files which have been created in Excel (e.g. 2007). Does anyone know what charset MS Excel uses to save these files in?

I would have guessed either:

  • windows-1255 (Cp1255)
  • ISO-8859-1
  • UTF8

but I am unable to decode extended chars (e.g. french accentuated letters) using either of these charset types.

A: 

OOXML files like those that come from Excel 2007 are encoded in UTF-8, according to wikipedia. I don't know about CSV files, but it stands to reason it would use the same format...

rmeador
A: 

I had a similar problem last week. I received a number of CSV files with varying encodings. Before importing into the database I then used the chardet libary to automatically sniff out the correct encoding.

Chardet is a port from Mozillas character detection engine and if the sample size is large enough (one accentuated character will not do) works really well.

pi
A: 

Russian Edition offers CSV, CSV (Macintosh) and CSV (DOS).

When saving in plain CSV, it uses windows-1251.

I just tried to save French word Résumé along with the Russian text, it saved it in HEX like 52 3F 73 75 6D 3F, 3F being the ASCII code for question mark.

When I opened the CSV file, the word, of course, became unreadable (R?sum?)

Quassnoi
+4  A: 

It appears that they are encoded in Cp850

(I determined this using the following extremely scientific method: open the file in OpenOffice, when you do this you get an option to chose the character set, as you change this it re-renders characters in a preview pane making it easy to chose the correct set)

Joel
CVS files could be in any format, depending on what encoding option was specified: (Save Dialog, Tools Button, Web Options Item, Encoding Tab)
Triynko
I agree with @Triynko. The default encoding will be "ANSI" -- `cp1252` in western Europe and the Americas, `cp1250` in Eastern Europe, `cp1251` in Russia etc, and so forth. `cp850` is a preposterous answer.
John Machin
A: 

cp1250 is used extensively in Microsoft Office documents, including Word and Excel 2003.

http://en.wikipedia.org/wiki/Windows-1250

A simple way to confirm this would be to:

  1. Create a spreadsheet with higher order characters, e.g. "Veszprém" in one of the cells;
  2. Use your favourite scripting language to parse and decode the spreadsheet;
  3. Look at what your script produces when you print out the decoded data.

Example perl script:

#!perl

use strict;

use Spreadsheet::ParseExcel::Simple;
use Encode qw( decode );

my $file    = "my_spreadsheet.xls";

my $xls     = Spreadsheet::ParseExcel::Simple->read( $file );
my $sheet   = [ $xls->sheets ]->[0];

while ($sheet->has_data) {

    my @data = $sheet->next_row;

    for my $datum ( @data ) {
        print decode( 'cp1250', $datum );
    }

}
robbiebow
"cp1250 is used extensively in Microsoft Office documents" ... in YOUR neck of the woods. cp125n is used extensively world-wide, for 0 <= n <= 8.
John Machin
+2  A: 

From memory, Excel uses the machine-specific ANSI encoding. So this would be Windows-1252 for a EN-US installation, 1251 for Russian, etc.

devstuff
True but Excel 2007 allows the user (if they can find the Tools button!) to choose from a long list; "ANSI" is the default.
John Machin
+1  A: 

CVS files could be in any format, depending on what encoding option was specified during the export from Excel: (Save Dialog, Tools Button, Web Options Item, Encoding Tab)

Triynko
and the default encoding is "ANSI", usually one of cp1250 to cp1258.
John Machin