I am using Perl to read UTF-16LE files in Windows 7.

If I read in an ASCII file with the following code, each "\r\n" in the file is converted to "\n" in memory:

open CUR_FILE, "<", $asciiFile; 

If I read in a UTF-16LE (Windows code page 1200) file with the following code, "\r\n" is left unchanged:

open CUR_FILE, "<:encoding(UTF-16LE)", $utf16leFile;

This inconsistency causes problems when I try to match lines that include line breaks with a regexp.

Update:
For each line of a UTF-16LE file:

$line =~ /(.*)$/

Then the string captured in $1 includes a "\r" at the end.

A: 

That is Windows performing that magic for you. If you specify a UTF encoding, this is the equivalent of opening the file in binary mode rather than text mode.

Newer versions of Perl (5.10 and later) have \R, a generic newline that matches both \r\n and \n, as well as \v, which matches any single vertical whitespace character known to the OS or Unicode (\n, \r, form feed, vertical tab, the Unicode line and paragraph separators, etc.).

Does your regex logic allow using \R instead of \n?
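
As a minimal sketch of that idea (the sample string and variable names are made up for illustration, and \R is assumed to be available, i.e. Perl 5.10 or later), the line ending can be stripped with \R before the pattern from the question is applied:

```perl
use strict;
use warnings;
use 5.010;    # \R is only available from Perl 5.10 on

# Hypothetical sample line, as it would arrive from a UTF-16LE file
# that was read without CRLF translation.
my $line = "some text\r\n";

# The pattern from the question: "." matches "\r", so the capture keeps it.
my ($with_cr) = $line =~ /(.*)$/;

# Remove whatever line ending is present (\r\n, \n, or \r), then match again.
(my $chomped = $line) =~ s/\R\z//;
my ($clean) = $chomped =~ /(.*)$/;

printf "before: %d chars, after: %d chars\n", length $with_cr, length $clean;
```

Because \R treats "\r\n" as a single unit, the same substitution also works on lines that end in a bare "\n", so one code path can handle both kinds of file.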

drewk
I just use $ as an anchor for the end of the line.
lz_prgmr
+1  A: 

What version of Perl are you using? UTF-16 and CRLF handling did not mix properly before 5.8.9 (Unicode changes in 5.8.9). I'm not sure about 5.10.0, but it works in 5.10.1 and 5.8.9. You might need to use "<:encoding(UTF-16LE):crlf" when opening the file.
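
A rough sketch of how that layer string might be tried out follows. The file name is hypothetical, and the leading `:raw` is an addition of this sketch rather than part of the suggestion above. It assumes that on Windows a default `:crlf` layer sits underneath `:encoding` and has to be removed first, so that the translation is applied to the decoded characters and not to the raw bytes:

```perl
use strict;
use warnings;

# Hypothetical file name, used only for this sketch.
my $file = 'utf16le_sample.txt';

# Write a small UTF-16LE file with CRLF line endings, byte for byte.
open my $out, '>:raw', $file or die "Cannot write $file: $!";
print {$out} pack('v*', map { ord } split //, "first\r\nsecond\r\n");
close $out or die "Cannot close $file: $!";

# :raw drops any default :crlf layer, :encoding decodes UTF-16LE,
# and the final :crlf translates CRLF on the decoded characters.
open my $in, '<:raw:encoding(UTF-16LE):crlf', $file
    or die "Cannot read $file: $!";
my @lines = <$in>;
close $in;
unlink $file;

# Apply the pattern from the question to each line that was read.
for my $l (@lines) {
    my ($text) = $l =~ /(.*)$/;
    print "[$text]\n";
}
```

If the captured text still ends in "\r" on a given Perl build, the layers are not cooperating there, and stripping the line ending after reading is the safer fallback.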

cjm
"<:encoding(UTF-16LE):crlf" doesn't work either, even with the 5.10.1 version
lz_prgmr
@cjm It appears broken in my testing on 5.10.1 as well (although admittedly I'm not on Windows, I'm just faking it with `PERLIO=crlf` :)
hobbs
`"<:encoding(UTF-16LE):crlf"` definitely works for me (on Linux) with both 5.8.9 and 5.10.1. I only have 5.8.8 on Windows, and that does not work.
cjm