GNU sed version 4.1.5 seems to fail on non-ASCII (international) characters. Here is my input file:

Gras Och Stenar Trad - From Moja to Minneapolis DVD [G2007DVD] 7812 | X
Gras Och Stenar Trad - From Möja to Minneapolis DVD [G2007DVD] 7812 | Y

(Note the umlaut in the second line.)

And when I do

sed 's/.*| //' < in

I would expect to see only the X and Y, as I've asked to remove ALL chars up to the '|' and space beyond it. Instead, I get:

X
Gras Och Stenar Trad - From M? Y

I know I can use tr to remove the international characters first, but is there a way to do this with sed alone?

A: 
[loren@gg ~]$ sed 's/.*| //' ~/a
X
Y
[loren@gg ~]$ uname -rps
FreeBSD 7.0-RELEASE-p3 i386

Weird... which OS is this on?

Loren Segal
A: 

Because sed isn't very well set up for non-ASCII text. However, you can use (almost) the same code in perl and get the result you want:

perl -pe 's/.*\| //' x

A:

I think the error occurs when the encoding of the input file differs from the encoding your environment expects (the one selected by LANG).

Example: the file "in" is UTF-8

$ LANG=de_DE.UTF-8 sed 's/.*| //' < in
X
Y
$ LANG=de_DE.iso88591 sed 's/.*| //' < in
X 
Y

UTF-8 can safely be interpreted as ISO-8859-1: you'll get strange characters, but apart from that everything is fine.
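
To see why this direction is harmless, it can help to look at the bytes. The sketch below is only an illustration, not part of the original answer: it assumes od and iconv are available, the variable names are made up, and the word "Möja" is written with octal escapes so the snippet does not depend on your terminal's encoding.

```shell
# "ö" in UTF-8 is two bytes, 0xC3 0xB6 (octal 303 266).
utf8_hex=$(printf 'M\303\266ja' | od -An -tx1 | tr -d ' \n')
echo "$utf8_hex"

# Decoding the same bytes as ISO-8859-1 succeeds, because every byte value
# is a character there; the "ö" merely shows up as two odd characters.
if printf 'M\303\266ja' | iconv -f ISO-8859-1 -t UTF-8 >/dev/null; then
    latin1_decodes=yes
else
    latin1_decodes=no
fi
echo "$latin1_decodes"
```

Since no byte is ever rejected, sed's "." can always advance through the line, and the ".*| " pattern reaches the pipe character as expected.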

Example: the file "in" is ISO-8859-1

$ LANG=de_DE.UTF-8 sed 's/.*| //' < in
X
Gras Och Stenar Trad - From MöY
$ LANG=de_DE.iso88591 sed 's/.*| //' < in
X 
Y

ISO-8859-1 cannot in general be interpreted as UTF-8: the single byte that encodes ö is not a valid UTF-8 sequence, so decoding the input file fails. The strange match is probably because sed tries to recover rather than fail completely.
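
If the file's encoding cannot be changed, two common workarounds are to run sed in the byte-oriented C locale, or to convert the input with iconv first. The following is a minimal sketch rather than something taken from the tests above; the temporary sample file and the variable names are made up for illustration, and it assumes the input really is ISO-8859-1.

```shell
# Build a two-line sample whose second line holds a Latin-1 "ö" (byte 0xF6, octal 366).
sample=$(mktemp)
printf 'From Moja to Minneapolis 7812 | X\nFrom M\366ja to Minneapolis 7812 | Y\n' > "$sample"

# Check: the sample is not valid UTF-8, so a UTF-8 decoder rejects it.
if iconv -f UTF-8 -t UTF-8 < "$sample" >/dev/null 2>&1; then
    is_utf8=yes
else
    is_utf8=no
fi

# Workaround 1: force the C locale so "." matches any single byte.
result_c=$(LC_ALL=C sed 's/.*| //' < "$sample")

# Workaround 2: convert to UTF-8 first, then run sed in the current locale.
result_iconv=$(iconv -f ISO-8859-1 -t UTF-8 < "$sample" | sed 's/.*| //')

printf '%s\n' "$is_utf8" "$result_c" "$result_iconv"
rm -f "$sample"
```

With LC_ALL=C nothing has to be decoded at all, which makes it the more robust choice for a pattern like this one. The trade-off is that character classes and case handling no longer recognise non-ASCII letters.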

The answer is based on Debian Lenny/Sid and sed 4.1.5.

Torsten Marek
That also works, and allows me to use sed. Thanks!
Dave
A: 

I tested Torsten Marek's answer, and it works for me. I was trying to convert accented characters in a LaTeX file, and this trick works very nicely.

[me@pcbsd documents]$ LANG="pt_BR.ISO-8859-1"
[me@pcbsd documents]$ sed -f converte_acentos.sed document.tex > output.tex

More details:

[me@pcbsd documents]$ uname -srp
FreeBSD 7.2-PRERELEASE i386
[me@pcbsd documents]$ cat converte_acentos.sed
{
    s/á/\\'\{a\}/g
    s/à/\\`\{a\}/g
    s/ã/\\~\{a\}/g
    s/â/\\^\{a\}/g
    s/Á/\\'\{A\}/g
    s/À/\\`\{A\}/g
    s/Ã/\\~\{A\}/g
    s/Â/\\^\{A\}/g
    s/é/\\'\{e\}/g
    s/è/\\`\{e\}/g
    s/ê/\\^\{e\}/g
    s/É/\\'\{E\}/g
    s/È/\\`\{E\}/g
    s/Ê/\\^\{E\}/g
    s/í/\\'\{i\}/g
    s/ì/\\`\{i\}/g
    s/î/\\^\{i\}/g
    s/Í/\\'\{I\}/g
    s/Ì/\\`\{I\}/g
    s/Î/\\^\{I\}/g
    s/ó/\\'\{o\}/g
    s/ò/\\`\{o\}/g
    s/õ/\\~\{o\}/g
    s/ô/\\^\{o\}/g
    s/Ó/\\'\{O\}/g
    s/Ò/\\`\{O\}/g
    s/Õ/\\~\{O\}/g
    s/Ô/\\^\{O\}/g
    s/ú/\\'\{u\}/g
    s/ù/\\`\{u\}/g
    s/û/\\^\{u\}/g
    s/Ú/\\'\{U\}/g
    s/Ù/\\`\{U\}/g
    s/Û/\\^\{U\}/g
    s/ç/\\c\{c\}/g
    s/Ç/\\c\{C\}/g
}
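
A quick way to check a script like this is to run a couple of its rules on a known sample. The sketch below is an illustration under stated assumptions rather than the workflow used above: it copies two rules into a temporary file (the file and variable names are hypothetical), assumes the snippet itself is saved in a single consistent encoding such as UTF-8, and forces LC_ALL=C so the accented letters are compared as plain byte sequences whatever the surrounding locale is.

```shell
script=$(mktemp)

# Two rules copied from converte_acentos.sed; the quoted 'EOF' keeps the
# shell from touching the backslashes and quotes.
cat > "$script" <<'EOF'
s/ã/\\~\{a\}/g
s/é/\\'\{e\}/g
EOF

# LC_ALL=C makes sed compare raw bytes, so this works as long as the script
# and the input use the same encoding.
converted=$(printf 'São José\n' | LC_ALL=C sed -f "$script")
printf '%s\n' "$converted"

rm -f "$script"
```

If the script and the document are in different encodings, the rules silently match nothing, so a small test like this is worth running before converting a whole .tex file.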