I have a script that records files with UTF-8-encoded names. However, the script's encoding / environment wasn't set up right, and it just recorded the raw bytes. I now have lots of lines in the file like this:

.../My\ Folders/My\ r\303\266m/...

So the spaces in the filenames are escaped with \ and the UTF-8 bytes are written as octal escapes like \303\266 (which is ö). I want to reverse this encoding. Is there some easy set of bash commands I can chain together to undo it?

I could write a huge number of sed commands, but it would take ages to list all the non-ASCII characters we have. Or I could start parsing it in Python. But I'm hoping there's some trick I can use.

A: 

The built-in 'read' function will handle part of the problem:

$ echo "with\ spaces" | while read r; do echo $r; done
with spaces
William Pursell
That was my first attempt, and it does do the spaces, but it does not do the UTF-8 conversion. For example:

$ echo "with\ spaces \303\266" | while read r ; do echo $r ; done
with spaces 303266
Rory
A: 
NawaMan
+2  A: 

Here's a rough stab at the Unicode characters:

text="/My\ Folders/My\ r\303\266m/"
text="echo \$\'"$(echo $text|sed -e 's|\\|\\\\|g')"\'"
text=$(eval "echo $(eval $text)")
read text < <(echo $text)
echo $text

This makes use of the $'string' quoting feature of Bash.

This outputs "/My Folders/My röm/".
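For readers unfamiliar with that quoting form, here is a minimal sketch of what $'string' does on its own, assuming bash (the variable names are made up for illustration). Inside $'...' bash itself expands backslash escapes, including octal bytes, whereas ordinary single quotes leave them untouched:

```shell
# ANSI-C quoting: bash expands \303\266 into the two bytes of "ö"
# at the moment the string is parsed.
name=$'r\303\266m'
printf '%s\n' "$name"

# Ordinary single quotes: the backslashes and digits stay literal.
raw='r\303\266m'
printf '%s\n' "$raw"
```

The catch is that $'...' only works on text that appears literally in the script, which is why the answer has to go through eval to apply it to the contents of a variable.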

Dennis Williamson
You are my new hero!
NawaMan
oh my god... that's... like bash abuse! :P
Rory
@Rory: Why do you think they call it "bash"?
Dennis Williamson
+1  A: 

It is not clear exactly what kind of escaping is being used. The octal character codes are C-style escapes, but C does not escape spaces. The backslash-space escape is used in the shell, but the shell does not use octal character escapes.

Something close to C-style escaping can be undone using the command printf %b "$escaped" (quote the variable, otherwise the shell splits it at the spaces). The documentation says that octal escapes start with \0, but that does not seem to be required by GNU printf. Another answer mentions read for unescaping shell escapes, although if space is the only escape that is not handled by printf %b, then handling that case with sed would probably be better.
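As a rough sketch of the printf %b plus sed combination described above: this assumes bash, and assumes backslash-space is the only shell-style escape in the data (an escaped backslash followed by a space would be mishandled). The function name unescape_path is hypothetical.

```shell
# Hypothetical helper: undo "\ " and octal escapes such as \303\266
# in a single string passed as $1.
unescape_path() {
    # Inner command: sed turns every backslash-space into a plain space.
    # Outer command: %b expands the remaining backslash escapes (the
    # octal bytes). The argument is quoted so embedded spaces do not
    # split it into several words.
    printf '%b\n' "$(printf '%s\n' "$1" | sed 's/\\ / /g')"
}

unescape_path '/My\ Folders/My\ r\303\266m/'
```

On a UTF-8 terminal this should print /My Folders/My röm/. Because the text is passed as an argument to %b rather than as the format string, a % in a filename needs no special treatment.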

mark4o
The 'encoding' is basically bash character escaping, (for the spaces), but where the encoding ENV isn't set up, so that it puts the raw octal numbers for the UTF-8 characters.
Rory
A: 

In the end I used something like this:

cat file | sed 's/%/%%/g' | while read -r line ; do printf "${line}\n" ; done | sed 's/\\ / /g'

Some of the files had % in them, which is a printf special character, so I had to 'double it up' so that it would be escaped and passed straight through. The -r in read stops read from interpreting the \'s (without it, the octal escapes lose their backslashes). However, that also means read doesn't turn "\ " into " ", so I needed the final sed.
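A possible variant of the same pipeline, sketched under the assumption that the file holds one escaped path per line and that bash is the shell. It passes each line to printf as a %b argument instead of as the format string, so the % doubling is not needed. The function name unescape_file is hypothetical.

```shell
# Hypothetical helper: print an unescaped copy of the file named in $1.
unescape_file() {
    # sed first turns backslash-space into a plain space on every line.
    sed 's/\\ / /g' "$1" |
    # IFS= and -r keep leading whitespace and the remaining backslashes
    # intact; the -n check also handles a last line with no newline.
    while IFS= read -r line || [ -n "$line" ]; do
        # %b expands the octal escapes; a literal % in the line is
        # ordinary data here, not a format directive.
        printf '%b\n' "$line"
    done
}
```

It would be called as unescape_file file > file.fixed, where both file names are placeholders.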

Rory
If you use `printf "%b\n" "$line"` then it would not interpret `%` in $line.
mark4o