I'm writing a bash script that needs to parse HTML that includes special characters such as @!'ó. The script currently runs, but it ignores or trips on these queries because the server returns the characters as decimal Unicode character references like this: `&#39;`. I've figured out how to parse these, convert them to hexadecimal, and load them into Python to convert them back to their symbols, and I am wondering whether bash can do this final conversion natively. A simple example in Python:

print ur"\u0032" ur"\u0033" ur"\u0040"

prints out

23@

Can I achieve the same result in Bash? I've looked into iconv but I don't think it can do what I want, or more probably I just don't know how.

Here's some relevant information:

Python String Literals

Hex to UTF conversion in Python

And here are some examples of expected input-output.

Ludwig van Beethoven - 5th Symphony and 6th Symphony &#39;&#39;Pastoral&#39;&#39; - Boston Symphony Orchestra - Charles Munch

Ludwig van Beethoven - 5th Symphony and 6th Symphony ''Pastoral'' - Boston Symphony Orchestra - Charles Munch

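&#1040;&#1083;&#1080;&#1089;&#1040; (Alisa) - &#1052;&#1099; &#1074;&#1084;&#1077;&#1089;&#1090;&#1077;. &#1061;&#1061; &#1083;&#1077;&#1090; (My vmeste XX let)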

АлисА (Alisa) - Мы вместе. ХХ лет (My vmeste XX let)

+1  A: 

One possible solution is to pipe the text through Python (Python 2 syntax shown), e.g.:

$ function conv() { echo "$*" | python -c 'import re, sys; print re.sub(r"&#(\d+);", lambda x: unichr(int(x.group(1))), sys.stdin.read()).rstrip()' ; }
$ conv '&#1040;&#1083;&#1080;&#1089;&#1040; (Alisa)'
АлисА (Alisa)
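
For comparison, the same decoding can be sketched without starting another interpreter. This is a minimal sketch, not a tested library: `decode_entities` is a hypothetical name, it assumes Bash 3.1 or later (for `printf -v` and `[[ =~ ]]`), it handles only decimal references of the form `&#NNNN;`, and it does no validation of the code point. It builds the UTF-8 bytes by hand so it does not depend on the locale:

```bash
#!/usr/bin/env bash
# Sketch only: decode_entities is a hypothetical helper name.
# Decodes decimal references such as &#1040; into UTF-8, left to right.
decode_entities() {
    local rest=$1 out='' match cp fmt char
    local re='&#([0-9]+);'
    while [[ $rest =~ $re ]]; do
        match=${BASH_REMATCH[0]}
        cp=$(( 10#${BASH_REMATCH[1]} ))  # force base 10 so "039" is not read as octal
        out+=${rest%%"$match"*}          # keep the text before the reference
        rest=${rest#*"$match"}           # drop everything up to and including it
        # Build a printf format made of \xHH escapes holding the UTF-8 bytes.
        if (( cp < 0x80 )); then
            printf -v fmt '\\x%02x' "$cp"
        elif (( cp < 0x800 )); then
            printf -v fmt '\\x%02x\\x%02x' \
                $(( 0xC0 | cp >> 6 )) $(( 0x80 | cp & 0x3F ))
        elif (( cp < 0x10000 )); then
            printf -v fmt '\\x%02x\\x%02x\\x%02x' \
                $(( 0xE0 | cp >> 12 )) $(( 0x80 | cp >> 6 & 0x3F )) \
                $(( 0x80 | cp & 0x3F ))
        else
            printf -v fmt '\\x%02x\\x%02x\\x%02x\\x%02x' \
                $(( 0xF0 | cp >> 18 )) $(( 0x80 | cp >> 12 & 0x3F )) \
                $(( 0x80 | cp >> 6 & 0x3F )) $(( 0x80 | cp & 0x3F ))
        fi
        printf -v char "$fmt"            # turn the escapes into raw bytes
        out+=$char
    done
    printf '%s\n' "$out$rest"
}

decode_entities '&#1040;&#1083;&#1080;&#1089;&#1040; (Alisa) &#39;&#39;Pastoral&#39;&#39;'
# prints: АлисА (Alisa) ''Pastoral''
```

Named references such as `&amp;` and hexadecimal ones such as `&#x27;` are left untouched by this sketch.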
mykhal
if `UnicodeEncodeError` occurs, add `reload(sys); sys.setdefaultencoding("UTF-8");` after `import re, sys;`
mykhal
Thanks! I'm still new to programming, and I wonder whether calling python or other languages causes a considerable use of system resources?
teratomata
@teratomata yes, it is slow. launching e.g. perl is considerably faster than python, but it would still be slower than if you could do it in bash (echo) directly
mykhal
+1  A: 

The printf builtin in Bash (before version 4.2) doesn't support Unicode escapes such as \uXXXX, but the external printf (at least on my GNU-based system) does:

$ /usr/bin/printf "\u0410\u043b\u0438\u0441\u0410"
АлисА

Or use this, which selects the external printf from your PATH in case it's not in /usr/bin:

$ $(type -P printf) "\u0410\u043b\u0438\u0441\u0410"
АлисА
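
Since the server sends decimal values and `\u` expects hexadecimal, the numbers have to be converted first. A minimal sketch of that step, assuming Bash 3.1 or later for `printf -v`; `to_u_escapes` is a hypothetical helper name, and it expects the decimal numbers to have been extracted from the `&#NNNN;` references already:

```bash
#!/usr/bin/env bash
# Sketch only: to_u_escapes is a hypothetical helper name, not a standard tool.
# Turns decimal code points into \uXXXX (or \UXXXXXXXX) escape sequences.
to_u_escapes() {
    local cp esc out=''
    for cp in "$@"; do
        cp=$(( 10#$cp ))                 # force base 10 so "039" is not read as octal
        if (( cp <= 0xFFFF )); then
            printf -v esc '\\u%04x' "$cp"   # four hex digits are enough
        else
            printf -v esc '\\U%08x' "$cp"   # larger code points need \U with eight
        fi
        out+=$esc
    done
    printf '%s\n' "$out"
}

to_u_escapes 1040 1083 1080 1089 1040
# prints: \u0410\u043b\u0438\u0441\u0410
```

The resulting string can then be passed to the external printf as shown above. Some versions of GNU printf may reject `\u` escapes for most code points below U+00A0, so ASCII references such as `&#39;` may need to be handled separately.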
Dennis Williamson
is it really working for you? the values must be converted to hex: `"\u0410\u043b\u0438\u0441\u0410"`
mykhal
@mykhal: Oops, I pasted the wrong thing. Fixed, thanks.
Dennis Williamson
Dennis Williamson: it's much better than sending a string to some python/perl/whatever one-liner
mykhal
Thanks! I suspected it wouldn't be as difficult as I first thought, and now that I'm looking into python it looks fairly attractive to learn.
teratomata
Not to mention someone has already made modules to encode and decode html.
teratomata