Hi.

We have:

>>> str
'exit\r\ndrwxr-xr-x    2 root     root            0 Jan  1  2000 
\x1b[1;34mbin\x1b[0m\r\ndrwxr-xr-x    3 root     root           
0 Jan  1  2000 \x1b[1;34mlib\x1b[0m\r\ndrwxr-xr-x   10 root     
root            0 Jan  1  1970 \x1b[1;34mlocal\x1b[0m\r\ndrwxr-xr-x    
2 root     root            0 Jan  1  2000 \x1b[1;34msbin\x1b[0m\r\ndrwxr-xr-x    
5 root     root            0 Jan  1  2000 \x1b[1;34mshare\x1b[0m\r\n# exit\r\n'

>>> print str
exit
drwxr-xr-x    2 root     root            0 Jan  1  2000 bin
drwxr-xr-x    3 root     root            0 Jan  1  2000 lib
drwxr-xr-x   10 root     root            0 Jan  1  1970 local
drwxr-xr-x    2 root     root            0 Jan  1  2000 sbin
drwxr-xr-x    5 root     root            0 Jan  1  2000 share
# exit

I want to get rid of all the '\xblah[0m' nonsense using regexp. I've tried

re.sub(str, r'(\x.*m)', '')

But that hasn't done the trick. Any ideas?

+3  A: 

You need the following changes:

  • Escape the backslash
  • Switch to non-greedy matching. Otherwise, everything between the first \x and the last m will be removed, which will be a problem when there is more than one occurrence.
  • The order of arguments is incorrect

Result:

re.sub(r'(\\x.*?m)', '', str)
interjay
@interjay: Ah sorry, got it the wrong way round, thanks. However, this does not work. It has no complaints, but all the guff is still there.
tamb
I didn't notice that the backslashes weren't actually backslashes - see Edward's answer, he has it correct.
interjay
+5  A: 

You have a few issues:

  • You're passing arguments to re.sub in the wrong order. It should be:

    re.sub(regexp_pattern, replacement, source_string)

  • The string doesn't contain "\x". That "\x1b" is the escape character, and it's a single character.

  • As interjay pointed out, you want ".*?" rather than ".*", because otherwise it will match everything from the first escape through the last "m".
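
To see the last two points in action, here is a minimal sketch. The sample string is made up (a shortened version of the output in the question), and print is written with parentheses so the snippet also runs on Python 3:

```python
import re

# Hypothetical, shortened sample with two coloured names on one line.
sample = 'a \x1b[1;34mbin\x1b[0m b \x1b[1;34mlib\x1b[0m c'

# In a normal string literal, '\x1b' is ONE character (ESC, code 27).
# In a raw literal it is four characters: backslash, x, 1, b.
esc_len = len('\x1b')
raw_len = len(r'\x1b')

# Greedy: removes everything from the first ESC through the LAST 'm'.
greedy = re.sub('\x1b.*m', '', sample)

# Non-greedy: each match stops at the first 'm' after an ESC.
lazy = re.sub('\x1b.*?m', '', sample)

print(repr(greedy))
print(repr(lazy))
```

The greedy version swallows the names along with the colour codes, while the non-greedy one leaves 'a bin b lib c'.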

The correct call to re.sub is:

print re.sub('\x1b.*?m', '', s)

Alternatively, you could use:

print re.sub('\x1b[^m]*m', '', s)
Edward Loper
@edward-loper: Thanks very much, apologies for getting the params the wrong way round.
tamb
It's an easy mistake to make; it took me a while to get used to how Python regexp arguments are ordered. The basic rule is that the first argument is always the regexp, and the last argument is the string to operate on. (Except for optional arguments like flags or counts, which come at the end, after the string to operate on.)
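
A quick sketch of that rule, using a made-up text string. The pattern always comes first, the string to operate on comes after it (and after the replacement, for re.sub), and optional extras such as count go last, here passed by keyword:

```python
import re

# Hypothetical input, only for illustration.
text = 'one 1 two 22'

first = re.search(r'\d+', text).group()           # pattern, string
every = re.findall(r'\d+', text)                  # pattern, string
pieces = re.split(r'\s+', text)                   # pattern, string
masked = re.sub(r'\d+', '#', text)                # pattern, replacement, string
masked_once = re.sub(r'\d+', '#', text, count=1)  # optional count comes last

print(masked)
print(masked_once)
```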
Edward Loper
A: 

These are ANSI terminal codes. They're signalled by an ESC (byte 27, seen in Python as \x1B) followed by [, then some ;-separated parameters and finally a letter to specify which command it is. (m is a colour change.)

The parameters are usually numbers so for this simple case you could get rid of them with:

ansisequence= re.compile(r'\x1B\[[^A-Za-z]*[A-Za-z]')
ansisequence.sub('', string)
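
As a rough illustration of why matching on the final letter helps, here is a sketch with a made-up sample that mixes colour codes with a clear-screen (ESC[2J) and an erase-line (ESC[K) sequence. The names ansi_sequence and sample are only for this example:

```python
import re

# Same pattern as above: ESC, '[', any non-letters, then one letter.
ansi_sequence = re.compile(r'\x1B\[[^A-Za-z]*[A-Za-z]')

# Hypothetical sample: clear screen, a coloured name, then erase-line.
sample = '\x1b[2J\x1b[1;34mbin\x1b[0m\r\n\x1b[Kdone'

# Strips every sequence, whatever letter ends it.
cleaned = ansi_sequence.sub('', sample)

# For comparison: a pattern that only looks for a closing 'm'
# leaves the trailing ESC[K sequence in place.
m_only = re.sub('\x1b.*?m', '', sample)

print(repr(cleaned))
print(repr(m_only))
```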

Technically for some (non-colour-related) control codes they could be general strings, which makes the parsing annoying. It's rare you'd meet these, but if you did I guess you'd have to use something complicated like:

\x1B\[((\d+|"[^"]*")(;(\d+|"[^"]*"))*)?[A-Za-z]
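
A small sketch of the difference, assuming a made-up sequence that carries a quoted-string parameter (the sample text and the variable names are hypothetical):

```python
import re

# Simple pattern: stops at the first letter, even inside a quoted string.
simple = re.compile(r'\x1B\[[^A-Za-z]*[A-Za-z]')

# Extended pattern from above: parameters are numbers or "quoted strings".
extended = re.compile(r'\x1B\[((\d+|"[^"]*")(;(\d+|"[^"]*"))*)?[A-Za-z]')

# Hypothetical sample: one sequence with a string parameter, then colours.
sample = 'a\x1b[0;68;"dir";13pb\x1b[1;34mc\x1b[0m'

simple_result = simple.sub('', sample)
extended_result = extended.sub('', sample)

print(repr(simple_result))    # leftovers from the quoted parameter remain
print(repr(extended_result))  # the whole sequence is gone
```

The simple pattern treats the d in "dir" as the command letter and leaves the rest of the sequence behind; the extended one consumes the quoted parameter as a unit.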

Best would be to persuade whatever's generating the string that you're not an ANSI terminal, so it shouldn't include colour codes in its output.

bobince
+1  A: 

Here is a pyparsing solution to your problem, with a general parsing expression for those pesky escape sequences. By transforming the initial string with a suppressed expression, this returns a string stripped of all matches of the expression.

s = \
'exit\r\ndrwxr-xr-x    2 root     root            0 Jan  1  2000 ' \
'\x1b[1;34mbin\x1b[0m\r\ndrwxr-xr-x    3 root     root           ' \
'0 Jan  1  2000 \x1b[1;34mlib\x1b[0m\r\ndrwxr-xr-x   10 root     ' \
'root            0 Jan  1  1970 \x1b[1;34mlocal\x1b[0m\r\ndrwxr-xr-x    ' \
'2 root     root            0 Jan  1  2000 \x1b[1;34msbin\x1b[0m\r\ndrwxr-xr-x    ' \
'5 root     root            0 Jan  1  2000 \x1b[1;34mshare\x1b[0m\r\n# exit\r\n'

from pyparsing import (Literal, Word, nums, Combine, 
    delimitedList, oneOf, alphas, Suppress)

ESC = Literal('\x1b')
integer = Word(nums)
escapeSeq = Combine(ESC + '[' + delimitedList(integer,';') + oneOf(list(alphas)))

s_prime = Suppress(escapeSeq).transformString(s)

print s_prime

This prints your desired output, as stored in s_prime.

Paul McGuire
+1  A: 

Try running ls --color=never -l instead, and you won't get the ANSI escape codes in the first place.

retracile