I'm writing a Telnet client of sorts in C# and part of what I have to parse are ANSI/VT100 escape sequences, specifically, just those used for colour and formatting (detailed here).

One of my methods finds all the codes and removes them, so I can render the text without any formatting if needed:

    
public static string StripStringFormating(string formattedString)
{
    if (rTest.IsMatch(formattedString))
        return rTest.Replace(formattedString, string.Empty);
    else
        return formattedString;
}

I'm new to regular expressions, and it was suggested that I use this:

static Regex rTest = new Regex(@"\e\[[\d;]+m", RegexOptions.Compiled);

However, this failed if the escape code was incomplete due to an error on the server. So this was suggested instead, although my friend warned it might be slower (this one also matches another terminator, z, that I might come across later):

static Regex rTest = 
              new Regex(@"(\e(\[([\d;]*[mz]?))?)?", RegexOptions.Compiled);

This not only worked, but was in fact faster too, and it reduced the impact on my text rendering. Can someone explain to a regexp newbie why? :)
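To make the difference between the two patterns concrete, here is a minimal sketch that runs both over a complete and a truncated sequence. The class name `StripDemo`, the helper names, and the truncated sample string are made up for illustration; only the two patterns come from the question.

```csharp
using System;
using System.Text.RegularExpressions;

static class StripDemo
{
    // The two patterns quoted in the question.
    static readonly Regex Strict =
        new Regex(@"\e\[[\d;]+m", RegexOptions.Compiled);
    static readonly Regex Tolerant =
        new Regex(@"(\e(\[([\d;]*[mz]?))?)?", RegexOptions.Compiled);

    public static string StripStrict(string s) => Strict.Replace(s, string.Empty);
    public static string StripTolerant(string s) => Tolerant.Replace(s, string.Empty);

    static void Main()
    {
        string complete = "\u001b[1;32mgreen\u001b[0mplain";
        // Hypothetical broken input: the server dropped the trailing "2m".
        string truncated = "\u001b[1;3green";

        Console.WriteLine(StripStrict(complete));               // greenplain
        Console.WriteLine(StripTolerant(complete));             // greenplain
        Console.WriteLine(StripStrict(truncated) == truncated); // True (nothing removed)
        Console.WriteLine(StripTolerant(truncated));            // green
    }
}
```

The strict pattern insists on a closing m, so a truncated code is left in the text. The second pattern makes every part after the escape character optional, so it removes whatever fragment of a code is present.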

+1  A: 

Without doing detailed analysis, I'd guess that it's faster because of the question marks. These allow the regular expression to be "lazy," and stop as soon as they have enough to match, rather than checking if the rest of the input matches.

I'm not entirely happy with this answer though, because this mostly applies to question marks after * or +. If I were more familiar with the input, it might make more sense to me.
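The distinction can be checked directly. In this small sketch (the class name `QuantifierDemo` and the sample input are made up for illustration), a ? directly after an item only makes that item optional and is still greedy; it only means "lazy" when it follows another quantifier such as +, * or ?.

```csharp
using System;
using System.Text.RegularExpressions;

static class QuantifierDemo
{
    // Returns the text of the first match of the pattern in the input.
    public static string First(string input, string pattern) =>
        Regex.Match(input, pattern).Value;

    static void Main()
    {
        Console.WriteLine(First("12345", @"\d+"));            // 12345 (greedy)
        Console.WriteLine(First("12345", @"\d+?"));           // 1     (lazy)
        Console.WriteLine(First("12345", @"(\d\d)?"));        // 12    (optional group, still greedy)
        Console.WriteLine(First("12345", @"\d??").Length);    // 0     (lazy optional prefers nothing)
    }
}
```

By that reading, the question marks in the second pattern from the question are all "optional" markers rather than lazy quantifiers.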

(Also, for the code formatting, you can select all of your code and press Ctrl-K to have it add the four spaces required.)

Ryan Fox
A: 

@Ryan Fox: The input is plain text, with codes indicating when to switch modes for the text that follows, e.g.:

<ESC>[1;32mThis is bright green<ESC>[0mThis is the default colour

I'm also a bit confused myself: even if the expression is 'lazy', I'm doing a find-and-replace operation, so doesn't it have to continue until it gets the 'best' match regardless, so that it knows what to replace?

Nidonocu
A: 

Nope, I'm pretty sure a lazy expression will stop as soon as it can.

Ryan Fox
+3  A: 

The reason why #1 is slower is that [\d;]+ uses a greedy quantifier. Using +? or *? makes the quantifier lazy. See MSDN - Quantifiers for more info.

You may want to try:

"(\e\[(\d{1,2};)*?[mz]?)?"

That may be faster for you.

Jon Works
A: 

@Jon Works: Afraid that regular expression doesn't seem to work with find and replace; it stops too early, and the #;#m part is not matched and gets left behind.
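A small sketch of why it stops early, as far as I can tell: the lazy *? first tries zero repetitions, and since the following [mz]? is optional too, the match already succeeds after the escape character and the bracket. The class name `EarlyStopDemo` and the greedy variant are added here only for comparison; even the greedy form falls short, because the last parameter (32) is not followed by a semicolon.

```csharp
using System;
using System.Text.RegularExpressions;

static class EarlyStopDemo
{
    // The pattern suggested above, with its lazy *? quantifier.
    public const string Lazy = @"(\e\[(\d{1,2};)*?[mz]?)?";
    // Hypothetical variant for comparison: the same pattern with a greedy *.
    public const string Greedy = @"(\e\[(\d{1,2};)*[mz]?)?";

    public static string Strip(string input, string pattern) =>
        Regex.Replace(input, pattern, string.Empty);

    static void Main()
    {
        string input = "\u001b[1;32mgreen";
        Console.WriteLine(Strip(input, Lazy));   // 1;32mgreen (only ESC[ was removed)
        Console.WriteLine(Strip(input, Greedy)); // 32mgreen   (ESC[1; was removed)
    }
}
```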

@modesty: I'll try that in future posts.

@Justin Standard: Thanks for fixing my post. Not sure what you did different to me, but it looks fine now at least.

Nidonocu
+2  A: 

Do you really want to run the regexp twice? Without having checked (bad me), I would have thought that this would work just as well:

public static string StripStringFormating(string formattedString)
{
    return rTest.Replace(formattedString, string.Empty);
}

If it does, you should see it run ~twice as fast...
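Here is a minimal sketch comparing the two variants, assuming the second pattern from the question is the one in use. The class and method names are made up for illustration. It also shows a side effect worth knowing: because every part of that pattern is optional, it can match the empty string, so IsMatch returns true even for text with no escape codes at all and the check never skips anything.

```csharp
using System;
using System.Text.RegularExpressions;

static class SinglePassDemo
{
    // Second pattern from the question; every part of it is optional.
    static readonly Regex rTest =
        new Regex(@"(\e(\[([\d;]*[mz]?))?)?", RegexOptions.Compiled);

    // Original shape: check first, then replace (two passes over the text).
    public static string StripWithCheck(string s) =>
        rTest.IsMatch(s) ? rTest.Replace(s, string.Empty) : s;

    // Suggested shape: replace only (one pass over the text).
    public static string StripDirect(string s) => rTest.Replace(s, string.Empty);

    // True, because the pattern matches the empty string at position 0.
    public static bool MatchesPlainText() => rTest.IsMatch("no escape codes here");

    static void Main()
    {
        string input = "\u001b[1;32mgreen\u001b[0mplain";
        Console.WriteLine(StripWithCheck(input) == StripDirect(input)); // True
        Console.WriteLine(MatchesPlainText());                          // True
    }
}
```

Whether the single call is really about twice as fast would need measuring on real input; the sketch only shows that both forms produce the same result.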

Oskar
Thinking about it now, that does make sense: running a regexp replace on a line with no matches gives the same result as running a check first to see if it matches at all.
Nidonocu
+1  A: 

I'm not sure if this will help with what you are working on, but long ago I wrote a regular expression to parse ANSI graphic files.

(?s)(?:\e\[(?:(\d+);?)*([A-Za-z])(.*?))(?=\e\[|\z)

It will return each code and the text associated with it. I will try to provide an example in a few minutes.

Input string:

<ESC>[1;32mThis is bright green.<ESC>[0m This is the default color.

Results:

[[1, 32], m, This is bright green.]
[[0], m,  This is the default color.]
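
As a rough sketch of how those results could be pulled out in C#: .NET keeps every value a repeated group captured, so the numeric parameters are read from the Captures collection of group 1. The class name `AnsiParseDemo`, the `Parse` helper, and the tuple shape are made up for illustration; only the pattern is the one given above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class AnsiParseDemo
{
    // The pattern from this answer: parameters, command letter, following text.
    static readonly Regex Ansi = new Regex(
        @"(?s)(?:\e\[(?:(\d+);?)*([A-Za-z])(.*?))(?=\e\[|\z)");

    public static List<(string[] Args, string Command, string Text)> Parse(string input)
    {
        var result = new List<(string[] Args, string Command, string Text)>();
        foreach (Match m in Ansi.Matches(input))
        {
            // Group 1 repeats, so collect all of its captures, not just the last.
            string[] args = m.Groups[1].Captures
                .Cast<Capture>()
                .Select(c => c.Value)
                .ToArray();
            result.Add((args, m.Groups[2].Value, m.Groups[3].Value));
        }
        return result;
    }

    static void Main()
    {
        string input =
            "\u001b[1;32mThis is bright green.\u001b[0m This is the default color.";
        foreach (var (args, command, text) in Parse(input))
            Console.WriteLine($"[[{string.Join(", ", args)}], {command}, {text}]");
    }
}
```

Note that the text of the second entry keeps its leading space, since everything after the command letter belongs to the text group.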
lordscarlet
Thanks for this reply, I'll keep this expression on hand when I no doubt go back and review the code later for possible improvements. :) As I've discovered, 'larger' regexps tend to be faster than smaller ones.
Nidonocu
I am also interested in anything you're doing with ANSI codes in .NET. I am currently redoing my site in rails rather than .NET, but I am always curious to see how people are able to leverage .NET for interpreting ANSI.
lordscarlet