tags:

views:

181

answers:

5

Hello all. I'm coding a email application that produces messages for sending via SMTP. That means I need to change all lone \n and \r characters into the canonical \r\n sequence we all know and love. Here's the code I've got now:

CRLF = '\r\n'
msg = re.sub(r'(?<!\r)\n', CRLF, msg)
msg = re.sub(r'\r(?!\n)', CRLF, msg)

The problem is it's not very fast. On large messages (around 80k) it takes up nearly 30% of the time to send a message!

Can you do better? I eagerly await your Python gymnastics.

A: 

You could start by pre-compiling the regexes, e.g.

FIXCR = re.compile(r'\r(?!\n)')
FIXLN = re.compile(r'(?<!\r)\n')

Then use FIXCR.sub and FIXLN.sub. Next, you could try to combine the regexes into one, with a | thingy, which should also help.

djc
This isn't noticably faster. I think Python must be caching the compiled regexes already.
samtregar
Also, combining into one regex is considerably slower - I'm not shocked, alternation is often the slowest part of a regex engine.
samtregar
+1  A: 
obelix
I'd be completely shocked if this was faster. Walking a string character by character has to be slower than the regex engine, doesn't it?
samtregar
i think that it is the fact that each time he hits a \r or \n the stirng is going to be expanded that is hurting. the only way to know is to run both approaches and see :)
obelix
see perf related info above. test cases etc were very simple and i used the regex below
obelix
Yup, no shock, it's 10x slower. Thanks for playing though!
samtregar
Try it in Python! Your C# code cannot help me.
samtregar
updated again - the original regex's that you provided are infact faster when there are fewer \r's and \n's.Don't have python here :(
obelix
Python is free, you can install it anywhere you want!
samtregar
A: 

Something like this? Compile your regex.

CRLF = '\r\n'
cr_or_lf_regex = re.compile(r'(?:(?<!\r)\n)|(?:\r(?!\n))')

Then, when you want to replace stuff use this:

cr_or_lf_regex.sub(CRLF, msg)

EDIT: Since the above is actually slower, let me take another stab at it.

last_chr = ''

def fix_crlf(input_chr):
    global last_chr
    if input_chr != '\r' and input_chr != '\n' and last_chr != '\r':
        result = input_chr
    else:
        if last_chr == '\r' and input_chr == '\n': result = '\r\n'
        elif last_chr != '\r' and input_chr == '\n': result = '\r\n'
        elif last_chr == '\r' and input_chr != '\n': result = '\r\n%s' % input_chr
        else: result = ''

    last_chr = input_chr
    return result

fixed_msg = ''.join([fix_crlf(c) for c in msg])
ntownsend
Sadly, it's slower!
samtregar
Sooo... Did you time your new solution? I tried something like that and it was 10x slower! Just wondering if I did something wrong in my implementation.
samtregar
Haha! Why didn't you say so? Yeah, it's like 70x slower than the solution you accepted; I just timed it now.I'm going to say I was teaching people a lesson about blindly accepting code off the web. Also, to never search-and-replace like this in Python.
ntownsend
+2  A: 

This regex helped:

re.sub(r'\r\n|\r|\n', '\r\n', msg)

But this code ended up winning:

msg.replace('\r\n','\n').replace('\r','\n').replace('\n','\r\n')

The original regexes took .6s to convert /usr/share/dict/words from \n to \r\n, the new regex took .3s, and the replace()s took .08s.

Sheesh
So awesome - thanks!
samtregar
+1  A: 

Replace them on the fly as you're writing the string to wherever it's going. If you use a regex or anything else you'll be making two passes: one to replace the characters and then one to write it. Deriving a new Stream class and wrapping it around whatever you're writing to is pretty effective; that's the way we do it with System.Net.Mail and that means I can use the same stream encoder for writing to both files and network streams. I'd have to see some of your code in order to give you a really good way to do this though. Also, keep in mind that the actual replacement won't really be any faster, however the total execution time would be reduced since you're only making one pass instead of two (assuming you actually are writing the output of the email somewhere).

Jeff Tucker
That won't work in this case - the message is being handed off to a third-party client library for the actual send. It's a reasonable idea in general though.
samtregar