tags:

views:

95

answers:

6

As a hobby project, I had like to implement a Morse code encoder and decoder in C++ and Python (both). I was wondering the right data structure I should use for that. Not only is this question related to this specific project, but in general, when one has to make predefined text replacements, what is the best and the fastest way of doing it?

I would avoid re-inventing any data structure if possible (and I think it is). Please note that this is purely a learning exercise and I have always wondered what would be the best way of doing this. I can store the code and the corresponding character in a dictionary perhaps, then iterate over the text and make replacements. Is this the best way of doing this or can I do better?

+1  A: 

In Python, strings are immutable, so probably (depending on what you're doing with the output), you want to create a list of all the simple substitution results. Something like:

MORSE = {'A': '.-', ...}

def morsify(data):
    return [MORSE[c] for c in data if c in MORSE]

You'd need to get corresondingly fancier if you wanted to support different national versions of Morse code etc.

(Edited to deal with the fact that Morse code is apparently not a prefix code.)

llasram
You can't join with an empty string--Morse is not a prefix code--but hopefully that doesn't matter since the questioner is on top of that aspect.
Steve Jessop
@Steve: Morse *should* be a prefix code! According to every source I know, it is.
Konrad Rudolph
@Konrad: "E" is "." "T" is "-". "M" is "--". "O" is "---". That's about as non-prefix as a code can possibly be - *every* symbol other than "E" and "T" has a prefix which is a symbol. Operators leave a pause between letters.
Steve Jessop
Being stupid … [ ](http://example.com)
Konrad Rudolph
+1  A: 

There's no simple optimal structure - for any given fixed mapping, there might be fiendish bit-twiddling optimisations for that precise mapping, that are better or worse on different architectures and different inputs. A map/dictionary should be pretty good all the time, and the code is pretty simple.

My official advice is to stick with that. Lookup performance is very unlikely to be an issue for code like this, since most likely you can easily encode/decode faster than you can input/output.

As it's a learning exercise, and you want to try out different possibilities: for the text -> morse you could use an array rather than a map/dictionary. Perhaps surprisingly, this is difficult to do in C++ and be completely portable. The following assumes that all uppercase letters have char values greater than A, which is not guaranteed by the standard:

std::string encode['Z'-'A'];
encode['A' - 'A'] = ".-";
encode['B' - 'A'] = "-...";
// etc.
encode['Z' - 'A'] = "--..";

If you're willing to assume that your code will only ever run on machines whose basic character set has the letters in a continuous run (true of ASCII, but not EBCDIC), you can tidy it up a bit:

std::string encode[26] = {".-", "-...", /* etc */ "--.."};

To look up the character stored in the variable c:

morse = encode[c - 'A'];

The Python version can assume ASCII (I think), and you'd have to throw in some use of ord.

To cope with anything other than upper case letters, you need either a bigger array (to contain an entry for every possible char value), or else multiple arrays with bounds checks, special case code for punctuation and so on.

Steve Jessop
Interesting strategy. Thanks for the answer.
sukhbir
+1  A: 

The first hit on Google.

Depending on the quantity of text, 'foo'.replace('f', '..-.').replace('o','---') would work.

Unless you're translating thousands of lines of text, you probably won't notice a whole lot of difference in any method that you use - though you can easily use the timeit module to test each different method.

Wayne Werner
For this case, I agree, speed should not matter. But I wanted to know it otherwise, in cases where it did.
sukhbir
+2  A: 

You could use str.translate:

m = {ord('S'): '---', ord('O'): '...'}
print('S O S'.translate(m))

will print:

--- ... ---
Joschua
Python 3.x only?
Nick T
I will read up. Thanks.
sukhbir
No, but you have to made some conversions. In Python 2.6 it's a little bit complicated.
Joschua
+2  A: 
from collections import defaultdict

morsecode = [('a','._'), ('b','_...'), ('c','_._.')]
codedict = defaultdict(lambda:' ')
for k,v in morsecode:
    codedict[k] = v

tomorse = lambda x: ' '.join([codedict[chr] for chr in x])

print tomorse('bab cab')

Gives:

_... ._ _...   _._. ._ _...
Nick T
+1  A: 

On the Python side the string classes translate function is the way to go. On the C++ side I would go with a std::map to hold the character mapping. Then I would probably use std::for_each to do the look up and swap.

stonemetal