ansaurus

Question

How do I improve breaking substitution ciphers programmically?

Answer 1

+2 A:

Looking at character pairs makes a lot of sense to me.

Every single letter of the alphabet can be used in valid text, but there are many pairs that are either extremely unlikely or will never happen.

For example, there is no way to get qq using valid English words, as every q must be followed by a u. If you have the same letters repeated in the encrypted text, you can automatically exclude the possibility that they represent q.

The fact that you are removing spaces from the input limits the utility somewhat since combinations that would never exist in a single word e.g. ht can now occur if the h ends one word and the t begins another one. Still, I suspect that these additional data points will enable you to resolve much shorter strings of text.

Also, I would suggest that Romeo and Juliette is only a good basis for statistical data if you intend to analyze writings of the period. There have been some substantial changes to spelling and word usage that may skew the statistics.

Eric J. 2010-02-18 18:35:59

The reason I am using Romeo and Juliet is because it is a large english work which is freely and easily available because it is out of copyright. I am unaware of any recent large english works which are available digitally and copyright free for such use as this.

Sam Phelps 2010-02-18 19:13:19

@Sam: You want Project Gutenberg: http://www.gutenberg.org/wiki/Main_Page which has a huge selection of rather more recent texts. Though still not *very* recent.

dmckee 2010-02-18 19:24:55

Moby Dick might be a good one.

caf 2010-02-22 23:50:08

Answer 2

+1 A:

You might try looking at pairs rather than individual letters. For instance, a t is often followed by an h in English, as is an s. Markov modeling would be useful here.

Joel 2010-02-18 18:37:18

Answer 3

+2 A:

First of all, Romeo and Juliet probably isn't a very good basis to use. Second, yes digraphs are helpful (and so are trigraphs). For a substitution cipher like you're looking at, a good place to start would be the Military Cryptanalysis books by William Friedman.

Jerry Coffin 2010-02-18 18:39:38

Answer 4

+2 A:

Well, I have solved some simple substitution ciphers in my time, so I can speak freely. Removing the spaces from the input string makes it nearly impossible to solve.

While it is true that most English sentences have 'e' in higher frequency, that is not all there is to the process.

The part that makes the activity fun, is the series of trial hypothesis/test hypothesis/accept or reject hypothesis that makes the whole thing an iterative process.

Many sentences contain the words 'of' and 'the'. By looking at your sentence, and assuming that one of the two letter words is of, implies further substitutions that can allow you to make inferences about other words. In short, you need a dictionary of high frequency word, to allow you to make further inferences.

As there could be a large amount of backtracking involved, it may be wise to consider a prolog or erlang implementation as a basis for developing the c++ one.

Best of luck to you. Kindly share your results when done.

EvilTeach 2010-02-18 18:41:13

Answer 5

+2 A:

I'm not sure how constrained this problem is, i.e. how many of the decisions you made are yours to change, but here are some comments:

1) Frequency mapping is not enough to solve a puzzle like this, many frequencies are very close to each other and if you aren't using the same text for frequency source and plaintext, you are almost guaranteed to have a few letters off no matter how long the text. Different materials will have different use patterns.

2) Don't strip the spaces if you can help it. This will allow you to validate your potential solution by checking that some percentage of the words exist in a dictionary you have access to.

3) Look into natural language processing if you really want to get into the language side of this. This book has all you could ever want to know about it.

Edit: I would look into bigraphs and trigraphs first. If you're fairly confident of one or two letters, they can help predict likely candidates for the letters that follow. They're basically probability tables where AB would be the probability of an A being followed by a B. So assuming you have a given letter solved, that can be used to solve the letters next to it, rather than just guessing. For example, if you've got the word "y_u", it's obvious to you that the word is you, but not to the computer. If you've got the letters N, C, and O left, bigraphs will tell you that YN and YC are very uncommon where as YO is much more likely, so even if your text has unusual letter frequencies (which is easy when it's short) you still have a fairly accurate system for solving for unknowns. You can hunt around for a compiled dataset, or do your own analysis, but make sure to use a lot of varied text, a lot of Shakespeare is not the same as half of Shakespeare and half journal articles.

David Kanarek 2010-02-18 18:56:21

What methods other than frequency mapping would be helpful? I assume that after checking to see if a percentage of words match dictionary words, and they don't, the substitutions should be adjusted with those characters with close frequencies until such a requirement is met?

Sam Phelps 2010-02-18 19:00:20

I find it quite hard to think about how to code the program to know how 'confident' it is about certain matches. Obviously looking at it manually allows me to realise what is right and wrong but I'm unsure where to even start thinking about some kind of 'match %' or something similar.For example, I would assume that the most common letters, E T O I etc are most often correct due to the greater difference in occurance generally, but this is sometimes not the case.

Sam Phelps 2010-02-19 13:31:52

As a starting point, when you've solved for every letter, you should have at least 50% word match with a decent English dictionary. If it's any less, go back and try some less common digraphs (e.g. if you picked FO over FA, change it and see how it goes).

David Kanarek 2010-02-19 17:17:17

Answer 6

+2 A:

Single letter word are a big hint (generally only "A" and "I", rarely "O". Casual language allows "K"). There are also a finite set of two and three letter words. No help if spaces have been stripped.
Pairs are much more diagnostic than you would think. For instance: some letters never appear doubled in English (though this is not absolute if the spaces have been stripped or if foreign vocabulary is allowed), and others are common double; also some heterogeneous pairs are very frequent.

As a general rule, no one analysis will provide certainty. You need to assign each cipher letter a set of possible translation with associated probabilities. And combine several tests until the probabilities become very significant.

You may be able to determine when you've gotten close by checking the Shannon Entropy.

dmckee 2010-02-18 19:07:20

I was just thinking about something like having each ciper character with a set of possible plaintext characters for something like this. I think I will follow up this idea.

Sam Phelps 2010-02-18 19:14:57

Answer 7

+2 A:

Not a complete answer, but maybe a helpful pointer: you can use a dictionary to determine how good your plaintext candidate is. On a UNIX system with aspell installed, you can extract an English word list with the command

aspell -l en dump master

Thomas 2010-02-21 19:52:52

Answer 8

+1 A:

Frequency Analysis

Frequency analysis is a great place to start. However, Romeo and Juliet is not a very good choice to take character frequencies from to decipher Pride and Prejudice text. I would suggest using frequencies from this page because it uses 7 different texts that are closer in age to Pride and Prejudice. It also lists probabilities for digraphs and trigraphs. However, digraphs and trigraphs may not be as useful when spaces are removed from the text because this introduces the noise of digraphs and trigraphs created by words being mashed together.

Another resource for character frequencies is this site. It claims to use 'a good mix of different literary genres.'

Frequency analysis generally becomes more probabilistically correct with increased length of the encrypted text as you've seen. Frequency analysis also only helps to suggest the right direction in which to go. For instance, the encrypted character with the highest frequency may be the e, but it could also very well be the a which also has a high frequency. One common method is to start with some of the highest frequency letters in the given language, try matching these letters with different letters of high frequency in the text, and look to see if they form common words like the, that, is, as, and, and so on. Then you go from there.

A Good Introductory Book

If you are looking for a good layman introduction to cryptography, you might try The Code Book by Simon Singh. It's very readable and interesting. The books looks at the development of codes and codebreaking throughout history. He covers substitution ciphers fairly early on and describes some common methods for breaking them. Also, he had a Cipher Challenge in the book (which has already been completed) that consisted of some various codes to try to break including some substitution ciphers. You might try reading through how the Swedish team broke these ciphers at this site. However, I might suggest reading at least through the substitution cipher part of the book before reading these solutions.

By the way I'm not affiliated in any way with the publication of this book. I just really enjoyed it.

Justin Peel 2010-02-22 21:58:59

Answer 9

+1 A:

Regarding digraphs, digrams and word approximations, John Pierce (co-inventor of the transistor and PCM) wrote an excellent book, Introduction to Information Theory, that contains an extended analysis of calculating their characteristics, why you would want to and how to locate them. I found it helpful when writing a frequency analysis decryption code myself.

Also, you will probably want to write an ergodic source to feed your system, rather than relying on a single source (e.g., a novel).

fatcat1111 2010-02-22 22:05:45

ansaurus

tags:

views:

answers:

How do I improve breaking substitution ciphers programmically?

related questions