views:

144

answers:

2

I want to find (not generate) 2 text strings such that, after removing all non letters and ucasing, one string can be translated to the other by simple substitution.

The motivation for this comes from a project I known of that is testing methods for attacking cyphers via probability distributions. I'd like to find a large, coherent plain text that, once encrypted with a simple substitution cypher, can be decrypted to something else that is also coherent.

This ends up as 2 parts, find the longest such strings in a corpus, and get that corpus.


The first part seems to me to be amiable to some sort of attack with a B-tree keyed off the string after a substitution that makes the sequence of first occurrences sequential.

HELLOWORLDTHISISIT
1233454637819a9b98

A little optimization based on knowing the maximum value and length of the string based on each depth of the tree and the rest is just coding.


The Other part would be quite a bit more involved; how to generate a large corpus of text to search? some kind of internet spider would seem to be the ideal approach as it would have access to the largest amount of text but how to strip it to just the text?

The question is; Any ideas on how to do this better?


Edit: the cipher that was being used is an insanely basic 26 letter substitution cipher.

p.s. this is more a thought experiment then a probable real project for me.

A: 

I think you're asking a bit much to generate a substitution that is also "coherent". That is an AI problem for the encryption algorithm to figure out what text is coherent. Also, the longer your text is the more complicated it will be to create a "coherent" result... quickly approaching a point where you need a "key" as long as the text you are encrypting. Thus defeating the purpose of encrypting it at all.

SoapBox
Note, I Explicitly DON'T want to /generate/ the text. Both sides must be /found/ written by some human other than my self. The point being to find a text that will decrypt to the wrong plain-text.
BCS
+1  A: 

There are 26! different substitution ciphers. That works out to a bit over 88 bits of choice:

>>> math.log(factorial(26), 2)
88.381953327016262

The entropy of English text is something like 2 bits per character at least. So it seems to me you can't reasonably expect to find passages of more than 45-50 characters that are accidentally equivalent under substitution.

For the large corpus, there's the Gutenberg Project and Wikipedia, for a start. You can download an dump of all the English Wikipedia's XML files from their website.

Darius Bacon
Wikipidia! good to known! Also, I like your analysis. To make the project fun I would only need about 13-26 letters long so 30-40 would be more than good enough.
BCS