I have written (and am still writing) a program that analyses encrypted text and attempts to break it using frequency analysis. The encryption is a simple substitution cipher: each letter is replaced by some other letter, e.g. a->m, b->z, c->t and so on. All spaces and non-alphabetic characters are removed, and uppercase letters are made lowercase.
An example would be:
Original input - thisisasamplemessageitonlycontainslowercaseletters
Encrypted output - ziololqlqdhstdtllqutozgfsnegfzqoflsgvtkeqltstzztkl
Attempt at cracking - omieieaeanuhtnteeawtiorshylrsoaisehrctdlaethtootde
Here it has only got I, A and Y correct.
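For illustration, the encryption step can be sketched in Python as below. This is a minimal sketch, not the actual program: the names `normalise` and `encrypt` are hypothetical, and the key is an assumption (the alphabet in keyboard order), chosen because it reproduces the encrypted output in the example above.

```python
import string

# Assumed key for illustration: KEY[i] replaces the i-th letter of the alphabet.
KEY = "qwertyuiopasdfghjklzxcvbnm"

def normalise(text):
    """Drop spaces and non-alphabetic characters, and lowercase the rest."""
    return "".join(ch.lower() for ch in text if ch in string.ascii_letters)

def encrypt(text, key=KEY):
    """Apply a simple substitution cipher to the normalised text."""
    table = str.maketrans(string.ascii_lowercase, key)
    return normalise(text).translate(table)

print(encrypt("This is a sample message, it only contains lowercase letters"))
# ziololqlqdhstdtllqutozgfsnegfzqoflsgvtkeqltstzztkl
```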
Currently my program cracks it by analysing the frequency of each individual character in the encrypted text and mapping it to the character that has the same frequency rank in an unencrypted reference text.
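The rank-matching approach described above could look roughly like the following sketch. The names `rank_letters` and `crack_by_rank` are hypothetical, and breaking ties alphabetically is an assumption; the actual program may order equally frequent letters differently.

```python
import string
from collections import Counter

def rank_letters(text):
    """Return all 26 letters ordered from most to least frequent in text."""
    counts = Counter(ch for ch in text.lower() if ch in string.ascii_lowercase)
    # Missing letters count as 0; ties are broken alphabetically (an assumption).
    return sorted(string.ascii_lowercase, key=lambda ch: (-counts[ch], ch))

def crack_by_rank(ciphertext, reference_text):
    """Map the n-th most frequent cipher letter to the n-th most frequent reference letter."""
    cipher_rank = "".join(rank_letters(ciphertext))
    plain_rank = "".join(rank_letters(reference_text))
    return ciphertext.translate(str.maketrans(cipher_rank, plain_rank))
```

With short inputs many letters have equal or nearly equal counts, so the ranking (and therefore the mapping) depends heavily on the tie-breaking rule.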
I am looking for ways to improve the accuracy of my program, as at the moment it doesn't get many characters right. For example, when attempting to crack X characters from Pride and Prejudice, I get:
1600 - 10 letters correct
800 - 7 letters correct
400 - 2 letters correct
200 - 3 letters correct
100 - 3 letters correct.
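To make the "letters correct" figure concrete, here is a small sketch of one way to compute it. It assumes the figure counts distinct letters of the alphabet that were recovered correctly, as in the I, A and Y example; the name `correct_letters` is hypothetical.

```python
def correct_letters(plaintext, guess):
    """Return the distinct plaintext letters that the guess gets right everywhere."""
    wrong = {p for p, g in zip(plaintext, guess) if p != g}
    return set(plaintext) - wrong

original = "thisisasamplemessageitonlycontainslowercaseletters"
attempt = "omieieaeanuhtnteeawtiorshylrsoaisehrctdlaethtootde"
print(sorted(correct_letters(original, attempt)))
# ['a', 'i', 'y']
```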
I am using Romeo and Juliet as the reference text for the frequency data.
It has been suggested that I look at the frequency of character pairs, but I am unsure how to use this. Unless the encrypted text is very large, I imagine that ranking pairs the same way I rank single characters would be even less accurate and cause more errors than successes. I am also hoping to make my cracker more accurate for shorter inputs.
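For reference, collecting pair frequencies in the same way as single-letter frequencies might look like the sketch below (the name `bigram_counts` is hypothetical). It also shows why short inputs are a concern: there are 26 * 26 = 676 possible pairs, but a 100-character text contains only 99 overlapping pairs.

```python
from collections import Counter

def bigram_counts(text):
    """Count overlapping pairs of adjacent characters in text."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

counts = bigram_counts("ziololqlqdhstdtllqutozgfsnegfzqoflsgvtkeqltstzztkl")
# A 50-character text yields 49 pairs out of 676 possible ones.
print(sum(counts.values()), "pairs observed,", len(counts), "distinct")
```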
Any suggestions would be very helpful.
Thanks.