views:

366

answers:

8

Hello everyone.

I want to write a java program that searches through a cipher text and returns a frequency count of the characters in the cipher, for example the cipher: "jshddllpkeldldwgbdpked" will have a result like this:

2 letter occurrences:

pk = 2, ke = 2, ld = 2

3 letter occurrences:

pke = 2.

Any algorithm that allows me to do this as efficiently as possible?

+2  A: 

The usual approach would be to use some kind of map to map your characters to their counts. You can use a HashMap<Character, Integer> for example. You can then iterate through your ciphertext, character-wise and either put the character into the map with a count of 1 (if it doesn't yet exist) or increment its count.

Joey
+1  A: 

Either have an array with a cell for each possible value (easy if the cipher text is all lower case characters - 26 - harder if not) or go for a Map where you pass in the character and increment the value in either case. The array is quicker but less flexible.

Chris
+4  A: 

The map strategy is a good one, but I'd go for HashMap<String, Integer>, since it's tuples of characters being counted.

Iterating over the characters in the ciphertext, you can save the last X characters and that will give you a map over all occurrences of substrings of length X+1.

Buhb
+1  A: 

You can use hash or graph (Thanks to outis, I know it's special name now, such kind of graphs is called "trie"). Hash will be slower, graph will be faster. Hash will get less memory, graph will take more in bad implementation.

You cannot get it done using array since it will get HUGE amount of memory if your maximum char sequence length is equal to your text length, and text is long enough. If you will limit it it will get smth like ([number of letters]^[max sequence length])*4 bytes which will be (52^4)*4 ~= 24Mb of memory for 4 lower/upper letter sequence. If limited sequence length is ok for you and this memory amount is normal than algorithm will be pretty easy for <=4 letters in sequence.

stroncium
A: 

You could start by looking for the largest possible repeatable sequence first then work your way down from there. For example if the string is 10 characters the largest repeatable sequence that could occur would be 5 letters long so first look for 5 letter sequences then 4 letters and so on till you reach 2. This should reduce the number of iterations in your program.

Gordon
Depends on if repeatable sequences are allowed to overlap. A string of 10 'A' has a size 9 repeatable sequence.
Buhb
I am assuming letters can be reused so the program would find 2x'aaaaa' 2x'aaaa' 3x'aaa' 5x'aa' in that order. It would be simpler alright if letters could be disregarded after use.
Gordon
+2  A: 

You could store the n-grams in a trie, reversing the normal order so the last character in the n-gram is at the top of the trie. Each node in the trie stores a character count. Loop over the string, keeping track of the last N characters (as the unknown googler suggests). Each time through the outer loop, you traverse the trie, using the last N characters to pick the path, starting with the last character and ending with the Nth to last. For each node you visit, incrementing its counter.

To print the n-gram frequencies, perform a breadth-first traversal of the trie.

Overall performance left as an exercise.

outis
+1  A: 

If the set of lengths of sequences you need is fixed, the obvious algorithm takes a linear number of counting operations (say, looking up a counter in a hashtable and incrementing it).

When you say "as efficiently as possible", do you propose to spend a lot of effort for a meagre constant-factor improvement, to search hopelessly for a sublinear algorithm, or do you not understand algorithm complexity classes at all?

A: 

I dont have an answer in mind for this,

But I feel, this algorithm is the exact same as the algorithm used by compression algorithms to create compressed files with the dictionary approach.

If I am not wrong, in this approach, a dictionary is used in the following manner:

data:

abccccabaccabcaaaaabcaaabbbbbccccaaabcbbbbabbabab

parse 1 : key: * value: abc

new data:

*cccabacc*aaaa*aaabbbbbccccaa*bbbbabbabab

Just an educated guess, I think (not sure here) the standard "zip" file uses this approach, so I suggest you look at these algorithms

Salvin Francis
I dont knw why but I have used a lot of "negatives" in the above sentence's grammar :)
Salvin Francis