ansaurus

Question

Answer 1

+5 A:

You need to brute-force it.

If those patterns are simple like +2 character code like in your examples (which is +2 char codes)

h i j
e f g
l m n
l m n
o p q

1 2 3
3 4 5
5 6 7
7 8 9
9 : ;

You could easily implement like this to check against knowns words

>>> text='jgnnq'
>>> knowns=['hello', '13579']
>>>
>>> for i in range(-5,+5): #check -5 to +5 char code range
...     rot=''.join(chr(ord(j)+i) for j in text)
...     for x in knowns:
...         if x in rot:
...             print rot
...
hello

S.Mark 2010-04-26 08:23:28

+1: brute force is entirely appropriate when you have limited encryption methods and a crib. This is not Enigma. You may find that the lack of sample is an issue but once you've broken one encryption the rest should fall easily.

High Performance Mark 2010-04-26 08:47:47

Answer 2

+1 A:

Hmmm, thats a tough one.

The only thing I can suggest is using a dictionary (along with some substitution cipher algorithms) may help in decoding some of the text.

But I cannot see a solution that will decode everything for you with the scenario you describe.

Why don't you paste some sample input and we can have ago at decoding it.

zaf 2010-04-26 08:31:53

@zaf, sorry that I cannot paste the real data here as it contains personal data. I am quite sure the encoding is by means of ASCII addition or ASCII character swapping.

ohho 2010-04-26 08:38:49

OK. You'll have to test the dictionary idea with the simple decoding ideas you have and see if you can extract more info.

zaf 2010-04-26 08:48:40

Answer 3

A:

Do the encoded files open correctly in PDF readers other than Acrobat Reader? If so, you could just use a PDF library (e.g. PDF Clown) and use it to programmatically extract the text you need.

Aistina 2010-04-26 08:43:57

I can extract text (in encoded form) same as Acrobat Reader, by pdfminer (a python pdf tool)

ohho 2010-04-26 08:52:10

Answer 4

+1 A:

It's only possible then you have a lot of examples (examples count stops then: possible to get all the combinations or just an linear values dependency or idea of the scenario).

also this question : http://stackoverflow.com/questions/988642/how-would-i-reverse-engineer-a-cryptographic-algorithm have some advices.

Lukas Šalkauskas 2010-04-26 08:44:34

Answer 5

+3 A:

Is the PDF going to contain symbolic (like math or proofs) or natural language text (English, French, etc)?

If the latter, you can use a frequency chart for letters (digraphs, trigraphs and a small dictionary of words if you want to go the distance). I think there are probably a few of these online. Here's a start. And more specifically letter frequencies.

Then, if you're sure it's a Caesar shift, you can grab the first 1000 characters or so and shift them forward by increasing amounts up to (I would guess) 127 or so. Take the resulting texts and calculate how close the frequencies match the average ones you found above. Here is information on that.

The linked letter frequencies page on Wikipedia shows only letters, so you may want to exclude them in your calculation, or better find a chart with them in it. You may also want to transform the entire resulting text into lowercase or uppercase (your preference) to treat letters the same regardless of case.

Edit - saw comment about character swapping

In this case, it's a substitution cipher, which can still be broken automatically, though this time you will probably want to have a digraph chart handy to do extra analysis. This is useful because there will quite possibly be a substitution that is "closer" to average language in terms of letter analysis than the correct one, but comparing digraph frequencies will let you rule it out.

Also, I suggested shifting the characters, then seeing how close the frequencies matched the average language frequencies. You can actually just calculate the frequencies in your ciphertext first, then try to line them up with the good values. I'm not sure which is better.

Phil 2010-04-26 08:55:01

so far, the samples I can get hold of are readable ASCIIs (i.e. every readable character in Acrobat is encoded in another human-readable ASCII character)

ohho 2010-04-26 09:05:49

Oh, that's perfect then. So then you have an upper bound on the shift length (though I'm not sure - do they make an effort to skip 128-255 by wrapping 128 to 0?). If I were at home I could crack open my old Matlab program from college (think I still have it) and show you the calculations involved. I made a program just like the one you are requesting as part of a crypto course

Phil 2010-04-26 09:13:11

@Phil, yes, when I tried to solve it, I have the feeling of doing a college assignment ;-)

ohho 2010-04-26 09:57:24

ansaurus

tags:

views:

answers:

decoding algorithm wanted

related questions