views:

132

answers:

5

I receive encoded PDF files regularly. The encoding works like this:

  • the PDFs can be displayed correctly in Acrobat Reader
  • select all and copy the test via Acrobat Reader
  • and paste in a text editor
  • will show that the content are encoded

so, examples are:

13579 -> 3579;
hello -> jgnnq

it's basically an offset (maybe swap) of ASCII characters.

The question is how can I find the offset automatically when I have access to only a few samples. I cannot be sure whether the encoding offset is changed. All I know is some text will usually (if not always) show up, e.g. "Name:", "Summary:", "Total:", inside the PDF.

Thank you!

edit: thanks for the feedback. I'd try to break the question into smaller questions:

Part 1: http://stackoverflow.com/questions/2712466/how-to-detect-identical-parts-inside-string

+5  A: 

You need to brute-force it.

If those patterns are simple like +2 character code like in your examples (which is +2 char codes)

h i j
e f g
l m n
l m n
o p q

1 2 3
3 4 5
5 6 7
7 8 9
9 : ;

You could easily implement like this to check against knowns words

>>> text='jgnnq'
>>> knowns=['hello', '13579']
>>>
>>> for i in range(-5,+5): #check -5 to +5 char code range
...     rot=''.join(chr(ord(j)+i) for j in text)
...     for x in knowns:
...         if x in rot:
...             print rot
...
hello
S.Mark
+1: brute force is entirely appropriate when you have limited encryption methods and a crib. This is not Enigma. You may find that the lack of sample is an issue but once you've broken one encryption the rest should fall easily.
High Performance Mark
+1  A: 

Hmmm, thats a tough one.

The only thing I can suggest is using a dictionary (along with some substitution cipher algorithms) may help in decoding some of the text.

But I cannot see a solution that will decode everything for you with the scenario you describe.

Why don't you paste some sample input and we can have ago at decoding it.

zaf
@zaf, sorry that I cannot paste the real data here as it contains personal data. I am quite sure the encoding is by means of ASCII addition or ASCII character swapping.
ohho
OK. You'll have to test the dictionary idea with the simple decoding ideas you have and see if you can extract more info.
zaf
A: 

Do the encoded files open correctly in PDF readers other than Acrobat Reader? If so, you could just use a PDF library (e.g. PDF Clown) and use it to programmatically extract the text you need.

Aistina
I can extract text (in encoded form) same as Acrobat Reader, by pdfminer (a python pdf tool)
ohho
+1  A: 

It's only possible then you have a lot of examples (examples count stops then: possible to get all the combinations or just an linear values dependency or idea of the scenario).

also this question : http://stackoverflow.com/questions/988642/how-would-i-reverse-engineer-a-cryptographic-algorithm have some advices.

Lukas Šalkauskas
+3  A: 

Is the PDF going to contain symbolic (like math or proofs) or natural language text (English, French, etc)?

If the latter, you can use a frequency chart for letters (digraphs, trigraphs and a small dictionary of words if you want to go the distance). I think there are probably a few of these online. Here's a start. And more specifically letter frequencies.

Then, if you're sure it's a Caesar shift, you can grab the first 1000 characters or so and shift them forward by increasing amounts up to (I would guess) 127 or so. Take the resulting texts and calculate how close the frequencies match the average ones you found above. Here is information on that.

The linked letter frequencies page on Wikipedia shows only letters, so you may want to exclude them in your calculation, or better find a chart with them in it. You may also want to transform the entire resulting text into lowercase or uppercase (your preference) to treat letters the same regardless of case.

Edit - saw comment about character swapping

In this case, it's a substitution cipher, which can still be broken automatically, though this time you will probably want to have a digraph chart handy to do extra analysis. This is useful because there will quite possibly be a substitution that is "closer" to average language in terms of letter analysis than the correct one, but comparing digraph frequencies will let you rule it out.

Also, I suggested shifting the characters, then seeing how close the frequencies matched the average language frequencies. You can actually just calculate the frequencies in your ciphertext first, then try to line them up with the good values. I'm not sure which is better.

Phil
so far, the samples I can get hold of are readable ASCIIs (i.e. every readable character in Acrobat is encoded in another human-readable ASCII character)
ohho
Oh, that's perfect then. So then you have an upper bound on the shift length (though I'm not sure - do they make an effort to skip 128-255 by wrapping 128 to 0?). If I were at home I could crack open my old Matlab program from college (think I still have it) and show you the calculations involved. I made a program just like the one you are requesting as part of a crypto course
Phil
@Phil, yes, when I tried to solve it, I have the feeling of doing a college assignment ;-)
ohho