views:

292

answers:

6

I wish to take a file encoded in UTF-8 that doesn't use more than 128 different characters, then move it to a 7-bit encoding to save the 1/8 of space. For example, if I have a 16 MB text file that only uses the first 128(ascii) characters, I would like to shave off the extra bit to reduce the file to 14MB.

How would I go about doing this?

There doesn't seem to be an existing free or proprietary program to do so, so I was thinking I might try and make a simple(if inefficient) one.

The basic idea I have is to make a function from the current hex/decimal/binary values used for each character to the 128 values I would have in the seven bit encoding, then scan through the file and write each modified value to a new file.

So if the file looked like(I'll use a decimal example because I try not to have to think in hex)

127 254 025 212 015 015 132... It would become

001 002 003 004 005 005 006

If 127 mapped to 001, 254 mapped to 005, etc.

I'm not entirely sure on a couple things, though.

  1. Would this be enough to actually shorten the filesize? I have a bad feeling this would simply leave an extra 0 on the binary string--11011001 might get mapped to 01000001 rather than 1000001, and I won't actually save space. If this would happen, how do I get rid of the zero?
  2. How do I open the file to read/write in binary/decimal/hex rather than just text? I've mostly worked with Python, but I can muddle through C if I must.

Thank you.

+15  A: 

Just use gzip compression, and save 60-70% with 0% effort!

Jonathan Feinberg
I second this. No need to do fancy and error prone bit fumbling when there are perfectly good compression algorithms that will most likely do a much better job compared to a simple mapping from 8 to 7 bits.
kigurai
+4  A: 

Do you understand that files are divided into bytes? Thus, if you did that, you'd have 7 bits of the first letter in bytes 1, plus 1 bit of the second letter, then in byte two, you'd have 6 bits of the second letter, and 2 bits of the third, so on. It would look like this:

|AAAAAAAB|BBBBBBCC|CCCCCDDD|DDDDEEEE|EEEFFFFF|FF...
 \------/ \------/ \------/ \------/ \------/
   byte     byte     byte     byte     byte
Thanatos
+1 for funny ASCII diagram. If you had used the OP's scheme, you could have saved like 2 bytes.
Jonathan Feinberg
A: 

Your idea is unlikely to work. If you write the byte 0x05 into a file, the byte gets written, all 8 bits of it - with leading zeros. To actually accomplish what you need, you can encode each 8 bytes in 7 bytes (since you only need 8*7 bits to encode 8 values). One approach is keep the 7 values in the 7 low bits of their bytes, and spread the 8th byte over the 7 MSBits.

As for Python, opening a file in binary write mode is open(filename, 'wb'). You'll also have to learn about bit operations to pack bytes as described above.

Just a small example:

>>> a = 0x03
>>> b = 0x59
>>> c = ((a & 0x1) << 7) | b
>>> hex(c)
'0xd9'
>>>

This places the lowest bit of a into the MSBit of c and the rest of c is the value of b.

I'm sure you can take it from here.

Eli Bendersky
This is exactly what I wanted. Thank you.
Bttc
+2  A: 

Your idea is on the right track, but needs some development. If you're interested in this kind of data compression, you may want to investigate Huffman coding. This is a simple data compression technique that is used in many real-world situations.

I can recommend The Data Compression Book by Mark Nelson which is a great introduction to data compression techniques.

Greg Hewgill
Ah. I'll look that up, then.
Bttc
A: 

What you need is UTF-7.

Edit: UTF-7 has the advantage of bloating "only" special characters, so if special characters are rare in the input, you get far less bytes than by just converting UTF-8 to 7 bit. That's what UTF-7 is for.

Stefan Schultze
UTF-7 is an encoding designed for transmission over 7-bits-wide channels. Whatever the OP really needs, it's NOT that. `u'\xa0'.encode('utf8')` -> `'\xc2\xa0' # 2 bytes`; however `u'\xa0'.encode('utf7')` -> `'+AKA-' # FIVE bytes` ... the OP was hoping to decrease the size by 12.5%; you would INCREASE the size by 150%.
John Machin
Please see my edit above.
Stefan Schultze
What advantage? There are NO Unicode characters for which the UTF-7 encoding is fewer bytes than the UTF-8 encoding. Let's assume that the input consists of only alphabetic ASCII characters. 'abcde' takes 5 bytes in UTF-7 and 5 bytes in UTF-8 -- how does that reconcile with the "far less bytes" story?
John Machin
A: 

"this would simply leave an extra 0 on the binary string--11011001 might get mapped to 01000001 rather than 1000001, and I won't actually save space."

Correct. Your plan will do nothing.

S.Lott