ansaurus

Question

Converting UTF-8 (or other 8-bit encoding) to 7 or fewer bits.

Answer 1

+15 A:

Just use gzip compression, and save 60-70% with 0% effort!
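For reference, a minimal sketch of what that looks like in Python 3 with the standard gzip module. The sample text is made up for illustration, and how much you actually save depends heavily on the input, so treat the printed sizes as something to measure on your own data rather than a promise.

```python
import gzip

# Hypothetical sample text; real savings depend heavily on the input.
text = "The quick brown fox jumps over the lazy dog. " * 200
raw = text.encode("utf-8")

packed = gzip.compress(raw)          # compress in memory
restored = gzip.decompress(packed)   # lossless round trip

print(len(raw), "bytes raw,", len(packed), "bytes compressed")
```

Very short inputs can come out larger than they went in, because the gzip format adds a fixed header and trailer.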

Jonathan Feinberg 2009-12-03 04:46:09

I second this. No need to do fancy and error-prone bit fumbling when there are perfectly good compression algorithms that will most likely do a much better job than a simple mapping from 8 to 7 bits.

kigurai 2009-12-03 09:01:20

Answer 2

+4 A:

Do you understand that files are divided into bytes? Thus, if you did that, you'd have 7 bits of the first letter in byte 1, plus 1 bit of the second letter; then in byte 2, you'd have 6 bits of the second letter and 2 bits of the third, and so on. It would look like this:

|AAAAAAAB|BBBBBBCC|CCCCCDDD|DDDDEEEE|EEEFFFFF|FF...
 \------/ \------/ \------/ \------/ \------/
   byte     byte     byte     byte     byte
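The layout in the diagram can be sketched in Python 3 roughly as follows. This is only an illustration under the assumption that every input value fits in 7 bits (plain ASCII); the names pack7 and unpack7 are made up for this example. The unpacker needs the original character count, because the zero padding in the last byte could otherwise be read as one extra character.

```python
def pack7(data: bytes) -> bytes:
    """Pack 7-bit values end to end, most significant bit first."""
    acc = 0      # bit accumulator
    nbits = 0    # number of bits currently held in acc
    out = bytearray()
    for value in data:
        if value > 0x7F:
            raise ValueError("only 7-bit values can be packed")
        acc = (acc << 7) | value
        nbits += 7
        while nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
        acc &= (1 << nbits) - 1   # keep only the bits not yet written
    if nbits:
        out.append((acc << (8 - nbits)) & 0xFF)  # pad the final byte with zeros
    return bytes(out)


def unpack7(packed: bytes, count: int) -> bytes:
    """Recover `count` 7-bit values from a stream produced by pack7."""
    acc = 0
    nbits = 0
    out = bytearray()
    for byte in packed:
        acc = (acc << 8) | byte
        nbits += 8
        while nbits >= 7 and len(out) < count:
            nbits -= 7
            out.append((acc >> nbits) & 0x7F)
        acc &= (1 << nbits) - 1
    return bytes(out)


message = b"ABCDEFGH"
packed = pack7(message)
print(len(message), "->", len(packed))
```

Eight characters become seven bytes, which is the 12.5% saving the scheme can reach at best.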

Thanatos 2009-12-03 04:48:20

+1 for funny ASCII diagram. If you had used the OP's scheme, you could have saved like 2 bytes.

Jonathan Feinberg 2009-12-03 04:54:34

Answer 3

A:

Your idea is unlikely to work as stated. If you write the byte 0x05 into a file, the whole byte gets written, all 8 bits of it, leading zeros included. To actually save space, you can pack every 8 characters into 7 bytes, since 8 values of 7 bits each need only 8*7 = 56 bits. One approach is to keep the first 7 values in the low 7 bits of their bytes, and spread the 7 bits of the 8th value over the MSBits of those 7 bytes. This only works if every value fits in 7 bits, as plain ASCII does.

As for Python, opening a file in binary write mode is open(filename, 'wb'). You'll also have to learn about bit operations to pack bytes as described above.

Just a small example:

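>>> a = 0x03                  # only its lowest bit is used below
>>> b = 0x59                  # a 7-bit value, so its MSBit is 0
>>> c = ((a & 0x1) << 7) | b  # lowest bit of a goes into the MSBit, b fills the rest
>>> hex(c)
'0xd9'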

This places the lowest bit of a into the MSBit of c and the rest of c is the value of b.

I'm sure you can take it from here.
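One possible way to take it from there, as a rough Python 3 sketch of the "spread the 8th value over the MSBits" approach. The function names are made up for this example, and it assumes the input comes in blocks of exactly eight values that all fit in 7 bits; handling a final short block is left out.

```python
def pack8to7(block: bytes) -> bytes:
    """Pack eight 7-bit values into seven bytes."""
    if len(block) != 8 or any(v > 0x7F for v in block):
        raise ValueError("need exactly eight values below 0x80")
    eighth = block[7]
    out = bytearray()
    for i in range(7):
        bit = (eighth >> i) & 0x1          # bit i of the eighth value
        out.append((bit << 7) | block[i])  # store it in the MSBit of byte i
    return bytes(out)


def unpack7to8(packed: bytes) -> bytes:
    """Reverse pack8to7: seven bytes back into eight 7-bit values."""
    if len(packed) != 7:
        raise ValueError("need exactly seven bytes")
    eighth = 0
    out = bytearray()
    for i, byte in enumerate(packed):
        out.append(byte & 0x7F)        # low 7 bits are the original value
        eighth |= (byte >> 7) << i     # MSBit is bit i of the eighth value
    out.append(eighth)
    return bytes(out)


packed = pack8to7(b"ABCDEFGH")
print(len(packed), unpack7to8(packed))
```

The packed bytes can then be written with a file opened in binary mode, as in open(filename, 'wb').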

Eli Bendersky 2009-12-03 04:51:04

This is exactly what I wanted. Thank you.

Bttc 2009-12-03 14:21:04

Answer 4

+2 A:

Your idea is on the right track, but needs some development. If you're interested in this kind of data compression, you may want to investigate Huffman coding. This is a simple data compression technique that is used in many real-world situations.
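To give a feel for the idea, here is a small Python 3 sketch that builds a Huffman code table and counts how many bits the encoded message would need. It is an illustration only, not a full compressor: it does not write the bits out, and a real file would also have to store the code table (or the frequencies) so the data can be decoded. The function name huffman_codes is made up for this example.

```python
import heapq
from collections import Counter


def huffman_codes(data: bytes) -> dict:
    """Return a mapping from byte value to its Huffman code as a bit string."""
    freq = Counter(data)
    if not freq:
        return {}
    if len(freq) == 1:
        # Degenerate case: a single distinct symbol still needs one bit.
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, tie_breaker, {symbol: code_so_far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees; prefix their codes with 0 and 1.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


text = b"abracadabra"
codes = huffman_codes(text)
encoded_bits = sum(len(codes[b]) for b in text)
print(encoded_bits, "bits instead of", 8 * len(text))
```

Frequent symbols get short codes and rare ones get long codes, which is why this can beat a fixed 7 bits per character on skewed input.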

I can recommend The Data Compression Book by Mark Nelson which is a great introduction to data compression techniques.

Greg Hewgill 2009-12-03 04:56:28

Ah. I'll look that up, then.

Bttc 2009-12-03 14:19:24

Answer 5

A:

What you need is UTF-7.

Edit: UTF-7 has the advantage of bloating "only" special characters, so if special characters are rare in the input, you get far less bytes than by just converting UTF-8 to 7 bit. That's what UTF-7 is for.

Stefan Schultze 2009-12-03 08:57:43

UTF-7 is an encoding designed for transmission over 7-bits-wide channels. Whatever the OP really needs, it's NOT that. `u'\xa0'.encode('utf8')` -> `'\xc2\xa0' # 2 bytes`; however `u'\xa0'.encode('utf7')` -> `'+AKA-' # FIVE bytes` ... the OP was hoping to decrease the size by 12.5%; you would INCREASE the size by 150%.

John Machin 2009-12-03 10:30:34

Please see my edit above.

Stefan Schultze 2009-12-03 10:35:21

What advantage? There are NO Unicode characters for which the UTF-7 encoding is fewer bytes than the UTF-8 encoding. Let's assume that the input consists of only alphabetic ASCII characters. 'abcde' takes 5 bytes in UTF-7 and 5 bytes in UTF-8 -- how does that reconcile with the "far less bytes" story?
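The sizes quoted in these comments are easy to check. A quick sketch in Python 3 syntax (the snippets above use Python 2 literals), using only the two sample strings already mentioned in this thread:

```python
samples = ["abcde", "\xa0"]

sizes = {}
for s in samples:
    utf8 = s.encode("utf-8")
    utf7 = s.encode("utf-7")
    sizes[s] = (len(utf8), len(utf7))
    print(repr(s), "utf-8:", len(utf8), "bytes,", "utf-7:", len(utf7), "bytes")
```

For both samples the UTF-7 form is at least as long as the UTF-8 form.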

John Machin 2009-12-03 12:53:40

Answer 6

A:

"this would simply leave an extra 0 on the binary string--11011001 might get mapped to 01000001 rather than 1000001, and I won't actually save space."

Correct. Your plan will do nothing.

S.Lott 2009-12-03 11:35:44
