I want to take a UTF-8 encoded file that uses no more than 128 distinct characters and convert it to a 7-bit encoding to save 1/8 of the space. For example, if I have a 16 MB text file that only uses the first 128 (ASCII) characters, I would like to shave off the unused bit of each byte to reduce the file to 14 MB.
How would I go about doing this?
There doesn't seem to be an existing free or proprietary program that does this, so I was thinking I might try to write a simple (if inefficient) one.
The basic idea is to define a mapping from the byte values currently used for each character to the 128 values available in the 7-bit encoding, then scan through the file and write each mapped value to a new file.
So if the file looked like this (I'll use a decimal example because I try not to think in hex):
127 254 025 212 015 015 132...
it would become
001 002 003 004 005 005 006...
if 127 mapped to 001, 254 mapped to 002, and so on.
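To make the mapping idea concrete, here is a rough Python sketch. It assumes each byte is treated as one symbol and that codes are handed out in order of first appearance, starting at 1 as in the example above. The name `build_code_table` is made up for illustration.

```python
def build_code_table(data: bytes, first_code: int = 1) -> dict:
    """Give each distinct byte value a small code, in order of first appearance.

    Hypothetical helper: first_code=1 mirrors the example above; starting
    at 0 instead would make all 128 seven-bit values available.
    """
    table = {}
    for value in data:  # iterating over bytes yields ints in 0..255
        if value not in table:
            table[value] = first_code + len(table)
    # The largest code handed out must still fit in 7 bits (0..127).
    if table and first_code + len(table) - 1 > 127:
        raise ValueError("too many distinct byte values for a 7-bit code")
    return table


sample = bytes([127, 254, 25, 212, 15, 15, 132])
table = build_code_table(sample)
codes = [table[b] for b in sample]
print(codes)
```

Note that the table itself would have to be stored somewhere for the file to be decodable later, which this sketch does not address.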
I'm not entirely sure about a couple of things, though.
- Would this be enough to actually shrink the file? I have a bad feeling this would simply leave an extra 0 at the front of each binary value: 11011001 might get mapped to 01000001 rather than 1000001, so each character would still take a full byte and I wouldn't save any space. If that is what happens, how do I get rid of the zero?
- How do I open the file to read/write in binary/decimal/hex rather than just text? I've mostly worked with Python, but I can muddle through C if I must.
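For what it's worth, here is a minimal sketch of what I imagine both points might look like in Python. It assumes the values are already in the range 0..127, packs them 7 bits at a time (most significant bit first), and uses `open()` in `"rb"`/`"wb"` mode to work with raw bytes instead of text. The names `pack7`, `unpack7` and `pack_file`, and the 8-byte length header, are my own inventions for illustration, not an existing format.

```python
def pack7(codes) -> bytes:
    """Pack values in 0..127 into bytes, 7 bits each, most significant bit first."""
    acc = 0    # bit accumulator
    nbits = 0  # number of valid bits currently held in acc
    out = bytearray()
    for c in codes:
        if not 0 <= c < 128:
            raise ValueError(f"value {c} does not fit in 7 bits")
        acc = (acc << 7) | c
        nbits += 7
        while nbits >= 8:          # emit every complete byte
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
        acc &= (1 << nbits) - 1    # keep only the bits not yet written
    if nbits:                      # pad the last partial byte with zero bits
        out.append((acc << (8 - nbits)) & 0xFF)
    return bytes(out)


def unpack7(packed: bytes, count: int) -> list:
    """Recover `count` 7-bit values from bytes produced by pack7."""
    acc = 0
    nbits = 0
    out = []
    for b in packed:
        acc = (acc << 8) | b
        nbits += 8
        while nbits >= 7 and len(out) < count:
            nbits -= 7
            out.append((acc >> nbits) & 0x7F)
        acc &= (1 << nbits) - 1
    return out


def pack_file(src_path: str, dst_path: str) -> None:
    """Read raw bytes with "rb", write the packed result with "wb"."""
    with open(src_path, "rb") as src:   # binary mode: read() returns bytes
        data = src.read()
    with open(dst_path, "wb") as dst:
        # Assumed layout: 8-byte big-endian symbol count, then the packed bits.
        dst.write(len(data).to_bytes(8, "big"))
        dst.write(pack7(data))


packed = pack7(b"ABCDEFGH")
print(len(packed))                      # 8 symbols * 7 bits = 56 bits = 7 bytes
print(bytes(unpack7(packed, 8)))
```

The symbol count is stored because the zero padding in the final byte could otherwise be mistaken for one more symbol with value 0.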
Thank you.