views: 145
answers: 7

Can a long ASCII text string be crushed and compressed into a hash-like ASCII "checksum" by using a sophisticated mathematical formula/algorithm, just like air, which can be compressed? The idea is to compress megabytes of ASCII text into 128 or so bytes by shuffling and then mixing in new "patterns" of single bytes, turn by turn, from the first to the last. When decompressing, the last character is extracted first, and then decompression continues using the formula and the sequential keys from the last to the first. The sequential keys, the last and first bytes, the fully updated final compiled string, and the total number of bytes that were compressed must all be exactly known. This is the "terra compression" I was thinking about. Is this possible? Can you explain with examples? I am working on this theory and it is my own thought.

+3  A: 

In general? Absolutely not.

For some specific cases? Yup. A megabyte of ASCII text consisting only of spaces is likely to compress extremely well. Real text will generally compress pretty well... but not on the order of several megabytes into 128 bytes.

Think about just how many strings - even just strings of valid English words - can fit into several megabytes. Far more than 256^128. They can't all compress down to 128 bytes, by the pigeon-hole principle...
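
To put rough numbers on that counting argument, here is a quick back-of-the-envelope sketch in Python (the 1 MB figure is just an illustrative size):

    # How many distinct inputs vs. outputs exist?
    output_bits = 128 * 8            # a 128-byte "checksum": 2^1024 possible values
    input_bits = 1_000_000 * 8       # a 1 MB file: 2^8000000 possible values

    print(f"distinct 128-byte outputs: 2^{output_bits}")
    print(f"distinct 1 MB inputs:      2^{input_bits}")
    print(f"inputs per output (avg):   2^{input_bits - output_bits}")
    # On average about 2^7998976 different megabyte files would have to share
    # each 128-byte value, so the mapping cannot possibly be reversible.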

Jon Skeet
You probably meant 2^8^128, not 8^128
Rotsor
@Rotsor: Yes indeed. Editing.
Jon Skeet
+3  A: 

If you have n possible input strings and m possible compressed strings, and m is less than n, then at least two input strings must map to the same compressed string. This is called the pigeonhole principle and is the fundamental reason why there is a limit on how much you can compress data.

What you are describing is more like a hash function. Many hash functions are designed so that, given the hash of a string, it is extremely unlikely that you can find another string that gives the same hash. But given a hash, there is no way to discover the original string. Even if you were able to reverse the hashing operation to produce a valid input with that hash, there are countless other inputs that would give the same hash, and you wouldn't know which of them is the "correct" one.
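
You can see the information loss directly with, for instance, a standard 128-bit hash such as MD5: it always produces a 16-byte digest no matter how large the input is, so almost all of the input's information is necessarily thrown away. A small sketch:

    import hashlib

    # A 128-bit hash yields 16 bytes regardless of input size.
    for data in (b"short", b"x" * 1_000_000):
        digest = hashlib.md5(data).digest()
        print(f"{len(data):>7} bytes in -> {len(digest)} bytes out: {digest.hex()}")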

Mark Byers
A:0_B:A:0_C:B...D:C...E:D...and so on and on.
Tush
If hash is reversible, this is possible: The Super Data Compression Byte by byte based.
Tush
The previous byte with the reversible checksum settles in the next byte state, the next byte's reversible state hash settles in the byte state next to that byte's state. This creates my own imagined Wheel Theory. The last byte is known and contains besides it, the final state computed of all past byte states.
Tush
@Tush: OK I'll play along... Why do you need 128 bytes? Couldn't you also make it work with 64 bytes? Or just 2 bytes? Or 1 bit? What is special about 128 bytes that makes it possible to store any amount of information?
Mark Byers
I just said it. The larger the integer, the more digits it stores.
Tush
And finally, when you can store all the strings you like, the output will be as big as your string (give or take some ordinary compression).
Eiko
@Tush - I can assure you that if you invent such "super compression" then we can store all the information of the universe in 128 bytes.
0x69
@0x69: Of course. Once you have a bunch of 128 byte strings, why not just concatenate them together and compress the whole thing to 128 bytes? :) I'll be glad when we don't need hard drives anymore due to the entire universe being representable in 128 bytes.
cHao
A: 

You can compress text to a certain degree because it doesn't use all the available byte values (i.e. a-z and A-Z make up only 52 out of 256 values). Repeating patterns also allow some intelligent storage (zip).

There is no way to store arbitrary large chunks of text in any fixed length number of bytes.

You can compress air, but you won't remove its molecules! Its mass stays the same.
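
A quick experiment with zlib shows both effects (the sizes here are only illustrative): repetitive text shrinks dramatically, while random bytes, which already use all 256 values with no pattern, barely shrink at all; neither ends up anywhere near a fixed 128 bytes.

    import os
    import zlib

    repetitive = b"the quick brown fox " * 50_000   # ~1 MB of repeating text
    random_bytes = os.urandom(len(repetitive))      # ~1 MB of random bytes

    for name, data in (("repetitive", repetitive), ("random", random_bytes)):
        compressed = zlib.compress(data, level=9)
        print(f"{name:10s}: {len(data)} -> {len(compressed)} bytes")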

Eiko
That mass may get shuffled from an old "state" into a newly updated "state" by using formulas. Random alphabets in formulated locations. Very complex maths may be used. Just as a tree grows and spreads its branches randomly and continues to grow. With this formulated and moderated shuffling and updating with new patterns, the hash never expires and continues to create chains of endless bytes of data.
Tush
If you need to store/transmit your huge dictionary, the point of compression is moot...
Eiko
@Tush It doesn't matter what "complex maths" you use - you still have to explain how you're not violating the pigeonhole principle. You simply can't make every possible string shorter in a reversible fashion - there aren't enough shorter strings.
Nick Johnson
+3  A: 

Information theory is the scientific field that addresses questions of this kind. It also lets you calculate the minimum number of bits needed to store a compressed message (with lossless compression). This lower bound is known as the entropy of the message.

The entropy of a piece of text can be calculated using a Markov model. Such a model uses information about how likely a certain sequence of characters of the alphabet is.
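
As a crude illustration, here is a zeroth-order estimate based on character frequencies alone (the sample string is arbitrary); a proper Markov model would condition on the preceding characters and typically gives a tighter figure:

    import math
    from collections import Counter

    def entropy_bits_per_char(text: str) -> float:
        """Zeroth-order (frequency-only) entropy estimate, in bits per character."""
        counts = Counter(text)
        total = len(text)
        return -sum(c / total * math.log2(c / total) for c in counts.values())

    sample = "this is a small sample of english text used to estimate entropy"
    print(f"~{entropy_bits_per_char(sample):.2f} bits/char, vs 8 bits/char for raw ASCII")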

0xA3
Markov models don't calculate entropy - at best, they provide an estimate, in the form of an upper bound.
Nick Johnson
+1  A: 

You may be thinking of fractal compression, which effectively works by storing a formula and start values. The formula is iterated a certain number of times and the result is an approximation of the original input.

This allows for high compression but is lossy (output is close to input but not exactly the same) and compression can be very slow. Even so, ratios of 170:1 are about the highest achieved at the moment.
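
As a toy sketch of the decoder side of that idea (not a real fractal image codec), the classic Barnsley fern shows how a couple of dozen stored numbers, iterated many times, regenerate an arbitrarily detailed picture:

    import random

    # Four affine maps (x, y) -> (a*x + b*y + e, c*x + d*y + f), chosen with the
    # given probabilities; about 28 stored numbers regenerate the whole fern.
    maps = [
        ( 0.00,  0.00,  0.00, 0.16, 0.00, 0.00, 0.01),
        ( 0.85,  0.04, -0.04, 0.85, 0.00, 1.60, 0.85),
        ( 0.20, -0.26,  0.23, 0.22, 0.00, 1.60, 0.07),
        (-0.15,  0.28,  0.26, 0.24, 0.00, 0.44, 0.07),
    ]
    weights = [m[6] for m in maps]

    x, y = 0.0, 0.0
    points = []
    for _ in range(100_000):
        a, b, c, d, e, f, _p = random.choices(maps, weights=weights)[0]
        x, y = a * x + b * y + e, c * x + d * y + f
        points.append((x, y))
    print(f"{len(points)} points generated from {len(maps) * 7} stored numbers")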

Basiclife
I was thinking about a hash-based compression. The bytes are hashed one by one to create their individual unique checksums, which are then somehow merged into one primary parent hash. In this way a data chain is created, sequentially spanning millions of bytes.
Tush
That's certainly done as a way of validating data (checksums) but as has been explained elsewhere, the pigeon-hole principle precludes what you're after for arbitrary data.
Basiclife
+2  A: 

The air analogy is very wrong.

When you compress air you make the molecules come closer to each other, each molecule is given less space.

When you compress data you cannot make the bits smaller (unless you put your hard drive in a hydraulic press). The closest you can get to actually making bits smaller is increasing the bandwidth of a network, but that is not compression.

Compression is about finding a reversible formula for calculating data. The "rules" of data compression go something like this:

  • The algorithm (including any standard start dictionaries) is shared beforehand and not included in the compressed data.
  • All startup parameters must be included in the compressed data, including:
    • Choice of algorithmic variant
    • Choice of dictionaries
    • All compressed data
  • The algorithm must be able to compress/decompress all possible messages in your domain (like plain text, digits or binary data).

To get a feeling for how compression works, you may study some examples, like run-length encoding and Lempel-Ziv-Welch.
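
For example, a minimal run-length encoder/decoder, the simplest of those schemes, might look like this sketch:

    def rle_encode(data: str) -> list[tuple[str, int]]:
        """Run-length encoding: store each character together with its run length."""
        runs: list[tuple[str, int]] = []
        for ch in data:
            if runs and runs[-1][0] == ch:
                runs[-1] = (ch, runs[-1][1] + 1)
            else:
                runs.append((ch, 1))
        return runs

    def rle_decode(runs: list[tuple[str, int]]) -> str:
        return "".join(ch * count for ch, count in runs)

    text = "aaaaabbbccccccccd"
    encoded = rle_encode(text)
    print(encoded)                      # [('a', 5), ('b', 3), ('c', 8), ('d', 1)]
    print(rle_decode(encoded) == text)  # True: reversible, but only wins when runs exist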

Albin Sunnanbo
When you compress air the molecules get tinier? Are you sure about that?
Mark Byers
@Mark: Sorry about the sloppy expression, the size of the actual molecule is the same, it is the space "allocated" to each molecule that shrinks.
Albin Sunnanbo
+1 for the nice explanation of the bad analogy with air compression. I will add another simple explanation of why air compression and data compression are different things: after air compression, the total number of molecules doesn't change; after data compression, the total number of bits decreases.
0x69
A: 

This is a bit off topic, but I'm reminded of the Broloid compression joke thread that appeared on USENET ... back when USENET was still interesting.

Seriously, anyone who claims to have a magical compression algorithm that reduces any megabyte-sized text file to a few hundred bytes is either:

  • a scammer,
  • someone who doesn't understand basic information theory, or
  • both.
Stephen C