views:

76

answers:

5

I have a file of some random text, size = 27 GB, and after compression it becomes 40 MB or so.

And a 3.5 GB SQL file becomes 45 MB after compression.

But a 109 MB text file becomes only 72 MB after compression, so what could be wrong with it?

Why does it compress so little? It should be 10 MB or so, or am I missing something?

All the files, as far as I can see, are English text only plus some punctuation symbols (/ , . - = + etc.).

Can you tell me why?

If not, can you tell me how I can super-compress a text file?

I can code in PHP, no problem with that.

+4  A: 

The compression ratio of a file depends on its content.

Most compression algorithms work by replacing repeated data with a single copy plus a note of how many times it repeats.

For example, a file containing the letter a 1,000,000 times can be compressed far more than a file with completely random content.

For more specific advice, please provide more details about the files' contents.
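
To make that concrete, here is a minimal sketch using PHP's gzcompress (from the bundled zlib extension); the exact byte counts depend on the zlib version, but the contrast is dramatic:

<?php
// Compare a highly repetitive string with random data of the same length.

$repetitive = str_repeat('a', 1000000);   // the letter "a" a million times
$random     = random_bytes(1000000);      // a million random bytes

printf("repetitive: %d -> %d bytes\n",
       strlen($repetitive), strlen(gzcompress($repetitive, 9)));
printf("random:     %d -> %d bytes\n",
       strlen($random), strlen(gzcompress($random, 9)));

// The repetitive string shrinks to around a kilobyte, while the random data
// stays roughly the same size (compression can even make it slightly larger).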

SLaks
+1  A: 

Compression works by removing duplication in the input data. Your 3.5 GB file shrinks so much because it contains a lot of duplicate data, while your smaller file isn't compressed as much because it doesn't contain nearly as much duplication.

If you want to understand how compression works in most zipping utilities, look at Wikipedia's Lempel-Ziv-Welch article, which describes the algorithm most of these utilities are built on.

PHP is likely the wrong choice for implementing compression yourself, because it will be extremely slow compared to the perfectly good existing C libraries that are already part of PHP itself.
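
For example, here is a minimal sketch that compresses a file using the zlib functions already bundled with PHP (the file names are just placeholders):

<?php
// Compress a file with PHP's built-in zlib support instead of hand-rolling a compressor.

$in  = fopen('input.txt', 'rb');
$out = gzopen('input.txt.gz', 'wb9');    // "9" = maximum compression level

while (!feof($in)) {
    gzwrite($out, fread($in, 1 << 20));  // stream in 1 MB chunks to limit memory use
}

fclose($in);
gzclose($out);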

Billy ONeal
A: 

Generally, the compression ratio depends on how much similarity and how many patterns the algorithm can find in the file. If all the files contain English text, those figures are strange. I strongly suspect that the files with an extreme compression ratio contain large chunks of repeated text segments.
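
One quick way to test that suspicion is to compress a sample of each file and compare the ratios; a rough sketch (file name and sample size are arbitrary):

<?php
// Compress the first 10 MB of a file and report the ratio. Plain English prose
// usually lands around 30-40% with gzip-style compression; a ratio near 1%
// points to large repeated chunks, as in the 27 GB example.

$sample = file_get_contents('input.txt', false, null, 0, 10 * 1024 * 1024);
$ratio  = strlen(gzcompress($sample, 9)) / strlen($sample);

printf("compressed to %.1f%% of the original size\n", $ratio * 100);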

aioobe
+1  A: 

Think of it this way...if you have a file that contains:

abcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabc

The file essentially just stores abc times 18

On the other hand, this file:

abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz

could only store:

abcdefghijklmnopqrstuvwxyz times 2

Thus, the second file compresses to a larger size than the first, even though it is shorter to begin with.
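
You can check this with PHP's gzcompress; the exact byte counts depend on the zlib version, but the first string reliably comes out smaller:

<?php
// The longer but more repetitive string compresses to fewer bytes
// than the shorter but less repetitive one.

$first  = str_repeat('abc', 18);                        // 54 bytes
$second = str_repeat('abcdefghijklmnopqrstuvwxyz', 2);  // 52 bytes

printf("first:  %d -> %d bytes\n", strlen($first),  strlen(gzcompress($first, 9)));
printf("second: %d -> %d bytes\n", strlen($second), strlen(gzcompress($second, 9)));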

mattbasta
A: 

Compression works by removing redundancy in data. The definitive place to start is probably Huffman coding, which is one of the first seminal works directly on the problem, but you may care to dig further back to Shannon's original work on information theory.

These are not new concepts: they first gained significant interest back in the 1940s and 50s, when people were interested in transmitting data efficiently over very limited channels. The subject is not just of interest to computing either; there are some very deep connections with entropy and other fundamental physics. For instance, it turns out that perfectly compressed data is indistinguishable from white noise.
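
As a rough illustration of the Shannon side of this, the sketch below estimates the zero-order entropy of a file in bits per byte, an approximate lower bound for a byte-by-byte coder such as Huffman coding (real compressors do better by also exploiting repeated phrases; the file name is a placeholder):

<?php
// Estimate zero-order entropy: -sum p(b) * log2 p(b) over the byte values b.

$data   = file_get_contents('input.txt');
$counts = count_chars($data, 1);     // frequency of each byte value that occurs
$total  = strlen($data);

$entropy = 0.0;
foreach ($counts as $count) {
    $p = $count / $total;
    $entropy -= $p * log($p, 2);
}

printf("%.2f bits per byte => at best about %.0f%% of the original size\n",
       $entropy, $entropy / 8 * 100);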

Cruachan