Lately I've been dealing with compression-related algorithms, and I was wondering which is the best compression ratio that can be achievable by lossless data compression.
So far, the only source I could find on this topic was Wikipedia:

"Lossless compression of digitized data such as video, digitized film, and audio preserves all the information, but can rarely do much better than 1:2 compression because of the intrinsic entropy of the data."
Unfortunately, Wikipedia's article doesn't contain a reference or citation to support this claim. I'm not a data-compression expert, so I'd appreciate any information you can provide on this subject, or if you could point me to a more reliable source than Wikipedia.
I am not sure if anyone has yet explained why the magical number seems to be exactly 1:2 and not, for example, 1:1.1 or 1:20.
One reason is that in many typical cases almost half of the digitised data is noise, and noise (by definition) cannot be compressed.
I did a very simple experiment:
I took a grey card. To the human eye, it looks like a plain, neutral piece of grey cardboard; in particular, it carries no information.
And then I took a normal scanner, exactly the kind of device that people might use to digitise their photos.
I scanned the grey card. (Actually, I scanned the grey card together with a postcard. The postcard was there for sanity-checking so that I could make sure the scanner software does not do anything strange, such as automatically add contrast when it sees the featureless grey card.)
I cropped a 1000x1000 pixel part of the grey card, and converted it to greyscale (8 bits per pixel).
What we have now should be a fairly good example of what happens when you study a featureless part of a scanned black & white photo, for example, clear sky. In principle, there should be exactly nothing to see.
However, with a larger magnification, it actually looks like this:
There is no clearly visible pattern, but it does not have a uniform grey colour. Part of it is most likely caused by imperfections in the grey card, but I would assume that most of it is simply noise produced by the scanner (thermal noise in the sensor cell, the amplifier, the A/D converter, etc.). It looks pretty much like Gaussian noise; here is the histogram (on a logarithmic scale):
Now if we assume that each pixel has its shade picked i.i.d. from this distribution, how much entropy do we have? My Python script told me that we have as much as 3.3 bits of entropy per pixel. And that's a lot of noise.
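The entropy computation itself is short; here is a sketch of what such a script might look like. The actual grey-card scan is not available here, so the example generates synthetic Gaussian noise as a stand-in (the standard deviation of 2.4 grey levels is a hypothetical value, chosen so that the result lands near 3.3 bits per pixel):

```python
import numpy as np

def entropy_bits_per_pixel(img):
    """Empirical Shannon entropy in bits per pixel, treating each pixel as
    an i.i.d. draw from the observed 8-bit intensity histogram."""
    counts = np.bincount(img.ravel(), minlength=256)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

# Stand-in for the 1000x1000 grey-card crop: mid-grey plus Gaussian noise.
rng = np.random.default_rng(0)
img = np.clip(rng.normal(128, 2.4, size=(1000, 1000)).round(),
              0, 255).astype(np.uint8)

print(entropy_bits_per_pixel(img))  # around 3.3 bits per pixel
```

For a real scan you would simply load the cropped greyscale image into `img` instead of synthesising it.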
If this really were the case, it would imply that no matter which compression algorithm we use, the 1000x1000 pixel bitmap could be compressed, in the best case, into a 412500-byte file. And what happens in practice? I got a 432018-byte PNG file, which is pretty close.
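The arithmetic behind that lower bound is straightforward: by Shannon's source coding theorem, a million i.i.d. pixels of entropy H bits each need at least H million bits on average. A quick sanity check:

```python
entropy = 3.3                 # measured bits per pixel
pixels = 1000 * 1000
raw_bits = 8 * pixels         # uncompressed: 8 bits per pixel

bound_bytes = entropy * pixels / 8          # Shannon lower bound on file size
best_ratio = raw_bits / (entropy * pixels)  # best achievable compression ratio

print(bound_bytes)   # 412500.0
print(best_ratio)    # about 2.42, i.e. at best roughly 1:2.4
```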
If we over-generalise slightly, it seems that no matter which black & white photos I scan with this scanner, I will get the sum of the useful information in the photo and roughly 3 bits per pixel of incompressible scanner noise.
Now even if your compression algorithm squeezes the useful information into far fewer than 1 bit per pixel, you will still have as much as 3 bits per pixel of incompressible noise, while the uncompressed version is 8 bits per pixel. So the compression ratio will be in the ballpark of 1:2, no matter what you do.
Another example, with an attempt to find over-idealised conditions:
And what was the end result? It looks much better than what I got from the scanner; the noise is less pronounced, and there is exactly nothing to be seen. Nevertheless, the Gaussian noise is there:
And the entropy? 2.7 bits per pixel. File size in practice? 344923 bytes for 1M pixels. In a truly best-case scenario, with some cheating, we pushed the compression ratio to 1:3.
Of course all of this has exactly nothing to do with TCS research, but I think it is good to keep in mind what really limits the compression of real-world digitised data. Advances in the design of fancier compression algorithms and raw CPU power are not going to help; if you want to save all the noise losslessly, you cannot do much better than 1:2.