For a map M from an input set X to an output set Y, I'd imagine that for a particular output y, the set of inputs mapping to y with probability above some reasonable threshold is typically tiny compared to the size of X. I want to say: i.e., output-conditional input probabilities are typically highly concentrated (this might not be an accurate compression of the first sentence). For instance, in text prediction given some large input sequence (say the last 1000 characters), for some predicted character y, almost all of the n^1000 possible input sequences have probability essentially equal to 0, but there is some relatively small set of possible input sequences with probabilities much closer to 1.
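Here's a tiny toy sketch of what I mean. All of the numbers and distributions below are made up for illustration (in particular, the skewed prior over inputs is the assumption doing the work); the point is just how concentrated P(x|y) ends up being:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy scale: a 6-character alphabet and length-4 contexts instead of n^1000.
n, length = 6, 4
num_inputs = n ** length                      # all possible input sequences (1296)

# Made-up model: a skewed prior p(x) (a few contexts carry most of the mass)
# and an arbitrary conditional p(y|x) over n possible output characters.
p_x = rng.dirichlet(np.full(num_inputs, 0.02))
p_y_given_x = rng.dirichlet(np.full(n, 0.5), size=num_inputs)

y = 0                                          # condition on one particular output
p_x_given_y = p_x * p_y_given_x[:, y]
p_x_given_y /= p_x_given_y.sum()

# How many inputs are needed to cover 99% of the posterior mass?
sorted_mass = np.sort(p_x_given_y)[::-1]
k = np.searchsorted(np.cumsum(sorted_mass), 0.99) + 1
print(f"{k} of {num_inputs} inputs cover 99% of P(x | y={y})")
```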
In fact, almost all of those length-1000 input sequences are meaningless noise. The same is true for images: imagine an image with most of it covered up. The little bit you can see is comprehensible, the set of possible images the whole thing could be is huge, and most of those images are random noise except for the part you can see. In machine-learning terms, a more efficient map M would be one that can quickly weed out the possible input sequences that don't effectively contribute to inferring the output.
A perfect Bayesian agent would be able to predict the correct probabilities for possible outcomes based on all of the information it's seen so far, regardless of how huge and complex the thing it's predicting is and how much information it's seen. And in machine-learning terms we can, in principle, make such an agent with a simple 2-deep ANN (universal approximation), but the computational cost of doing so for even simple systems can be extreme. The real challenge is how to make such an agent efficient.
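To make the cost point concrete, here's a brute-force toy version of a "perfect Bayesian" next-symbol predictor (not the 2-deep ANN version; the hypothesis class and noise model are just ones I picked for illustration). Even at context length k=2 it enumerates 16 hypotheses, and at k=10 it would already need 2^1024, which is the kind of blow-up I mean:

```python
import itertools
import numpy as np

alphabet = [0, 1]
k = 2
contexts = list(itertools.product(alphabet, repeat=k))                 # 4 contexts
hypotheses = list(itertools.product(alphabet, repeat=len(contexts)))   # 2^4 = 16 lookup tables

def posterior_predict(history, noise=0.1):
    """Exact posterior-weighted probability that the next symbol is 1."""
    log_post = np.zeros(len(hypotheses))
    for i, h in enumerate(hypotheses):
        table = dict(zip(contexts, h))
        for t in range(k, len(history)):
            ctx, nxt = tuple(history[t - k:t]), history[t]
            p = 1 - noise if table[ctx] == nxt else noise
            log_post[i] += np.log(p)
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    ctx = tuple(history[-k:])
    return sum(post[i] for i, h in enumerate(hypotheses)
               if dict(zip(contexts, h))[ctx] == 1)

print(posterior_predict([0, 1, 0, 1, 0, 1, 0]))   # high: the alternating pattern predicts 1
print(f"{len(hypotheses)} hypotheses at k=2; k=10 would need 2**1024")
```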
One way to make a map M more efficient is to assume it is a composition of maps and make each constituent map more efficient. For example, let's decompose M into a composition Mb ∘ Ma, where Ma takes any of the n^1000 possible inputs and produces one of n^100 possible outputs, and Mb takes those n^100 possible outputs and produces M's correct output. If Ma is fast, and Mb sits in the same speed class as the original M but runs over the much smaller input space, then the composed map is almost certainly faster than the original. In this case you're effectively weeding out (using Ma) the inputs that don't contribute to the output: all the inputs that have a low probability of producing a given output are filtered out by Ma.
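A toy sketch of that decomposition (the sizes, the table-lookup Mb, and the "only the last few characters matter" assumption are all made up for illustration):

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)

# Toy decomposition M = Mb ∘ Ma.
# Assume only the last 3 characters of a length-10 context actually matter.
ALPHABET, CONTEXT_LEN, REDUCED_LEN = 4, 10, 3

def Ma(context):
    """Cheap filter: discard the dimensions of the input that don't matter."""
    return tuple(context[-REDUCED_LEN:])

# Mb only has to be defined (or learned) on 4**3 = 64 reduced states,
# instead of the 4**10 ≈ 1e6 states the original M would have to cover.
Mb_table = {state: int(rng.integers(ALPHABET))
            for state in itertools.product(range(ALPHABET), repeat=REDUCED_LEN)}

def M(context):
    return Mb_table[Ma(context)]

context = tuple(int(c) for c in rng.integers(ALPHABET, size=CONTEXT_LEN))
print(M(context), "| Mb domain:", ALPHABET**REDUCED_LEN,
      "vs full domain:", ALPHABET**CONTEXT_LEN)
```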
Another interesting way things get compressed is through generalization. Statements like 'Fido has a tail', 'Fido has ears', 'Buddy has a tail', 'Buddy has ears', 'Fido is a dog', 'Buddy is a dog', etc. can be compressed into 'Dogs have tails', 'Dogs have ears', 'Fido is a dog', 'Buddy is a dog'. And this is almost certainly one of the reasons why analogical comparison, classification, etc. are advantageous to evolve: using generalizations takes fewer resources!
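A toy count of the compression (the extra dog names and properties are made up; only the counts matter):

```python
# Toy fact store, before and after generalizing.
dogs = ["Fido", "Buddy", "Rex", "Daisy"]
dog_properties = ["has a tail", "has ears", "has fur", "barks"]

# Without generalization: one fact per (individual, property) pair.
flat_facts = [(d, p) for d in dogs for p in dog_properties]

# With generalization: class-level facts plus one membership fact per individual.
general_facts = [("Dogs", p) for p in dog_properties] + \
                [(d, "is a dog") for d in dogs]

print(len(flat_facts), "facts flat vs", len(general_facts), "facts generalized")
# 16 vs 8 here; with D dogs and P properties it's D*P vs D+P.
```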
It might not be provable in general (hah), but my feeling is that, in the context of intelligence, you always get generalization when you compress things, and vice versa.