For a map M from an input set X to an output set Y, I'd imagine that for a given output y, the set of inputs mapping to y with probability greater than some reasonable threshold is typically tiny compared to the size of X. I want to say: i.e., output-conditional input probabilities are typically highly concentrated (this might not be an accurate compression of the first sentence). For instance, in text prediction conditioned on a large input sequence (say the last 1000 characters over an alphabet of size n), for some predicted character y, almost all of the n^1000 possible input sequences have probability essentially equal to 0, while some relatively small set of input sequences has probability much closer to 1
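
A toy way to see this (my own sketch, not anything rigorous): enumerate every context over a tiny alphabet, score each under a made-up conditional model, and count how many clear a threshold. The "meaningful" set and the probabilities are hypothetical placeholders, purely for illustration.

```python
# Toy illustration: over a tiny alphabet, score every possible length-L
# context under a stand-in model P(context | y) that favours a small
# "meaningful" set, and count how many exceed a threshold.
import itertools

ALPHABET = "abcd"
L = 6
MEANINGFUL = {"abcabc", "aabbcc", "abcddd"}  # hypothetical high-probability contexts

def p_context_given_y(ctx: str) -> float:
    """Stand-in conditional probability P(context | y) for a fixed output y."""
    return 0.3 if ctx in MEANINGFUL else 1e-9

threshold = 0.01
above = sum(1 for ctx in map("".join, itertools.product(ALPHABET, repeat=L))
            if p_context_given_y(ctx) > threshold)
print(f"{above} of {len(ALPHABET) ** L} contexts exceed the threshold")  # 3 of 4096
```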

In fact, almost all of those length-1000 input sequences are meaningless noise. The same is true for images: imagine an image with most of it covered up. The little bit you can see is comprehensible, the set of whole images consistent with it is huge, and most of those images are random noise apart from the visible part. In machine-learning terms, a more efficient map M would be one that can quickly weed out the possible input sequences that don't effectively contribute to inferring the output

A perfect Bayesian agent would be able to assign the correct probabilities to possible outcomes based on all of the information it's seen so far, regardless of how huge and complex the thing it's predicting is, or how much information it's seen. And in machine-learning terms we can in principle build such an agent with a simple 2-deep ANN (one hidden layer), but the computational cost of doing so for even simple systems can be extreme. The challenge is really how to make such an agent efficient
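
To make the 2-deep point concrete (reading "2-deep" in the universal-approximation sense: one hidden layer, as many units as you can afford), here's a minimal NumPy sketch fit to a toy 1-D target. The width, target, and learning rate are arbitrary illustration choices; on anything non-toy, the required width is exactly where the extreme computational cost shows up.

```python
# A one-hidden-layer (2-deep) network trained by plain gradient descent
# on a toy target. Pure NumPy, no framework.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(256, 1))
y = np.sin(X)                              # toy target to approximate

W1 = rng.normal(0, 1.0, (1, 64)); b1 = np.zeros(64)   # hidden layer
W2 = rng.normal(0, 0.1, (64, 1)); b2 = np.zeros(1)    # output layer

for step in range(2000):
    h = np.tanh(X @ W1 + b1)               # hidden activations
    pred = h @ W2 + b2
    err = pred - y                         # dL/dpred for 0.5 * MSE
    # backprop through both layers
    gW2 = h.T @ err / len(X); gb2 = err.mean(0)
    dh = (err @ W2.T) * (1 - h ** 2)       # tanh derivative
    gW1 = X.T @ dh / len(X); gb1 = dh.mean(0)
    for p, g in ((W1, gW1), (b1, gb1), (W2, gW2), (b2, gb2)):
        p -= 0.1 * g

print("final MSE:", float((err ** 2).mean()))
```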

One way to make a map M more efficient is to assume it is a composition of maps, and make each constituent map more efficient. For example, let's write M as a composition Mb ∘ Ma, where Ma takes all n^1000 possible inputs and produces n^100 possible outputs, and Mb takes those n^100 possible outputs and produces M's correct output. If Ma is fast, and Mb's cost scales the same way the original M's did but over the much smaller intermediate set, then the new M is almost certainly faster than the original. In this case, you're effectively weeding out (using Ma) the inputs that don't contribute to the output: all the inputs that have a low probability of producing a given output are filtered out by Ma
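
Here's roughly what that composition looks like in code. Ma and Mb below are hypothetical stand-ins, just to show the structure M = Mb ∘ Ma; loosely, this is the role that encoder or bottleneck stages play in deep networks.

```python
# Composition M = Mb . Ma: Ma cheaply collapses the huge raw input space
# into a much smaller intermediate space, and Mb does the expensive work
# only on that smaller space. Both maps here are toy placeholders.
from typing import Callable

def compose(mb: Callable, ma: Callable) -> Callable:
    return lambda x: mb(ma(x))

def Ma(seq: str) -> str:
    # Crude filter: keep only the last 100 characters, discarding the
    # part of the n^1000 input space that barely affects the output.
    return seq[-100:]

def Mb(ctx: str) -> str:
    # The "real" predictor, now over the n^100 intermediate space.
    return "e" if ctx.endswith("th") else " "   # toy prediction rule

M = compose(Mb, Ma)
print(M("x" * 900 + "the quick brown fox jumped over th"))  # -> "e"
```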

A noise-correcting function F, which takes an input sequence with random errors in it and produces an output with the errors corrected, is another type of map that reduces the effective size of an input set. For example, consider all the possible character sequences constituting correct and meaningful English text but with misspellings. Since each sequence without misspellings has many possible misspellings, the number of sequences with misspellings is far greater than the number without, and so an F that corrects the misspellings compresses the input set (with misspellings) to a much smaller set
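
A minimal sketch of such an F, assuming a toy dictionary and single-edit corrections (this is nothing like a real spell checker): many misspelled variants collapse onto one canonical word, so F is many-to-one and shrinks the effective input set.

```python
# Toy error-correcting map F: send a word to a dictionary word at most one
# edit (deletion, substitution, insertion) away, if one exists.
DICTIONARY = {"dog", "dogs", "have", "tails"}

def edits1(word: str) -> set:
    """All strings one deletion, substitution, or insertion away."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    dels = {a + b[1:] for a, b in splits if b}
    subs = {a + c + b[1:] for a, b in splits if b for c in letters}
    ins = {a + c + b for a, b in splits for c in letters}
    return dels | subs | ins

def F(word: str) -> str:
    if word in DICTIONARY:
        return word
    candidates = edits1(word) & DICTIONARY
    return min(candidates) if candidates else word  # arbitrary tie-break

print(F("taills"), F("hve"), F("dogs"))  # -> tails have dogs
```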

Another interesting way things are compressed is with generalization. Statements like 'Fido has a tail', 'Fido has ears', 'Buddy has a tail', 'Buddy has ears', 'Fido is a dog', 'Buddy is a dog' can be compressed into 'Dogs have tails', 'Dogs have ears', 'Fido is a dog', 'Buddy is a dog'. And this is almost certainly one of the reasons why analogical comparison, classification, etc. were advantageous to evolve: it takes fewer resources when you use generalizations!
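
The saving is easy to make concrete: factor the per-individual facts into class-level facts plus class memberships. Here's a sketch using the Fido/Buddy examples above; the factoring rule (promote a property shared by every known member) is a deliberately naive assumption.

```python
# Compression by generalization: store shared properties once at the
# class level, keeping only the membership facts per individual.
individual_facts = {
    ("Fido", "has a tail"), ("Fido", "has ears"),
    ("Buddy", "has a tail"), ("Buddy", "has ears"),
}
memberships = {"Fido": "dog", "Buddy": "dog"}

class_facts = set()
for prop in {p for _, p in individual_facts}:
    # Promote a property to the class if every known member has it.
    if all((ind, prop) in individual_facts for ind in memberships):
        class_facts.add(("dog", prop))

compressed = class_facts | set(memberships.items())
print(len(individual_facts) + len(memberships), "->", len(compressed))  # 6 -> 4
```

The saving grows with the number of dogs: the two class-level facts are stored once, while only the membership list grows.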

It might not be provable in general (hah), but my feeling is that (in the context of intelligence) you always get generalization when you compress things, and vice versa
