Epistemic status: needs testing to be confirmed, but it seems right.
After thinking about it for a while, I realized that humans actually go and search out information they don't have memorized. Most modern LLMs, by contrast, have (almost) all of the information they use memorized, rather than relying on external resources. If an LLM were trained so that all of the semantic information is presented on its input, along with the prompt, I imagine it would not memorize that semantic information (i.e. not store it encoded in its parameters), but *would* store meta-information about that semantic information, and would store the things it actually has to model (how words go together, the syntax of the language, etc.).
So this might be a viable way to train a model so that its parameters hold information primarily about some particular aspect of its training set, rather than the training set verbatim. In the LLM case: you train the model so that it models how facts interact, rather than both the facts and how they interact.
To train a model like this, you can fortunately use big LLMs already in existence, since they already act like big databases. You can also use internet searches.
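As a very rough sketch of what the training data could look like (Python; `query_big_llm` and `web_search` are hypothetical stand-ins for an existing LLM and an internet search, stubbed out here so the example runs):

```python
# Rough sketch: build training examples where the facts are supplied on the
# input. `query_big_llm` and `web_search` are hypothetical stand-ins for an
# existing large LLM and an internet search; they return canned results here
# so the sketch actually runs.

def query_big_llm(question: str) -> list[str]:
    # Placeholder for asking an existing LLM, which already acts like a big database.
    return ["The capital of the USA is Washington DC."]

def web_search(query: str) -> list[str]:
    # Placeholder for an internet search.
    return []

def build_example(prompt: str, target: str) -> dict:
    """Prepend the retrieved facts so the model gains nothing by memorizing them."""
    facts = query_big_llm(f"List the facts needed to answer: {prompt}") + web_search(prompt)
    return {
        # The model conditions on the facts plus the prompt...
        "input": "FACTS:\n" + "\n".join(facts) + f"\n\nPROMPT:\n{prompt}",
        # ...and the training loss is computed only on the target.
        "target": target,
    }

example = build_example("What is the capital of the USA?", "Washington DC")
```

The point is that the facts are always available in the context, so memorizing them buys the model nothing during training.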
I think you could probably have a system of models, each trained to store a different sort of information. For instance, you could have a database model that stores facts about the world (e.g. the capital of the USA is Washington DC) but does no world modeling, alongside a world-modeling model that stores how things interact and procedural information (e.g. if I splash water on myself I'll get cold), and integrate the two into a unified model.
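A hand-wavy sketch of how the integration could work (both models are stubbed out here; in the real system they would be the two separately trained components just described):

```python
# Hand-wavy sketch of integrating a fact-storing "database model" with a
# "world-modeling model". Both are stubs standing in for trained models.

def database_model(query: str) -> str:
    # Stub for a model trained only to store facts (no world modeling).
    lookup = {"capital of the USA": "Washington DC"}
    return lookup.get(query, "unknown")

def world_model(prompt: str, facts: list[str]) -> str:
    # Stub for a model trained only on how things interact, with the relevant
    # facts always supplied in its context.
    return f"(reasoning over {len(facts)} supplied fact(s) to answer: {prompt})"

def unified_model(prompt: str, fact_queries: list[str]) -> str:
    # The world model never needs the facts in its parameters; it only needs
    # to know how to use whatever the database model hands it.
    facts = [f"{q}: {database_model(q)}" for q in fact_queries]
    return world_model(prompt, facts)

print(unified_model("Where does the US president live?", ["capital of the USA"]))
```

In practice the world-modeling model would presumably generate the fact queries itself; hard-coding them here just keeps the sketch short.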
This is also related to biased models. If you train an LLM on one particular kind of prompt, you bias the information encoded in its parameters toward that kind of prompt. For instance, an LLM with N parameters trained on a big training set B (e.g. a set of questions and answers about geography) will be able to achieve a lower loss on B than an LLM with N parameters trained on a set A (e.g. a set of all sorts of questions and answers) that is a superset of B. The LLM trained on just B is biased towards the question-answer pairs in B. Now, there's a risk of the B model overfitting to B if B is small enough, but I'm assuming B is a huge set.
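Stated a bit more formally (this is the claim, not something I've proven): for two LLMs with the same parameter count N, with $\theta_B$ trained on B and $\theta_A$ trained on the superset A,

$$\mathbb{E}_{x \sim B}\!\left[\ell(\theta_B, x)\right] \;\le\; \mathbb{E}_{x \sim B}\!\left[\ell(\theta_A, x)\right],$$

where $\ell$ is the per-example loss. The B-only model gets to spend all of its capacity on B.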
A model biased toward, for example, solving equations would synergize with a model biased toward memorizing notable equations.