> Beginning in late November 2023, the threat actor used a password spray attack to compromise a legacy non-production test tenant account and gain a foothold, and then used the account’s permissions to access a very small percentage of Microsoft corporate email accounts, including members of our senior leadership team and employees in our cybersecurity, legal, and other functions, and exfiltrated some emails and attached documents.
https://msrc.microsoft.com/blog/2024/01/microsoft-actions-following-attack-by-nation-state-actor-midnight-blizzard/
@rysiek
That said, solving that problem would not necessarily address the original issue:
Imagine a hypothetical training procedure that always converges onto some subspace of models, with a uniform distribution over that subspace. Imagine that 0.01% of that subspace is malicious in some way. Then there is no difference in probability density between the (very small) malicious part and the rest of the potential outputs of training.
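To make that concrete, here's a toy numerical sketch. The discretized model space, the exactly uniform training distribution, and the 0.01% figure are all just the hypothetical above, nothing real:

```python
# Toy illustration (hypothetical setup, not any real training procedure):
# if training puts uniform probability on every model in its output subspace,
# then likelihood-under-training cannot distinguish a model deliberately
# picked from the malicious 0.01% from an honestly trained one.
import numpy as np

rng = np.random.default_rng(0)

n_models = 1_000_000                          # discretized "output subspace" of training
malicious = np.zeros(n_models, bool)
malicious[: n_models // 10_000] = True        # 0.01% of the subspace is malicious

p_train = np.full(n_models, 1.0 / n_models)   # uniform training distribution

honest_pick = rng.integers(n_models)                    # what training would hand you
adversarial_pick = rng.choice(np.where(malicious)[0])   # attacker picks from the bad part

# Both picks have exactly the same probability density under the training
# distribution, so a "did this come from normal training?" likelihood check
# assigns them identical scores.
print(p_train[honest_pick], p_train[adversarial_pick])  # 1e-06 1e-06
```

In other words, under that (admittedly idealized) assumption, a density check tells you nothing about which part of the subspace the model was picked from.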
Figuring out that the model we have was specifically chosen from _that_ part of the potential output space requires some understanding of how that part is special, and if we have that understanding, we can ignore the question of whether the model came from the known training process.
That said, finding a malicious model _that is also a reasonably probable output of the normal training process_ might be computationally hard, or might be impossible (I don't know if people have tried to find adversarial models under that additional constraint).
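If anyone wants to poke at that constraint, here's a deliberately cartoonish sketch of the trade-off: a linear model, an isotropic Gaussian standing in for "the distribution of normal training outputs", and a hinge-loss backdoor objective, all invented purely for illustration:

```python
# Toy sketch of the constrained search hinted at above (everything here is
# made up; it only illustrates the tension, not how this would look for a
# real model or a real training distribution).
import numpy as np

rng = np.random.default_rng(1)
d = 50
w_honest = rng.normal(size=d)           # pretend: a normally trained model
sigma = 0.1                             # pretend: spread of normal training outputs
trigger = rng.normal(size=d)            # input whose prediction the attacker wants to flip
target = -np.sign(trigger @ w_honest)   # desired (wrong) sign on the trigger

def search(lam, steps=5000, lr=1e-3):
    """Minimize backdoor hinge loss + lam * plausibility penalty."""
    w = w_honest.copy()
    for _ in range(steps):
        margin = target * (trigger @ w)
        grad = (-target * trigger if margin < 1 else np.zeros(d)) \
               + lam * (w - w_honest) / sigma**2
        w -= lr * grad
    return w

for lam in (1.0, 0.01):
    w = search(lam)
    flipped = np.sign(trigger @ w) == target
    sigmas_away = np.linalg.norm(w - w_honest) / sigma
    print(f"lam={lam}: trigger flipped={flipped}, ~{sigmas_away:.1f} sigma from honest model")
```

With a strong plausibility penalty the backdoor never takes, and with a weak one it takes but lands many standard deviations away from the honest model, i.e. somewhere normal training would essentially never go. Whether anything like that trade-off holds for real networks is exactly the open question.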