To my lay understanding, most AI models validate inputs to make sure the user isn't asking for "bad" outputs. In network services we know it's impossible to blacklist every bad input. Do any models also evaluate their outputs, like "hah, looks like you almost got me to tell you how to make meth, but I'm not gonna"? The API server equivalent might be "this endpoint expects to return 3 things at most; if I'm about to return 10,000, there's an error."
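
To make the analogy concrete, here's a rough sketch of what I mean by checking the output rather than the input (the names like `handle_lookup` and `MAX_RESULTS` are just invented for illustration, not any real API):

```python
MAX_RESULTS = 3  # this endpoint should never return more than 3 items


def fetch_matches(query: str) -> list[str]:
    # stand-in for a real lookup that unexpectedly returns far too much
    return [f"{query}-{i}" for i in range(10_000)]


def handle_lookup(query: str) -> list[str]:
    results = fetch_matches(query)
    # output-side check: if the response violates our own expectations,
    # treat it as an error instead of sending it to the client
    if len(results) > MAX_RESULTS:
        raise RuntimeError(
            f"refusing to return {len(results)} items; expected at most {MAX_RESULTS}"
        )
    return results


if __name__ == "__main__":
    try:
        handle_lookup("example")
    except RuntimeError as err:
        print(f"blocked at the output stage: {err}")
```

The point being that the check happens after the response is built, not by trying to guess from the request whether it will go wrong.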