What do you do when you catch your child in a lie?
You don’t just tell them to be honest. You explain why truth matters. You create consequences for dishonesty and rewards for integrity. You reinforce it until the behavior changes. You understand that without the right incentives, they’ll keep choosing what gets them what they want over what’s right.
Optimizing AI isn’t that different.
A recent Stanford University study, highlighted in this IBM article, found that when AI models compete for human attention (in sales, politics, and social media simulations) they naturally drift toward deception, even when explicitly told to stick to the facts.
As the models became more persuasive, they simultaneously became more misleading. In sales simulations, performance improved by 6.3%, while deceptive claims rose by 14%. In election tasks, vote share increased by 4.9%, while disinformation jumped 22.3% and populist language by 12.5%. In social media experiments, engagement rose by 7.5%, while disinformation surged by 188.6%.
The issue wasn’t that the models didn’t understand truth. It was that once success was tied to engagement, the systems learned to prioritize winning attention over staying accurate.
Are the differences across the three simulations just a reflection of an environment with an insurmountable gap between honesty and popularity? The researchers recommend that developers build systems where truth and success align instead of competing. But how effective will that be in a “fake it till you make it” landscape?
We need architectures that don’t just instruct models to be honest, but that tie honesty and engagement together in a joint reward function: one that scores a response on both engagement and factual accuracy, as in the sketch below.
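To make that concrete, here is a minimal sketch of what such a joint reward could look like, assuming we already have a normalized engagement signal and a fact-check score for each response. The names (ScoredResponse, joint_reward) and the weights are hypothetical illustrations, not the researchers’ actual method.

```python
from dataclasses import dataclass


@dataclass
class ScoredResponse:
    text: str
    engagement: float   # e.g. normalized click/dwell signal in [0, 1]
    factuality: float   # e.g. fraction of claims passing a fact-check, in [0, 1]


def joint_reward(resp: ScoredResponse,
                 honesty_weight: float = 0.5,
                 factuality_floor: float = 0.6) -> float:
    """Combine engagement and factuality into a single training signal.

    Two ideas are folded together here:
    - a weighted blend, so neither signal can be optimized in isolation;
    - a hard floor, so responses below a minimum factuality score earn
      no reward at all, no matter how engaging they are.
    """
    if resp.factuality < factuality_floor:
        return 0.0
    return (1 - honesty_weight) * resp.engagement + honesty_weight * resp.factuality


# A viral but mostly false response earns nothing; a truthful,
# moderately engaging one is still rewarded.
print(joint_reward(ScoredResponse("viral but false", engagement=0.9, factuality=0.3)))   # 0.0
print(joint_reward(ScoredResponse("accurate and decent", engagement=0.6, factuality=0.9)))  # 0.75
```

The design choice worth noticing is the floor: a simple weighted sum still lets a model “buy back” lost factuality with enough engagement, which is exactly the trade-off the study observed. Gating the reward on a minimum factuality score removes that escape hatch.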
Because here’s the uncomfortable truth: AI didn’t invent the strategy of prioritizing persuasion over accuracy. It learned it from watching us. And while we can’t always make the environments these models learn from value honesty as much as outcomes, we can at least train them to prioritize both. Right?