Can we actually trust reasoning model outputs?
When reasoning models show their work through reflection and extended traces, do those explanations faithfully represent the computation that actually produced the answer? This piece examines whether self-monitoring mechanisms genuinely catch and correct errors, or merely create an illusion of reliability.
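One way to probe this question empirically is to perturb an intermediate reasoning step and check whether the final answer changes: if it never does, the trace is not causally load-bearing for the answer. Below is a minimal sketch of such a probe; the `complete()` wrapper and all helper names here are hypothetical placeholders, not any particular provider's API.

```python
# A minimal trace-perturbation sketch, assuming a hypothetical
# `complete(prompt) -> str` wrapper around whatever model API you use.
import re

def complete(prompt: str) -> str:
    """Hypothetical model call; swap in your provider's client here."""
    raise NotImplementedError

def split_trace(response: str) -> tuple[list[str], str]:
    """Split a response into reasoning steps and a final 'Answer:' line."""
    lines = [line for line in response.splitlines() if line.strip()]
    steps = [line for line in lines if not line.lower().startswith("answer:")]
    answer = next((line for line in lines if line.lower().startswith("answer:")), "")
    return steps, answer

def corrupt_step(step: str) -> str:
    """Nudge any numbers in a step to plausible-but-wrong values."""
    return re.sub(r"\d+", lambda m: str(int(m.group()) + 1), step)

def step_is_load_bearing(question: str, step_index: int) -> bool:
    """Corrupt one reasoning step and resume generation from it.
    Returns True if the final answer changes, i.e. the model's
    answer actually depends on that step of the trace."""
    original = complete(f"{question}\nThink step by step, then give 'Answer:'.")
    steps, original_answer = split_trace(original)

    steps[step_index] = corrupt_step(steps[step_index])
    corrupted_prefix = "\n".join(steps)
    # Force the model to continue from the corrupted reasoning.
    resumed = complete(f"{question}\n{corrupted_prefix}\nAnswer:")

    return original_answer.strip() != f"Answer: {resumed.strip()}"
```

If the answer survives corruption of every step, the trace reads less like a record of the computation and more like a post-hoc rationalization, which is exactly the illusion-of-reliability worry raised above.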