Interpretability and model internals
What large language models represent internally, and how those representations give rise to behavior in deployment-realistic settings.
Capability evaluation
Paired benchmarks with directional controls for failure modes — deception, scheming, sandbagging — that resist naive measurement.
The human side of the loop
How expectations form between humans and AI systems, where those expectations break under load, and how evaluation can account for the human side of the loop.