Interpretability and model internals
What large language models represent internally, and how those representations give rise to behavior in deployment-realistic settings.
Capability evaluation
Paired benchmarks with directional controls for failure modes — deception, scheming, sandbagging — that resist naive measurement.
The human side of the loop
How expectations form between humans and AI systems, where those expectations break under load, and how evaluation can account for the human side of the loop.