Adrià Garriga-Alonso

Making AI systems safe and interpretable

About

I work on AI alignment and mechanistic interpretability. My research on Automated Circuit Discovery (ACDC) helped establish circuit-based interpretability as a field, with roughly 400 citations and methods adopted across major AI labs. ACDC provides automated methods for discovering computational circuits in neural networks, reducing the manual labor previously required for mechanistic analysis.

Currently I'm building Dokimasia, tools that help humans realize their values when using computers by shielding them from unwanted and false information. Previously I was a research scientist at FAR AI (leading interpretability projects and building GPU infrastructure), a member of technical staff at Redwood Research, and a PhD student at Cambridge studying Bayesian neural networks under Carl Rasmussen.

My research interests center on three questions: How can we evaluate interpretability explanations? How can we find algorithmic explanations at lower cost? And what explains the behavior of agent-like AIs: what do they want?

Curriculum Vitae

Experience

2026–
Technical Co-founder
Dokimasia · Remote
2023–2025
Research Scientist
FAR AI · Berkeley, CA
2022–2023
Member of Technical Staff
Redwood Research · Berkeley, CA
2021
Summer Research Fellow
Center on Long-Term Risk
2019
Research Intern
Microsoft Research Cambridge

Education

2017–2021
PhD Machine Learning
University of Cambridge · Supervisor: Carl E. Rasmussen
2016–2017
MSc Computer Science (Distinction)
University of Oxford
2012–2016
BSc Computer Science (1st in class)
Pompeu Fabra University, Barcelona

Selected Papers

Writing

Notes and essays, continuously updated. Sorted by last edit.