Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders
arXiv preprint · 2026
William Lehn-Schiøler, Magnus Ruud Kjær, Rahul Thapa, Magnus Guldberg Pedersen, Anton Mosquera Storgaard, Nick Williams, Radu Gatej, Tue Lehn-Schiøler, Andreas Brink-Kjær, Sadasivan Puthusserypady, Sándor Beniczky, James Zou, Lars Kai Hansen
Summary
EEG foundation models now match or beat specialist baselines on a range of clinical tasks – but their internal computations are a black box, and that opacity is the gating factor on clinical trust. We apply TopK sparse autoencoders to three architecturally distinct EEG transformers (SleepFM, REVE, LaBraM) and recover sparse feature dictionaries from their embedding spaces, giving the first head-to-head look at what these models have actually learned.
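To make the method concrete, here is a minimal sketch of a TopK sparse autoencoder forward pass in NumPy. It is a generic illustration, not the authors' implementation: the function name, the weight layout, the pre-encoder bias subtraction, and the toy sizes (`d_model`, `n_features`, `k`) are all assumptions, and the summary does not state the paper's dictionary size, value of k, or training objective.

```python
import numpy as np


def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Forward pass of a generic TopK sparse autoencoder (illustrative only).

    x:     (batch, d_model) embeddings taken from a frozen model
    W_enc: (d_model, n_features), b_enc: (n_features,)
    W_dec: (n_features, d_model), b_dec: (d_model,)
    k:     number of features allowed to be active per input
    """
    # Encode: centre the input with the decoder bias, then project.
    pre = (x - b_dec) @ W_enc + b_enc  # (batch, n_features)

    # Keep only the k largest pre-activations in each row, zero the rest.
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]
    rows = np.arange(pre.shape[0])[:, None]
    latents = np.zeros_like(pre)
    latents[rows, idx] = np.maximum(pre[rows, idx], 0.0)  # ReLU on survivors

    # Decode: reconstruct the embedding from the sparse code.
    recon = latents @ W_dec + b_dec
    return latents, recon


# Toy usage with random, untrained weights (hypothetical sizes).
rng = np.random.default_rng(0)
d_model, n_features, k = 16, 64, 4
x = rng.normal(size=(8, d_model))
W_enc = rng.normal(size=(d_model, n_features)) / np.sqrt(d_model)
W_dec = W_enc.T.copy()  # tied initialisation, a common but optional choice
latents, recon = topk_sae_forward(
    x, W_enc, np.zeros(n_features), W_dec, np.zeros(d_model), k
)
```

In a trained dictionary, each column of `latents` would be a candidate feature to inspect against the EEG signal. With at most `k` active features per embedding, sparsity is enforced directly rather than through an L1 penalty.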
Why it matters
For EEG foundation models to deliver real value in the clinic, doctors need to be able to use and trust them. I believe explainability is a big part of that: it lets a doctor relate the model's latent signals to the patient in front of them, rather than receiving a black-box answer from the AI.