Sanity Checks for Sparse Autoencoders: Do SAEs Beat Random Baselines?
Anton Korznikov, Andrey Galichin, et al.
Sparse Autoencoders (SAEs) have emerged as a promising tool for interpreting neural networks by decomposing their activations into sparse sets of human-interpretable features. Recent work has introduc...