



Sparse Autoencoders



Why are these jits important

SAEs aim to retain causally relevant information from neural activations, not just minimize bits (Hinton & Salakhutdinov). 
  • Critical for interpreting model states (Why did GPT-4 output X?).

Intermediate activations (e.g., MLP layers) encode features in superposition (Elhage et al.). 
  • SAEs disentangle these via sparse bottlenecks.

Decoding internal states allows auditing/steering models (Olah et al.) → we need this for safety!


Architecture // Training

Input (d-dim) → Encoder (W_e: d×m) → ReLU → Latent (m-dim, sparse) → Decoder (W_d: m×d) → Output

ReLU induces sparsity by zeroing negatives (Nair & Hinton)
  • Leaky ReLU variants keep a small gradient on negative pre-activations, trading some sparsity for fewer dead latents.

Loss Function
L = ||x - x̂||² + λ⋅||h||₁ + γ⋅||W_e||²,  where h = ReLU(W_e x + b_e) is the latent and x̂ = W_d h + b_d is the reconstruction

MSE for reconstruction // cosine similarity to preserve directionality (Bricken et al.)
  • Decoder weights → often tied (W_d = W_eᵀ) to reduce parameters (Rumelhart et al.)
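A minimal sketch of this setup in PyTorch (the dimensions, λ, and γ below are illustrative placeholders, and the tied decoder is just one of the options mentioned above):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SparseAutoencoder(nn.Module):
        def __init__(self, d, m, tied=True):
            super().__init__()
            self.W_e = nn.Linear(d, m)               # encoder: d -> m
            self.b_d = nn.Parameter(torch.zeros(d))  # decoder bias
            self.tied = tied
            if not tied:
                self.W_d = nn.Linear(m, d, bias=False)  # untied decoder: m -> d

        def forward(self, x):
            h = F.relu(self.W_e(x))                  # ReLU zeroes negatives -> sparse latent
            if self.tied:
                x_hat = h @ self.W_e.weight + self.b_d   # W_d = W_e^T (tied)
            else:
                x_hat = self.W_d(h) + self.b_d
            return x_hat, h

    def sae_loss(sae, x, x_hat, h, lam=1e-3, gamma=1e-5):
        recon = F.mse_loss(x_hat, x)                        # ||x - x_hat||^2 (mean)
        sparsity = lam * h.abs().sum(dim=-1).mean()         # lambda * ||h||_1
        weight_decay = gamma * sae.W_e.weight.pow(2).sum()  # gamma * ||W_e||^2
        return recon + sparsity + weight_decay

Usage looks like x_hat, h = sae(acts) followed by loss = sae_loss(sae, acts, x_hat, h), where acts is a batch of MLP or residual-stream activations.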


L1 Regularization

Adds λ⋅||h||₁ to the loss function, penalizing non-zero activations
Many latents end up firing rarely across all inputs (global sparsity)

But

Pulls all activations toward zero, including important ones (shrinkage bias)
Requires tuning λ (tradeoff: sparsity vs. reconstruction loss)
Latents can go permanently “off” due to over-penalization → dead neurons

L1 struggles to fix these problems, so many latents stay polysemantic (encoding several features at once).
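A quick diagnostic for the dead-neuron issue: track which latents ever fire over a stream of inputs. A sketch that reuses the SparseAutoencoder above; data_loader is whatever yields batches of activations:

    @torch.no_grad()
    def dead_latent_fraction(sae, data_loader, max_batches=100):
        """Fraction of latents that never fire (stay exactly zero) on any input seen."""
        fired = None
        for i, x in enumerate(data_loader):
            _, h = sae(x)                        # (batch, m) latent activations
            batch_fired = (h > 0).any(dim=0)     # which latents fired at least once
            fired = batch_fired if fired is None else (fired | batch_fired)
            if i + 1 >= max_batches:
                break
        return 1.0 - fired.float().mean().item()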


Top K Sparsity

Force only the K largest latents to stay active; zero the others
Use a straight-through estimator so gradients still behave through the hard top-K mask (e.g., h + (h⋅topk_mask − h).detach(): equals h⋅topk_mask on the forward pass, acts like h on the backward pass); see the sketch after this list

Active latents retain original magnitudes
Set K directly (e.g., 256 active/16M latents)
Lower reconstruction error at same sparsity level vs L1
Higher monosemanticity
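A sketch of the Top-K step with that straight-through trick (K is illustrative; some Top-K SAEs skip the estimator entirely and just let gradients flow through the selected entries):

    def topk_activation(h, k=256, straight_through=True):
        """Keep the k largest latents per example, zero out the rest."""
        idx = torch.topk(h, k, dim=-1).indices
        mask = torch.zeros_like(h).scatter_(-1, idx, 1.0)
        h_sparse = h * mask                      # active latents keep their magnitudes
        if straight_through:
            # forward pass: h_sparse ; backward pass: gradient flows as if this were h
            return h + (h_sparse - h).detach()
        return h_sparse

In the forward pass you would replace the plain ReLU with something like h = topk_activation(F.relu(self.W_e(x)), k=256).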


L0 Regularization

Penalize the number of non-zero activations (the L0 “norm”)
Direct sparsity control
L0 is non-differentiable, so training relies on a stochastic relaxation (e.g., hard-concrete gates, Louizos et al.) → noisy gradients; sketch below
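A sketch of the hard-concrete gate from Louizos et al. that makes an L0-style penalty trainable (the β, γ, ζ values below are the commonly quoted defaults, used here only as placeholders):

    import math

    class HardConcreteGate(nn.Module):
        """Stochastic gate z in [0, 1] per latent, with a differentiable expected-L0 penalty."""
        def __init__(self, m, beta=2/3, gamma=-0.1, zeta=1.1):
            super().__init__()
            self.log_alpha = nn.Parameter(torch.zeros(m))
            self.beta, self.gamma, self.zeta = beta, gamma, zeta

        def forward(self, h):
            if self.training:                    # sample logistic noise (the stochastic relaxation)
                u = torch.rand_like(h)
                s = torch.sigmoid((u.log() - (1 - u).log() + self.log_alpha) / self.beta)
            else:                                # deterministic gate at eval time
                s = torch.sigmoid(self.log_alpha)
            z = torch.clamp(s * (self.zeta - self.gamma) + self.gamma, 0.0, 1.0)
            return h * z

        def expected_l0(self):
            # expected number of non-zero gates: the differentiable surrogate for ||h||_0
            return torch.sigmoid(
                self.log_alpha - self.beta * math.log(-self.gamma / self.zeta)
            ).sum()

The total loss then adds λ ⋅ gate.expected_l0() in place of the L1 term; the sampling step is where the noisy gradients come from.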


Hardware // Scalability

TPUs/GPUs are down bad for sparsity: they are built for dense matmuls
Sparse matrix ops (block-sparse kernels) recover the compute/memory savings (Gray et al.)
16M-latent SAE on GPT-4 → ~1% active latents per input → only ~160K active latents to decode, feasible on TPUv4 (sketch below)
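The reason ~1% activity is cheap: the decoder only ever needs the rows belonging to active latents. A sketch for a single example (shapes are illustrative):

    def sparse_decode(h, W_d, b_d):
        """Decode one example using only its active latents (h: (m,), W_d: (m, d))."""
        active = h.nonzero(as_tuple=True)[0]     # indices of the ~K non-zero latents
        # gather ~K decoder rows instead of multiplying the full m x d matrix
        return h[active] @ W_d[active] + b_d

With m = 16M and ~1% activity that is ~160K rows per input; block-sparse kernels apply the same idea at the hardware level.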

Scale matters because
  • Larger SAEs capture long-tail features (rare programming syntax, niche cultural concepts, obscure APIs).
  • Scaling laws for interpretability? Early results suggest SAE performance improves predictably with size (Anthropic; Kaplan et al.)

Interpretability Metrics

Diversity Score: how many distinct features a single latent responds to (Goh).
Importance Analysis: ablate a latent → measure the downstream effect (Wang); see the sketch below.

Top-K SAEs achieve ~60% monosemanticity vs ~30% for L1 (Bricken) - jesus christ
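A sketch of that ablation-style importance analysis: zero one latent in the reconstruction and see how much the downstream loss moves. Here run_model_from_reconstruction is a hypothetical helper that patches the reconstruction back into the model and returns its loss; it is not a real API:

    @torch.no_grad()
    def latent_importance(sae, j, x, run_model_from_reconstruction):
        """Downstream loss increase when latent j is zeroed (larger = more important)."""
        x_hat, h = sae(x)
        base_loss = run_model_from_reconstruction(x_hat)
        h_ablated = h.clone()
        h_ablated[..., j] = 0.0                               # kill latent j
        x_hat_ablated = h_ablated @ sae.W_e.weight + sae.b_d  # tied decoder, as in the sketch above
        return run_model_from_reconstruction(x_hat_ablated) - base_loss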

Circuit Analysis

SAE + attention heads → map latents to the attention heads driving specific behaviors (e.g., citation formatting)

Detect "backdoor" latents activated by adversarial prompts (Ilyas).

Challenges

Linking latents to model outputs is combinatorially hard (you cannot exhaustively test millions of latents)
Tools
  • Causal Scrubbing: edit latents → measure output changes (Wang).
  • Shapley Values: quantify each latent's contribution to the output (Lundberg & Lee).

Initialization Matters
Orthogonal decoder weights help prevent feature collapse (Saxe); see the init sketch below
Avoiding bad local minima → curriculum learning (start dense, then ramp up the sparsity penalty).
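A minimal init sketch for the untied case, assuming the SparseAutoencoder above (unit-norm decoder columns are a common extra convention, not something this post specifies):

    def init_sae(sae):
        """Orthogonal init to discourage duplicate/collapsed features."""
        nn.init.orthogonal_(sae.W_e.weight)
        if not sae.tied:
            nn.init.orthogonal_(sae.W_d.weight)
            with torch.no_grad():
                # give each latent's decoder direction unit norm
                sae.W_d.weight /= sae.W_d.weight.norm(dim=0, keepdim=True)
        return sae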

References

  • Foundational:
    • Hinton & Salakhutdinov (2006) - Reducing dimensionality with autoencoders.
    • Tibshirani (1996) - L1 regularization (Lasso).
  • SAE-Specific:
    • Bricken et al. (2023) - Scaling SAEs to 16M latents.
    • Cunningham et al. (2023) - Monosemanticity in SAEs.
  • Theory:
    • Elhage et al. (2022) - Superposition in neural networks.
    • Louizos et al. (2018) - L0 regularization.
  • Applications:
    • Olah et al. (2020) - Circuits in transformers.
    • Lundberg & Lee (2017) - SHAP values for interpretability.
  • Hardware:
    • Jouppi et al. (2017) - TPU architecture.
    • Gray et al. (2017) - Block-sparse matrix ops.