Exploring Tuned Lenses via Subspace Ablations
Background
Tuned Lenses let you read intermediate-layer predictions: a small learned affine translator per layer maps that layer's hidden state into vocabulary-logit space, instead of only inspecting the final layer's output.
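As a minimal sketch of that mapping, the following toy NumPy example applies a per-layer affine translator and then the unembedding matrix to a hidden state. All shapes, names (A_l, b_l, W_U), and the random data are illustrative assumptions, not the actual lens weights.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 64, 1000  # toy dimensions (assumed)

h_l = rng.normal(size=d_model)            # hidden state at layer l
# A learned affine translator (here: near-identity, randomly perturbed)
A_l = np.eye(d_model) + 0.01 * rng.normal(size=(d_model, d_model))
b_l = np.zeros(d_model)
W_U = rng.normal(size=(vocab, d_model))   # unembedding matrix

# Tuned-lens readout: translate the intermediate state into the
# final-layer basis, then unembed to get vocabulary logits.
lens_logits = W_U @ (A_l @ h_l + b_l)
top_token = int(np.argmax(lens_logits))   # layer-l "prediction"
```

In a real model, A_l and b_l are trained per layer to minimize divergence from the final-layer distribution; here they stand in only to show the shape of the computation.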
More broadly, mechanistic interpretability is a field of AI research that tries to reverse-engineer how neural networks compute their outputs in human-understandable terms, rather than treating a model as a black box.
Motivation
I wanted to test whether the lens-aligned predictive space splits cleanly into directions that strongly affect the output logits and directions that barely affect them.
Core Idea
Using SVD of the unembedding matrix, I split the lens-aligned space into high-singular-value and low-singular-value subspaces, then ablated each and measured prediction shifts with KL divergence.
Results
With pretrained Tuned Lenses, ablating high-singular-value directions caused much larger KL divergence than ablating low-singular-value directions, especially in later layers. This separation disappeared with low-quality lenses trained on a small text subset, indicating high sensitivity to lens quality and training distribution.
Interpretation and Limitations
The results support a linearly identifiable output-relevant subspace in lens-aligned space, but this is still exploratory. Conclusions are limited by reliance on pretrained lenses, no OOD validation, and incomplete follow-up experiments.
Status
This project is exploratory and unfinished. A planned extension was to back-propagate subspace importance through lens maps to identify likely circuit endpoints, but this was not completed.
References
Belrose, N., Furman, Z., Smith, L., et al. (2023).
Eliciting Latent Predictions from Transformers with the Tuned Lens.
arXiv:2303.08112. https://arxiv.org/abs/2303.08112