Exploring Tuned Lenses via Subspace Ablations

Background

A Tuned Lens lets you read out a model's intermediate-layer predictions: a small learned affine map ("translator") carries each layer's hidden state into the final layer's representation, which is then unembedded into vocabulary-logit space, instead of only inspecting the final layer's output.
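As a minimal sketch of that readout, assuming a per-layer translator (A, b), a final layer norm, and an unembedding matrix W_U (all shapes and values below are random stand-ins, not a real model):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 64, 100

# Hypothetical stand-ins: a hidden state h at some intermediate layer,
# a learned affine "translator" (A, b) for that layer, and the model's
# unembedding matrix W_U.
h = rng.normal(size=d_model)
A = np.eye(d_model) + 0.01 * rng.normal(size=(d_model, d_model))
b = np.zeros(d_model)
W_U = rng.normal(size=(vocab, d_model))

def layer_norm(x, eps=1e-5):
    # Stand-in for the model's final layer norm (no learned scale/shift).
    x = x - x.mean()
    return x / np.sqrt(x.var() + eps)

# Tuned-lens readout: translate the hidden state toward the final
# layer's basis, normalize, then unembed into vocabulary logits.
logits = W_U @ layer_norm(A @ h + b)
probs = np.exp(logits - logits.max())
probs /= probs.sum()
```

In the real setup, A and b are trained per layer so the lens's distribution matches the model's final output; the random A here only exercises the shapes.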

More broadly, mechanistic interpretability is a field of AI research that tries to reverse-engineer how neural networks compute their outputs in human-understandable terms, rather than treating a model as a black box.

Motivation

I wanted to test whether the lens-aligned predictive space can be split into directions that strongly affect the output logits and directions that only weakly affect them.

Core Idea

Using the SVD of the unembedding matrix, I split the lens-aligned space into a high-singular-value and a low-singular-value subspace, ablated (projected out) each in turn, and measured the resulting prediction shift with KL divergence against the unablated distribution.
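The procedure can be sketched as follows. This is a hedged reconstruction, not the project's actual code: the split point k, shapes, and data are hypothetical (random stand-ins rather than real model states), and only the machinery of the experiment is shown.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 64, 100
W_U = rng.normal(size=(vocab, d_model))  # stand-in unembedding matrix

# SVD of the unembedding: the rows of Vt are orthonormal directions in
# the lens-aligned space, ordered by singular value.
U, S, Vt = np.linalg.svd(W_U, full_matrices=False)
k = 8  # hypothetical split between "high" and "low" directions
V_high, V_low = Vt[:k], Vt[k:]

def ablate(h, V):
    # Project out the subspace spanned by the (orthonormal) rows of V.
    return h - V.T @ (V @ h)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl(p, q, eps=1e-12):
    # KL(p || q) between two prediction distributions.
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

h = rng.normal(size=d_model)          # stand-in lens-aligned hidden state
p = softmax(W_U @ h)                  # unablated prediction
kl_high = kl(p, softmax(W_U @ ablate(h, V_high)))
kl_low = kl(p, softmax(W_U @ ablate(h, V_low)))
```

On random data these two KL values carry no signal; the experiment's claim is about what happens when h comes from real hidden states mapped through a trained lens.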

Results

With pretrained Tuned Lenses, ablating high-singular-value directions caused much larger KL divergence than ablating low-singular-value directions, especially in later layers. This separation disappeared with low-quality lenses trained on a small text subset, indicating high sensitivity to lens quality and training distribution.

Interpretation and Limitations

The results support a linearly identifiable output-relevant subspace in lens-aligned space, but this is still exploratory. Conclusions are limited by reliance on pretrained lenses, no OOD validation, and incomplete follow-up experiments.

Status

This project is exploratory and unfinished. A planned extension was to back-propagate subspace importance through lens maps to identify likely circuit endpoints, but this was not completed.



References

Belrose, N., Furman, Z., Smith, L., Halawi, D., Ostrovsky, I., McKinney, L., Biderman, S., & Steinhardt, J. (2023). Eliciting Latent Predictions from Transformers with the Tuned Lens. arXiv:2303.08112. https://arxiv.org/abs/2303.08112