Through the Lenses - A Circuit Odyssey
Why outputs from techniques like LogitLens need a careful relook.
Viveka, one of the projects at the AI Club, IIT Madras, aims to mitigate hallucinations in Large Language Models using Mechanistic Interpretability.
Overview
Mechanistic Interpretability aims to study the components of neural networks like transformers and use knowledge about the mechanisms utilized by these components to predict model behaviour.
The model, however, processes information in the form of vectors in high-dimensional spaces, and making sense of these representations in human terms, if it can be done at all, is a significant challenge. Logit Lens has been used repeatedly ([2], [3]) to shed light on how the model “works” internally: figuratively, a lens to peer into the mind of the model. The approach is particularly interesting because of how the next token seems to converge out of several possible options.
One problem is that this may not be a faithful decoding of the internal representations, since the decoded outputs depend both on the activations at the layer of interest and on the weights of the layer norm and unembedding matrix. The activations at the intermediate layers may be out of distribution for these weights, resulting in nonsensical results.
An important point to note is that the vectors corresponding to tokens in the unembedding matrix have a significant variance in their norms, which can cause tokens that are usually improbable in generation to have an outsized impact in analysis with Logit Lens. To mitigate this issue, we normalize these vectors and show that in several cases this results in fewer nonsensical tokens being generated. We call this simple, yet novel method the Normed Lens.
We provide code to reproduce our results, as well as an implementation of the Tuned Lens paper ([7]) in transformer-lens so that it can be used easily on other models as well.
All our code and exploratory notebook to play around with lenses can be found here.
Introduction
As Large Language Models (LLMs) are increasingly incorporated into almost every field, hallucinations, or factually incorrect responses generated by LLMs, have become a serious issue. While numerous techniques have been proposed to mitigate hallucinations, a promising direction in which to look for a solution is Mechanistic Interpretability, the study of the internal activations and parameters of machine learning models (LLMs in this case) in order to explain their behaviour.
A circuit is a network of blocks or components of a model’s internals that together perform a function over the residual stream.
Previous work ([4], [5]) suggests that factual recall can be decomposed into a combination of circuits. We study factual recall circuits in hallucinated and non-hallucinated examples as a potential approach to identify the cause of hallucination.
To locate circuits, we probe the model for intermediate activations, ablate/patch the circuit and study downstream effects due to such local edits on these circuits. As a preliminary step, we used ‘Lenses’ to look at the model’s ‘thought process’, and to broadly locate which parts of the model could have circuits of interest.
In this blog, we present various Lenses currently used such as the Logit Lens, Tuned Lens and introduce a simple, yet novel technique that we call the Normed Lens.

All our experiments were done on gemma-2-2b-it.
Background and Notation
LLMs generate text using a decoder-only transformer architecture. Each input token is first converted into an embedding vector, and the sequence of embeddings passes through multiple layers of transformer blocks—composed of attention and MLP modules. These layers iteratively refine the token representations in what’s known as the residual stream. At the end, an unembedding matrix projects the final residual stream to a vector matching the vocabulary size, producing logits for each token. Applying a softmax function transforms the logits into probabilities across the vocabulary, from which the next token is selected using strategies like top-k, top-p, or greedy sampling.
We will use the following notation convention:
Let the input sequence of tokens be t₁, t₂, …, tₙ (one-hot encoded).
The tokens tᵢ are converted to embedding vectors rᵢ⁽⁰⁾ by retrieving rows of a lookup table \(W_E \in \mathbb{R}^{V \times d}\), also known as the embedding matrix:
\({r_{i}}^{(0)} = t_i W_E\)
This sequence of embeddings is called the residual stream.
It is then processed through L layers of a transformer. In each layer l, the residual stream is updated by the attention and the MLP block:
\(m^{(l)} = r^{(l-1)} + \mathrm{attn}(r^{(l-1)})\)
\(r^{(l)} = m^{(l)} + \mathrm{mlp}(m^{(l)})\)
Note that throughout the transformer layers, the dimension of the residual stream is kept constant at d.
Finally, the unembedding matrix \(W_U \in \mathbb{R}^{d \times V}\) maps the final residual stream to a vector whose size is that of the vocabulary (V), giving the logits:
\(\mathrm{logits} = r^{(L)} W_U\)
These logits are converted to a probability distribution on the vocabulary by the softmax function:
\(p = \mathrm{softmax}(\mathrm{logits})\)
Here each element of p is the probability of the corresponding token being generated by the model.
Next tokens are generated from this distribution using various strategies like top-k, top-p or greedy sampling.
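The notation above can be traced end to end on a toy example. The following is only a minimal sketch in plain Python (no tensor library), for a single token position: the sizes V, d and L are made up, the weights are random, and toy_block is a hypothetical stand-in for the attention and MLP updates rather than anything from a real model.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def vec_mat(vec, mat):
    """Row vector (length d) times matrix (d x V) -> row vector (length V)."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]

random.seed(0)
V, d, L = 5, 4, 3  # toy vocabulary size, residual width, number of layers
W_E = [[random.gauss(0, 1) for _ in range(d)] for _ in range(V)]  # V x d
W_U = [[random.gauss(0, 1) for _ in range(V)] for _ in range(d)]  # d x V

def toy_block(r, layer):
    """Hypothetical stand-in for attn + mlp: returns a small update for r."""
    return [0.1 * math.tanh(x + layer) for x in r]

token_id = 2
r = list(W_E[token_id])  # r^(0): a one-hot token times W_E is a row lookup
for layer in range(1, L + 1):
    update = toy_block(r, layer)
    r = [a + b for a, b in zip(r, update)]  # residual update, width stays d

logits = vec_mat(r, W_U)  # one logit per vocabulary entry
p = softmax(logits)       # probability distribution over the vocabulary
next_token = max(range(V), key=lambda j: p[j])  # greedy decoding
```

Greedy decoding simply takes the most probable entry of p; top-k and top-p would instead sample from a truncated version of the same distribution.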
Some extras on the transformer architecture
The above presentation of the transformer architecture may be slightly oversimplified. A few small (mostly unrelated) details haven’t been dealt with in the notation to aid simplicity.
Positional embeddings: Typically, transformers encode position of the token in the zero-layer residual stream as positional encodings.
\({r_{i}}^{(0)} = t_i W_E + {W_{pos}}^{i}\)
Many transformers, including Gemma, use rotary positional embeddings (RoPE) instead, which are applied inside the attention block, so they are included in the attn(.) function above.
Layer norm: Typically, before each transformer block acts on the residual stream, as well as before the residual stream is decoded by the unembedding matrix, the residual stream is layer-normalized. That is, from each token’s residual vector, its mean is subtracted and the result is divided by its standard deviation. Then, optionally (depending on the model), the residual stream is scaled and translated by learned parameters.
Some architectures, like Gemma, use RMS norm, which is similar to layer norm except that the mean is not subtracted; the vector is simply divided by its root mean square.
The function of layer norm can be interpreted as:
Subtracting mean: Removing all components of the residual stream along the 1-direction, i.e. the direction along [1 1 . . . 1]. (This means that throughout the layers, the residual stream is not going to use the 1-direction to store any useful information.)
Dividing by standard deviation/norm: Projecting the residual stream onto a high-dimensional sphere of fixed radius (√d when dividing by the standard deviation or the root mean square, 1 when dividing by the norm). (This can be seen as a kind of pre-processing of the residual stream to scale its magnitude such that the attention/MLP blocks or the unembedding matrix can work on it.)
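The two normalizations described above can be checked numerically. This is a minimal sketch in plain Python that leaves out the learned scale and shift and sets the usual small epsilon to zero, so the geometric properties hold exactly; the input vector is arbitrary.

```python
import math

def layer_norm(x, eps=0.0):
    """Subtract the mean, then divide by the standard deviation."""
    mean = sum(x) / len(x)
    centered = [v - mean for v in x]
    var = sum(c * c for c in centered) / len(x)
    return [c / math.sqrt(var + eps) for c in centered]

def rms_norm(x, eps=0.0):
    """Divide by the root mean square; the mean is NOT subtracted."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def l2_norm(x):
    return math.sqrt(sum(v * v for v in x))

x = [2.0, -1.0, 0.5, 4.5]  # arbitrary residual vector with d = 4
ln = layer_norm(x)
rn = rms_norm(x)

# layer norm removes the component along [1, 1, ..., 1] ...
component_along_ones = sum(ln)
# ... and both variants place the vector on a sphere of radius sqrt(d).
radius_ln = l2_norm(ln)
radius_rms = l2_norm(rn)
```

Here component_along_ones is zero up to rounding, and both radii equal √d = 2 for this four-dimensional input.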
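So strictly speaking, with LN denoting the layer norm (or RMS norm), the transformer block equations should be
\(m^{(l)} = r^{(l-1)} + \mathrm{attn}(\mathrm{LN}(r^{(l-1)}))\)
\(r^{(l)} = m^{(l)} + \mathrm{mlp}(\mathrm{LN}(m^{(l)}))\)
\(\mathrm{logits} = \mathrm{LN}(r^{(L)}) W_U\)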
Lens: Peering into the LLM’s thoughts
In order to even attempt to interpret the complex computations going on inside a transformer, we need some tools that can crudely decode the LLM’s internal ‘thoughts’, as the vectors and matrices by themselves are only as useful to us as a binary is for a computer scientist.
The most intuitive (and perhaps the only) way to unroll the residual stream for us humans to understand is to somehow translate it to words (or tokens, to be more precise).
Over time, many techniques have been developed to map an internal vector of a transformer to a probability distribution over the entire vocabulary, and they are collectively called Lenses.
Logit Lens
Consider the unembedding matrix \(W_U \in \mathbb{R}^{d \times V}\).
Notice that the unembedding matrix has V columns, each of dimension d.
Calculation of logits involves:
\(\mathrm{logits} = r_n^{(L)} W_U\)
This calculation can be interpreted as taking dot products of the final residual stream over the final token with each of the column vectors of the unembedding matrix to find the logits for each token in the vocabulary.
We can simplify this interpretation by calling each column of the unembedding matrix the ‘unembedding vector’ for the corresponding token in the vocabulary. This way, we can interpret the logit of a particular token (which is directly related to the probability it is going to be assigned) as just the ‘amount’ of the token’s unembedding vector in the residual stream.
(Note: here we are assuming the ‘amount’ of a token’s unembedding vector in the residual stream is characterised by their dot product.)
For example, if the next most probable token is going to be ‘cat’, we can expect the residual stream to be somewhere along the direction of ‘cat’ and related tokens, thus assigning higher logits to tokens like ‘cat’, ‘dog’, ‘feline’ etc. and lower logits to other unrelated tokens, upon multiplying with the unembedding matrix.

This useful bit of interpretation of the unembedding matrix gives us insights into how we can extend the idea for constructing a Lens to interpret the residual stream at any intermediate layer and token position.
The Logit Lens is a simple technique where the residual stream at any layer and token position is translated into tokens by multiplying it with the unembedding matrix. It is a way to translate the ‘model space’ to ‘token space’ or ‘word space’ in order to read the ‘LLM’s thoughts’.
The idea is this: If all transformer blocks are adding small changes to the residual stream to eventually align it with the unembedding of the next probable word, we should be able to interpret the residual stream at all layers and token positions by looking at their dot products with the unembedding vectors (aka multiplying with the unembedding matrix).
Once we have the logits, we can convert them into a probability distribution using the softmax function and hopefully the top tokens can give us an idea about the closest direction the residual stream is aligned towards.
Note that we can also apply Logit Lens to m⁽ˡ⁾, the residual stream after the attention block and before the MLP block, to peer more closely into the workings of the transformer.
This gives us useful insights into how the model ‘thinks’ its way through the layers to generate its final output.
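The procedure described above can be illustrated on hand-made numbers. The sketch below is only a toy: the three-word vocabulary, the 2-dimensional unembedding vectors and the per-layer residuals are all invented for illustration, and layer norm is omitted. In practice the residuals would come from the model's cached activations.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_lens(residual, W_U, vocab, top_k=2):
    """Decode a residual vector with the final unembedding matrix W_U (d x V)."""
    logits = [sum(r * row[j] for r, row in zip(residual, W_U))
              for j in range(len(vocab))]
    probs = softmax(logits)
    ranked = sorted(range(len(vocab)), key=lambda j: probs[j], reverse=True)
    return [(vocab[j], probs[j]) for j in ranked[:top_k]]

vocab = ["cat", "dog", "car"]
# Columns are the unembedding vectors: cat=(2, 0), dog=(1.5, 1), car=(-1, 2).
W_U = [[2.0, 1.5, -1.0],
       [0.0, 1.0, 2.0]]

# Made-up residual stream at the last token position after each of 3 layers.
residuals = [[0.0, 1.0], [1.0, 1.0], [3.0, 0.5]]

top_tokens = []
for layer, r in enumerate(residuals):
    decoded = logit_lens(r, W_U, vocab)
    top_tokens.append(decoded[0][0])
    print(layer, decoded)
```

In this toy, the top token moves from 'car' to 'dog' to 'cat' as the residual rotates toward the unembedding vector of 'cat', which is the kind of layer-by-layer convergence the Logit Lens is meant to expose.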

While trying Logit Lens on gemma-2-2b-it, we observed that rare tokens like ‘_myſelf’, ‘_pleaſure’, etc. popped up unusually often in the intermediate layers, highlighting that Logit Lens wasn’t working quite as expected.
We observe that this behaviour is more pronounced when the model is prompted with ‘Answer in one word’ or similar word limits. Also, most of the time, it happens in the residual stream of special tokens like <start_of_turn>, <end_of_turn> etc.
Note that the working of logit lens was based on the assumption that the unembedding matrix learned at the final layer provides a meaningful decoding for the residual stream at every intermediate layer as well.
However, the residual stream at earlier layers might often represent features and abstractions that are quite distinct from those at the output layer (for example, as discussed in [6]), meaning that projecting these intermediate activations directly onto the output vocabulary with the final unembedding matrix may not provide a faithful or accurate interpretation.
Tuned Lens
Belrose et al. [7] observed that Logit Lens failed to yield meaningful results when tested on several recent LLMs.
They attributed this to two key causes:
Transformer layers learn to output residuals that are far from zero on average, hence the input to Logit Lens may be out of distribution for the unembedding matrix. In other words, using Logit Lens on an intermediate layer can be thought of as setting the residual contributions of later layers to zero. However, the unembedding matrix may rely on these contributions, which may act as a bias term.
The transformer hidden states contain a small number of very high-variance dimensions, and these “rogue dimensions” tend to be distributed unevenly across layers. Ablating them can drastically harm performance, so if the unembedding matrix relies on the presence of these outlier dimensions, the perplexity of Logit Lens predictions might be spuriously high.
We suspected that this could possibly explain the frequent occurrence of rare tokens in the Logit Lens of intermediate layers.

Figure (gemma-2-2b-it): The plot obtained on removing the top 2 components suggests that these ‘rogue dimensions’ indeed exist in gemma-2-2b-it activations and are distributed unevenly. Activations of nearly 750 tokens were taken. We compute the covariance matrix for each layer and take the Frobenius cosine similarity between the covariance matrices of each pair of layers. Refer to the implementation here for further details.
Belrose et al. proposed that it can be beneficial to have a small, layer-specific affine mapping that transforms the residuals of each layer into a space compatible with the final unembedding. That is,
\(\mathrm{TunedLens}_l(r^{(l)}) = \mathrm{LogitLens}(A_l r^{(l)} + b_l)\)
The loss function chosen to learn Aₗ and bₗ is
\(\mathrm{KL}(p \,\|\, \mathrm{TunedLens}_l(r^{(l)}))\)
where p is the probability distribution generated by the model when decoded normally.
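To make the role of the affine map concrete, here is a minimal sketch in plain Python. It is not the training code: the translator (A, b) is hand-picked rather than learned by gradient descent, layer norm is omitted, and the 2-dimensional residuals and the unembedding matrix are invented toy values.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_lens(r, W_U):
    """Probabilities from decoding r directly with W_U (d x V)."""
    V = len(W_U[0])
    return softmax([sum(x * row[j] for x, row in zip(r, W_U)) for j in range(V)])

def tuned_lens(r, A, b, W_U):
    """Apply the layer-specific affine map A r + b, then decode as usual."""
    translated = [sum(a * x for a, x in zip(row, r)) + bi for row, bi in zip(A, b)]
    return logit_lens(translated, W_U)

def kl(p, q):
    """KL(p || q) for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

W_U = [[2.0, 1.5, -1.0],
       [0.0, 1.0, 2.0]]
r_mid = [1.0, 1.0]    # toy intermediate-layer residual
r_final = [3.0, 0.5]  # toy final-layer residual
p_final = logit_lens(r_final, W_U)  # the model's "normal" output distribution

identity = [[1.0, 0.0], [0.0, 1.0]]
zero = [0.0, 0.0]
# With A = I and b = 0 the tuned lens reduces to the plain logit lens.
loss_untrained = kl(p_final, tuned_lens(r_mid, identity, zero, W_U))

# A hand-picked translator that maps r_mid exactly onto r_final.
b_fitted = [2.0, -0.5]
loss_fitted = kl(p_final, tuned_lens(r_mid, identity, b_fitted, W_U))
```

Training would search for the A and b of each layer that minimise this loss averaged over many tokens; the hand-picked bias here only shows that a suitable translator drives the loss to zero.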
Since the Tuned Lens repository does not support gemma-2-2b-it, we provide an implementation using transformer-lens to enable extending its application to a wider variety of models. Access the code at: github.com/ilatims-b/Through-the-Lenses--a-Circuit-Odyssey
We trained Tuned Lens for gemma-2-2b-it on the ‘en’ subset of allenai/c4. Use this training script to train a Tuned Lens for any model supported by transformer-lens.
Normed Lens
Upon looking carefully at the unembedding matrix, we found that the norms of the unembedding vectors of tokens ‘_myſelf’ , ‘_pleaſure’, etc. were higher than the others and in fact dominated the top norms.
Recall that a token’s logit is just the dot product of its unembedding vector with the residual stream.
The dot product of any two vectors a and b is given by
\(a \cdot b = |a| \, |b| \cos\theta\)
where θ is the angle between a and b.
Thus, a high dot product can be due to either a small angle between the vectors or due to a high norm of either of the vectors.
Intuitively, while trying to interpret the residual stream, we should only be concerned with how close the residual stream’s direction is to a particular unembedding vector’s direction (i.e. the angle between the two), and not the magnitude of the unembedding vector itself.
The magnitude of the unembedding vector itself can be seen only as a bias learnt by the LLM to maybe aid it while generating the next token, but it should not interfere when we are trying to interpret the residual stream.
Hence, we propose the Normed Lens, a slight modification of the Logit Lens, defined as follows:
\(\mathrm{NormedLens}(r^{(l)}) = \mathrm{softmax}(r^{(l)} \bar{W}_U)\)
where \(\bar{W}_U\) (which will be referred to as normalized(Wᵤ) in the text that follows) is the unembedding matrix with normalized columns.
In simple words, instead of defining logits as the dot products of the residual stream with the unembedding vectors, we are defining logits as the dot products of the residual stream with the unit-normalized unembedding vectors, i.e. the cosine similarity between the two, scaled by the norm of the residual stream.
We observed that this significantly reduced the occurrences of high-norm tokens among the top tokens of the Normed Lens and resulted in a (hopefully) cleaner understanding of the residual stream.
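The effect of normalizing the unembedding vectors can be seen on a small hand-made example. This is only a sketch in plain Python under simplifying assumptions: layer norm is omitted, the vocabulary and vectors are invented, and '_rare' is a hypothetical stand-in for a high-norm token.

```python
import math

def l2_norm(vec):
    return math.sqrt(sum(v * v for v in vec))

def normalize_columns(W_U):
    """Return a copy of W_U (d x V) whose columns all have unit L2 norm."""
    d, V = len(W_U), len(W_U[0])
    norms = [l2_norm([W_U[i][j] for i in range(d)]) for j in range(V)]
    return [[W_U[i][j] / norms[j] for j in range(V)] for i in range(d)]

def unembed(r, W):
    """Logits: dot product of r with every column of W."""
    return [sum(x * row[j] for x, row in zip(r, W)) for j in range(len(W[0]))]

def top_token(logits, vocab):
    return vocab[max(range(len(vocab)), key=lambda j: logits[j])]

vocab = ["cat", "dog", "_rare"]
# Columns: cat=(1, 0), dog=(0.8, 0.6), _rare=(3, 6). Only _rare has a large norm.
W_U = [[1.0, 0.8, 3.0],
       [0.0, 0.6, 6.0]]
r = [2.0, 0.5]  # residual pointing mostly along 'cat'

logit_lens_logits = unembed(r, W_U)
normed_lens_logits = unembed(r, normalize_columns(W_U))

print(top_token(logit_lens_logits, vocab))   # the high-norm token wins
print(top_token(normed_lens_logits, vocab))  # the best-aligned token wins
```

With raw dot products the large norm of '_rare' dominates even though its direction is a poor match for r; after normalization only the angle matters, and every logit is bounded in magnitude by the norm of r.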

A side effect of using this technique was that now all the logits were strictly in the range [-|r⁽ˡ⁾|, |r⁽ˡ⁾|], and while logit values of high-unembedding-norm tokens decreased, the logit values of many tokens having unembedding norm < 1 increased due to the normalization. This resulted in a more dispersed distribution of logits in the vocabulary (on taking softmax). Nevertheless, the top tokens predicted by the Normed Lens were somewhat more interpretable than those predicted by Logit Lens across layers and token positions.
We also explored a few other tricks like trying out other p-norms for normalizing the unembedding matrix, using conditional normalization (so that the tokens having norms less than 1 do not get normalized), and skewing the distribution by taking powers of probabilities and normalizing (so that the over-normalization gets balanced out), but got nearly the same results in all cases.
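For readers who want to try such variants, the three ideas above could look roughly like the following. This is a sketch of one possible reading of each trick, not the exact code used in our experiments; the helper names (p_norm, normalize_columns, sharpen) and the min_norm parameter are hypothetical.

```python
def p_norm(vec, p=2.0):
    """General p-norm of a vector (p = 2 gives the usual Euclidean norm)."""
    return sum(abs(v) ** p for v in vec) ** (1.0 / p)

def normalize_columns(W_U, p=2.0, min_norm=0.0):
    """Divide each column of W_U (d x V) by its p-norm.

    Columns whose norm is <= min_norm are left unchanged, which gives
    conditional normalization when min_norm = 1.0.
    """
    d, V = len(W_U), len(W_U[0])
    out = [row[:] for row in W_U]
    for j in range(V):
        n = p_norm([W_U[i][j] for i in range(d)], p)
        if n > min_norm and n > 0:
            for i in range(d):
                out[i][j] = W_U[i][j] / n
    return out

def sharpen(probs, power):
    """Raise probabilities to a power and renormalize (power > 1 sharpens)."""
    powered = [q ** power for q in probs]
    total = sum(powered)
    return [q / total for q in powered]

W_U = [[0.3, 3.0],
       [0.4, 4.0]]  # column norms are 0.5 and 5.0

fully_normed = normalize_columns(W_U)                # both columns rescaled
conditional = normalize_columns(W_U, min_norm=1.0)   # only the large column
l1_normed = normalize_columns(W_U, p=1.0)            # a different p-norm
sharpened = sharpen([0.5, 0.3, 0.2], power=2.0)
```

Conditional normalization leaves the small-norm column untouched, so its logits are not inflated, while sharpening counteracts the flatter softmax that normalization produces.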
Normed Lens for Text Generation
If multiplying with normalized(Wᵤ) gives a better interpretation of the residual stream than just Wᵤ, why not entirely replace Wᵤ with normalized(Wᵤ) in the architecture of the transformer itself? Why not use normalized(Wᵤ) for the final decoding (generating the final probability distribution) also?
We show that it is not optimal to use normalized(Wᵤ) for decoding.
Following are some of the examples where decoding using normalized(Wᵤ) turns out to be a pretty bad choice:
The reason behind this is simple: if we replace Wᵤ with normalized(Wᵤ) everywhere, we are basically defining a new model with a changed unembedding matrix, one that has not been learnt during training. So it is unreasonable to expect it to work well unless we have good reasons to do so. Further, we did not find evidence that the norms associated with these tokens are correlated with the frequencies with which they appear in the pre-training dataset (although our tests were in a constrained setting, since access to training datasets for several models is limited, and gemma-2-2b-it is itself trained completely by distillation).
Forget Intuition, Show Math
In order to do a rigorous comparison between the performance of Normed Lens and Logit Lens, we need something more than just intuition. It is difficult to tell which one is interpreting the LLM better, because we just don’t know what the model is actually thinking (or whether it is ‘thinking’ in a way that can be seen in token space at all).
We assume that the ideal Lens should generate a probability distribution that matches the final probability distribution as closely as possible for that transformer layer for all token sequences.
By this assumption, a metric that can be used to measure the quality of a Lens at layer l is
\(\mathrm{KL}(p \,\|\, \mathrm{Lens}(r^{(l)}))\)
where KL(.||.) stands for the KL divergence between two distributions.
Kullback-Leibler (KL) divergence is a mathematical measure of how one probability distribution differs from another. Formally, for distributions P (the true distribution) and Q (the approximation), the KL divergence is:
\(\mathrm{KL}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}\)
A lower KL divergence value means the lens-generated probabilities are very close to the model’s actual probabilities. A higher value means they are more different.
Note that this was actually the loss function used to learn the affine linear transformation in Tuned Lens!
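As a small worked example of this metric, the sketch below computes the KL divergence between a final distribution and the distributions a lens might produce at three layers. All the probabilities are made up for illustration; they are not measurements from gemma-2-2b-it.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Made-up final output distribution of the model over a 3-token vocabulary.
p_final = [0.7, 0.2, 0.1]

# Made-up lens outputs at an early, a middle and the final layer.
lens_outputs = [
    [0.2, 0.3, 0.5],
    [0.5, 0.3, 0.2],
    [0.7, 0.2, 0.1],
]

kl_per_layer = [kl(p_final, q) for q in lens_outputs]
for layer, value in enumerate(kl_per_layer):
    print(f"layer {layer}: KL = {value:.4f}")

# KL is not symmetric, so the order of the arguments matters.
forward = kl(p_final, lens_outputs[0])
reverse = kl(lens_outputs[0], p_final)
```

The divergence shrinks as the lens output approaches the final distribution and is exactly zero when the two coincide. Because KL is asymmetric, the model's final distribution is always placed in the first argument here.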
We plotted the KL divergence values of both Logit Lens and Normed Lens at each layer for a large chunk of text and found the following results:

As is evident from the plots, while the KL divergence for Logit Lens is quite low at the final layers, it is often unusually high at the beginning and the middle layers. On the other hand, while the KL divergence for Normed Lens is not that low in the final layers, it is lower than that of Logit Lens in the initial and middle layers.
Conclusion and Authors’ Remarks
Logit Lens is widely used in circuit analysis in interpretability. Although sufficiently meaningful, it carries a lot of subtleties and might not work for all models. It is always good to do some sanity checks, such as plotting the statistics of the unembedding matrix, and, as is common in everything else in interpretability, to exercise ample caution when analyzing results.
Simply put, Tuned Lens is a proxy that mimics the final layer distribution using an affine transformation. While this may help to some extent, it is, after all, still an approximation of some (possibly) non-linear function that decodes the model’s residual stream accurately, if it is even possible to do so.
Future Work
We hypothesized that the unembedding vectors of tokens like ‘_myſelf’, ‘_pleaſure’ are used by the model to help softmax flatten spurious probabilities. Dissecting the effect of other such behaviours of the unembedding matrix is an interesting direction for future work.
The norms of these tokens could be correlated with the frequencies with which they occur during pre-training. We checked for such correlations on the pythia series of models, but did not find a strong correlation. Chung et al. [8] report interesting observations on how vocabulary frequency might affect language model pre-training.
To use Logit Lens for interpretability more concretely, it is important to understand the unembedding matrix better. Although it depends on various parameters and the training setup, we could still study the common behaviours it performs and modify Logit Lens specifically for interpretability.
We found that these high-norm tokens appear in the Logit Lens very often in the residual stream of special tokens like <start_of_turn> etc. or when the model is prompted with a word limit constraint. Why does this happen? Is it because there are some ‘response-length circuits’ that are interfering with the decoding? Or does it simply mean that the computations going on at those positions cannot be translated into tokens (which was the implicit assumption we were making all along)?
These are interesting directions we would like to explore in the future.
In our next blog, we will share some of our key findings in circuit analysis to detect hallucinations and some interesting ‘circuits’ where the model consistently pulls up various possible answers to the relation, observation of ‘defaulting’ behavior and hallucination due to domination/interference of other circuits with factual recall.
Authors: Pakshal Nagda*, Smitali Bhandari*, Jayden Koshy Joe
Project Viveka,
AI Club, Centre for Innovation
IIT Madras
*primary contributors
References
[1] nostalgebraist. interpreting GPT: the logit lens, LessWrong, 2020. URL: https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens
[2] Danny Halawi, Jean-Stanislas Denain, Jacob Steinhardt. Overthinking the Truth: Understanding how Language Models process False Demonstrations. URL: https://openreview.net/forum?id=em4xg1Gvxa
[3] beren, Sid Black. The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable, LessWrong, 2022. URL: https://www.lesswrong.com/posts/mkbGjzxD8d8XqKHzA/the-singular-value-decompositions-of-transformer-weigh
[4] Ang Lv, Yuhan Chen, Kaiyi Zhang, Yulong Wang, Lifeng Liu, Ji-Rong Wen, Jian Xie, Rui Yan. Interpreting Key Mechanisms of Factual Recall in Transformer-Based Language Models. URL: https://arxiv.org/pdf/2403.19521
[5] Bilal Chughtai, Alan Cooney, Neel Nanda. Summing Up The Facts: Additive Mechanisms Behind Factual Recall in LLMs. URL: https://arxiv.org/abs/2402.07321
[6] Samuel Marks, Max Tegmark. The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets. URL: https://openreview.net/forum?id=CeJEfNKstt
[7] Nora Belrose, Igor Ostrovsky, Lev McKinney, Zach Furman, Logan Smith, Danny Halawi, Stella Biderman, Jacob Steinhardt. Eliciting Latent Predictions from Transformers with the Tuned Lens. URL: https://arxiv.org/abs/2303.08112
[8] Woojin Chung, Jeonghoon Kim. Exploiting Vocabulary Frequency Imbalance in Language Model Pre-training. URL: https://arxiv.org/abs/2508.15390v1










