General Q&A
Q: How do we get from attention heads to things like positional encoding or token-pair interactions?
An attention head computes weights that determine how much focus a token (word or subword) should give to every other token. With multiple heads, the model can capture several different kinds of relationships in parallel.
Visualizing these attention matrices as heatmaps is a powerful diagnostic; the resulting plots are called attention maps.
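As a concrete illustration, here is a minimal sketch of pulling out one head's attention matrix and plotting it as a heatmap (assuming GPT-2 via the Hugging Face transformers library; the layer and head indices are arbitrary):

```python
# Sketch: extract and plot one head's attention map (assumes GPT-2 via Hugging Face).
import matplotlib.pyplot as plt
import torch
from transformers import AutoTokenizer, GPT2Model

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

inputs = tokenizer("The cat chased the mouse", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

layer, head = 3, 0  # arbitrary choice for illustration
attn = outputs.attentions[layer][0, head]  # shape: (seq_len, seq_len)
labels = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())

plt.imshow(attn.numpy(), cmap="viridis")
plt.xticks(range(len(labels)), labels, rotation=90)
plt.yticks(range(len(labels)), labels)
plt.xlabel("attended-to token")
plt.ylabel("attending token")
plt.colorbar()
plt.show()
```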
Positional encoding is nothing but adding position information to the token embeddings so the model knows about token order.
So, a head that consistently attends to the immediately previous token suggests a role in positional encoding. Likewise, a head that attends to tokens at a fixed distance ±k can also give us valuable information.
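One rough way to look for such heads (again a sketch, assuming GPT-2 via Hugging Face transformers) is to score every head by how much attention mass it puts on the immediately previous token:

```python
# Sketch: score every head on how strongly it attends to the previous token
# (assumes GPT-2 via Hugging Face transformers).
import torch
from transformers import AutoTokenizer, GPT2Model

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

inputs = tokenizer("The cat chased the mouse across the garden", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

for layer, attn in enumerate(outputs.attentions):  # attn: (batch, heads, seq, seq)
    # Attention each query position pays to the position directly before it.
    prev_token_attn = attn[0].diagonal(offset=-1, dim1=-2, dim2=-1)  # (heads, seq-1)
    scores = prev_token_attn.mean(dim=-1)
    best = scores.argmax().item()
    print(f"layer {layer:2d}: strongest previous-token head = {best}, "
          f"score = {scores[best].item():.2f}")
```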
For token-pair interactions (subject-object type relationships), specific attention heads learn the relational patterns. To check how important a given head is to understanding what’s going on, swap the sentence “The cat chased the mouse” for “The mouse chased the cat”: the head’s attention pattern should change to reflect the new subject-object roles.
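A quick sanity check along these lines (a sketch under the same GPT-2/transformers assumptions; the inspected layer and head are arbitrary) compares one head's attention pattern on the original and role-swapped sentences:

```python
# Sketch: compare one head's attention pattern before and after swapping
# subject and object (assumes GPT-2 via Hugging Face transformers).
import torch
from transformers import AutoTokenizer, GPT2Model

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

def head_pattern(sentence, layer, head):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_attentions=True)
    return out.attentions[layer][0, head]  # (seq_len, seq_len)

layer, head = 5, 2  # hypothetical head to inspect
a = head_pattern("The cat chased the mouse", layer, head)
b = head_pattern("The mouse chased the cat", layer, head)

# Both sentences tokenize to the same length, so the patterns are directly comparable.
# A large difference suggests the head is sensitive to subject-object roles.
print("mean absolute change in attention:", (a - b).abs().mean().item())
```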
Q: How is path patching used to isolate causal effects of specific components?
Path patching is just a way of connecting the dots between an intermediate change and a different outcome from the model: you modify intermediate activations (pathways) during a forward pass.
Information in a neural network flows like this: embeddings → attention → feedforward layers → logits. Each operation along the way (e.g., what a particular attention head is up to) forms a path.
The basic idea is:
- Run the model forward once with normal activations. This is the original path.
- Replace activations at a specific point in the network with activations from a different input. This is the patched path.
- Analyze the effect on the model's output.
- Infer the causal role of that component.
So, we could generate a hypothesis such as “Head 5 in Layer 8 is responsible for encoding subject-object relationships.”
Then run the forward pass and patch that component by replacing its intermediate activations with activations from a baseline (a sentence without the subject-object relationship, noise, or other activations), and compare the patched output against the original run.
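Here is a minimal sketch of that patch, assuming the TransformerLens library and GPT-2 (hook names and helpers are from memory and should be checked against your installed version):

```python
# Sketch: patch one attention head's output with baseline activations and
# measure the effect on the logits (assumes TransformerLens and GPT-2;
# hook names are assumptions to verify against your installed version).
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")

clean_tokens = model.to_tokens("The cat chased the mouse")
baseline_tokens = model.to_tokens("The mouse chased the cat")  # same length, roles swapped

# Cache activations from the baseline run.
_, baseline_cache = model.run_with_cache(baseline_tokens)

LAYER, HEAD = 8, 5
act_name = utils.get_act_name("z", LAYER)  # per-head attention output: (batch, pos, head, d_head)

def patch_head(z, hook):
    # Overwrite only head 5's output with the baseline activation.
    z[:, :, HEAD, :] = baseline_cache[act_name][:, :, HEAD, :]
    return z

clean_logits = model(clean_tokens)
patched_logits = model.run_with_hooks(clean_tokens, fwd_hooks=[(act_name, patch_head)])

# A large shift in the final-position logits suggests this head carries
# causally relevant subject-object information.
print((patched_logits[0, -1] - clean_logits[0, -1]).abs().max().item())
```

Comparing the patched logits against the clean run is step three of the list above: the size of the shift is the evidence for or against the hypothesis.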
Q: What does the logit lens tell us about a model’s reasoning process at different depths?
At each layer of the model, the hidden activations are projected into vocabulary space using the same output (unembedding) weights the final layer uses. The result is a set of intermediate logits over the vocabulary.
These intermediate logits are interpreted as the model's partial predictions at that stage of reasoning, before the final output layer. As we go deeper into the model, the predictions typically sharpen and converge toward the final answer.
Here’s an example:
- Input: "The cat sat on the ___"
- Early layers: probability is spread across plausible completions like "mat," "chair," and "floor".
- Middle layers: P("mat") increases, perhaps because the model recognizes the idiomatic phrase.
- Final layers: P("mat") is essentially locked in as the prediction.
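A minimal logit-lens sketch of this example, assuming GPT-2 via Hugging Face transformers and that " mat" encodes to a single token:

```python
# Sketch: logit lens on "The cat sat on the" (assumes GPT-2 via Hugging Face
# transformers and that " mat" encodes to a single token).
import torch
from transformers import AutoTokenizer, GPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

inputs = tokenizer("The cat sat on the", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

mat_id = tokenizer.encode(" mat")[0]

# hidden_states[0] is the embedding output; hidden_states[i] is after block i.
for i, hidden in enumerate(out.hidden_states):
    # Project the last position's hidden state through the final layer norm
    # and the (tied) unembedding matrix, exactly as the final layer would.
    logits = model.lm_head(model.transformer.ln_f(hidden[:, -1, :]))
    prob = torch.softmax(logits, dim=-1)[0, mat_id].item()
    print(f"layer {i:2d}: P(' mat') = {prob:.4f}")
```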
Generally, early layers encode local relationships and are heavily influenced by immediate neighbours. It’s like being a teenager and basing your identity on your friends. The middle layers slowly begin to capture long-range dependencies and syntactic structure. Finally, the later layers focus on task-specific information and finalize the predictions.
If there’s overconfidence in the early layers, later layers can correct it. What’s interesting is that later layers often exhibit emergent behaviors, like complex reasoning, that aren’t usually present in early ones.