Algorithm Visualization and Interactive Learning Platform

Understanding the Transformer: How Self-Attention Lets Words See Each Other

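Starting from a minimal word-sequence task, this module breaks down Q/K/V, scaled dot-product attention, softmax, information mixing, multi-head attention, and the Tiny Transformer training process.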

Deep Learning | Intermediate | Free

Kernel

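Start with the goal of this lesson

This module uses a tiny task to explain the Transformer: given the first two words, predict the third, for example cat likes -> fish. So that the formulas, matrices, code, tables, and spatial plots line up one to one, the first 6 cards all build on the same hand-computable example, with one semantically distant token, cloud, added as a contrast: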

python

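import numpy as np

words = ["cat", "likes", "fish", "cloud"]
X = np.array([
    [1.0, 0.0],    # cat
    [0.5, 0.5],    # likes
    [0.0, 1.0],    # fish
    [-0.8, 0.9],   # cloud: distant contrast token
])
Wq = np.array([[1.0, 0.0], [0.0, 1.0]])  # identity: Q stays equal to X
Wk = np.array([[1.0, 0.2], [0.2, 1.0]])  # slight mixing of the two coordinates
Wv = np.array([[0.9, 0.1], [0.1, 0.9]])  # slight smoothing of the value points
Q = X @ Wq  # shape (4, 2)
K = X @ Wk  # shape (4, 2)
V = X @ Wv  # shape (4, 2)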

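Here Wq, Wk, and Wv are learned linear projection matrices, that is, linear transformations acting on a (typically high-dimensional) vector space. They project the same X into three spaces, Query, Key, and Value: Q asks, K is matched against, and V supplies the information that actually gets aggregated. In a real Transformer these matrices are model parameters: they are usually randomly initialized and then learned by gradient descent during training.

For hand calculation and plotting, this section fixes a set of 2x2 matrices by hand. They are not the result of training, and they are not the only correct choice; they are interpretable parameters chosen for teaching: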

matrix	why this value
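`Wq = [[1.0, 0.0], [0.0, 1.0]]`	identity matrix, so `Q` stays equal to `X` and it is easy to see "who is asking"
`Wk = [[1.0, 0.2], [0.2, 1.0]]`	mixes the two coordinates slightly so that `K` differs from `X`; this also shifts how `cloud` scores against the main-sentence tokens (for example fish -> cloud drops from 0.90 to 0.74, while cat -> cloud moves from -0.80 to -0.62)
`Wv = [[0.9, 0.1], [0.1, 0.9]]`	smooths the Value points slightly so that the movement of the `attention @ V` output is easier to see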

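In other words, these three matrices are fixed teaching parameters here; in the later Tiny Transformer experiment they become trainable weights in nn.Linear. The core intuition: attention does not compute relations directly in the original X space. It computes them in the projected Q/K spaces and then uses the resulting weights to mix V.

In this toy example, each token vector can also be viewed, for now, as a point in the 2D plane: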

token	X point	Q point	K point	V point
cat	`(1.0, 0.0)`	`(1.0, 0.0)`	`(1.0, 0.2)`	`(0.9, 0.1)`
likes	`(0.5, 0.5)`	`(0.5, 0.5)`	`(0.6, 0.6)`	`(0.5, 0.5)`
fish	`(0.0, 1.0)`	`(0.0, 1.0)`	`(0.2, 1.0)`	`(0.1, 0.9)`
cloud	`(-0.8, 0.9)`	`(-0.8, 0.9)`	`(-0.62, 0.74)`	`(-0.63, 0.73)`
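The table can be reproduced with a few lines of plain Python. This is a minimal sketch that uses only built-ins so every multiplication stays visible; the helper name `matmul` is introduced here for illustration and is not part of the lesson.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

words = ["cat", "likes", "fish", "cloud"]
X = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [-0.8, 0.9]]
Wq = [[1.0, 0.0], [0.0, 1.0]]
Wk = [[1.0, 0.2], [0.2, 1.0]]
Wv = [[0.9, 0.1], [0.1, 0.9]]

# Row-vector convention: each token row is multiplied by the projection matrix.
Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)

for word, q, k, v in zip(words, Q, K, V):
    # Rounding is for display only.
    print(word, [round(c, 2) for c in q], [round(c, 2) for c in k], [round(c, 2) for c in v])
```

Because Wq is the identity, the Q column repeats the X column exactly; only K and V move.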

You will follow this chain through the whole of Self-Attention:

X points -> Q/K/V projected points -> score graph -> attention graph -> output points

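Keep three things in mind while studying: first, the 4 comes from the number of tokens and the 2 from the dimension of each token vector; second, 2D points are only an aid for building spatial intuition; third, the score in attention is not a Euclidean distance but the dot-product similarity between a Query and a Key.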

Scaled Dot-Product Attention

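The standard form has three steps: S holds the raw attention scores, A is the attention matrix after row-wise softmax, and O is the new token representation after mixing the Values. In the toy example, each row of Q, K, and V is a 2D point; S and A are [4,4] matrices of directed edge weights; O holds the 4 updated 2D points.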

\begin{aligned}S&=\frac{QK^{\top}}{\sqrt{d_k}}\\A&=\operatorname{softmax}_{\text{row}}(S)\\O&=AV\end{aligned}

symbol	meaning	short label
Q	Query matrix; 4 query points in 2D, Q.shape = [4,2]	query points
K	Key matrix; 4 key points in 2D, K.shape = [4,2]	key points
V	Value matrix; 4 value points in 2D, V.shape = [4,2]	value points
S	complete directed score graph, S.shape = [4,4]	raw score graph
A	normalized edge-weight graph, A.shape = [4,4]	attention graph
O	updated 2D token points, O.shape = [4,2]	updated token points
d_k	dimension of key/query in each head; d_k = 2 in the toy example	key/query dimension
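The three steps S, A, O can be sketched as one small function. This is an illustrative sketch in plain Python under the toy example's values; the helper names `dot`, `softmax`, and `attention` are assumptions made for this sketch, not the platform's code.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(row):
    """Softmax over one row; subtracting the max avoids overflow."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Return (S, A, O) for scaled dot-product attention."""
    d_k = len(K[0])
    S = [[dot(q, k) / math.sqrt(d_k) for k in K] for q in Q]   # [n, n] scaled scores
    A = [softmax(row) for row in S]                            # each row sums to 1
    O = [[sum(a * v[d] for a, v in zip(row, V)) for d in range(len(V[0]))]
         for row in A]                                         # [n, d_v] mixed values
    return S, A, O

Q = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [-0.8, 0.9]]
K = [[1.0, 0.2], [0.6, 0.6], [0.2, 1.0], [-0.62, 0.74]]
V = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9], [-0.63, 0.73]]

S, A, O = attention(Q, K, V)
print(len(S), len(S[0]))   # 4 4
print(len(A), len(A[0]))   # 4 4
print(len(O), len(O[0]))   # 4 2
```

The shapes match the table: S and A are [4,4] because every token scores every token, while O returns to [4,2] because each output is a mixture of 2D value points.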

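From similarity to attention weights

Row i, column j is the score of the directed edge from the i-th token point to the j-th token point. In the toy example n=4 and d_k=2, so Q @ K^T has shape [4,4]. Looking one level deeper, QK^T does not measure distance in the original X space; it computes dot-product similarity in the projected Q/K spaces: q_i^T k_j = ||q_i|| ||k_j|| cos(theta). For fixed vector lengths, the more aligned the directions, the closer cos(theta) is to 1 and the larger the score; near-perpendicular directions give a score near 0; opposite directions can give a negative score.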

\begin{aligned}s_{ij}&=\frac{q_i^{\top}k_j}{\sqrt{d_k}}\\a_{ij}&=\frac{\exp(s_{ij})}{\sum_{m=1}^{n}\exp(s_{im})}\\\sum_{j=1}^{n}a_{ij}&=1\end{aligned}

symbol	meaning	short label
s_{ij}	scaled score from token i to token j, before softmax	raw attention score from token i to token j
a_{ij}	attention weight after row-wise softmax; each row sums to 1	attention weight after softmax
\sqrt{d_k}	scaling term that keeps dot-product scores from growing too large and making softmax too peaked	scaling factor to keep softmax numerically healthy
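The effect of the scaling term can be seen on a single row. The sketch below, in plain Python, applies softmax to the cat row of the toy example with and without dividing by sqrt(d_k); the helper name `softmax` is introduced here for illustration.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

raw_cat = [1.0, 0.6, 0.2, -0.62]   # q_cat dot each key: cat, likes, fish, cloud
d_k = 2

unscaled = softmax(raw_cat)
scaled = softmax([s / math.sqrt(d_k) for s in raw_cat])

print([round(a, 3) for a in unscaled])   # [0.431, 0.289, 0.194, 0.085]
print([round(a, 3) for a in scaled])     # [0.379, 0.286, 0.215, 0.12]
```

Dividing by sqrt(2) flattens the distribution slightly: the largest weight drops from about 0.431 to 0.379. With d_k = 2 the effect is mild; it matters more as d_k grows and raw dot products get larger.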

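The core intuition of Self-Attention

1. No token is understood in isolation

In cat likes fish, likes needs to consult both cat and fish: cat answers who likes, and fish answers likes what. We add cloud not because it belongs to the sentence but as a distant contrast, to observe how attention reduces the influence of irrelevant information.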

2. Treat tokens as 2D points for now

token	toy vector	2D point	possible role
`cat`	`[1.0, 0.0]`	`(1.0, 0.0)`	subject
`likes`	`[0.5, 0.5]`	`(0.5, 0.5)`	relation
`fish`	`[0.0, 1.0]`	`(0.0, 1.0)`	object
`cloud`	`[-0.8, 0.9]`	`(-0.8, 0.9)`	distant contrast

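Here cat lies on the x-axis, fish lies on the y-axis, likes sits midway between them, and cloud is placed in the upper left as a contrast point away from the main-sentence relations. This spatial intuition is useful, but do not mistake attention for Euclidean distance; what gets computed later is dot-product similarity.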

3. Self-Attention produces a directed graph

graph object	attention meaning
node	token vector as a 2D point
directed edge	one token attending to another token
edge weight	attention value
node update	weighted sum of value points

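Self-Attention does not draw a graph by hand and then compute; the matrix computation itself yields an attention graph. The score matrix holds the raw directed edge scores, and the attention matrix holds the normalized edge weights.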

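4. But the 3D plot is only a viewing window

Real token vectors usually live in spaces with tens, hundreds, or even thousands of dimensions. The module's 3D graph uses PCA to compress the high-dimensional representation into three dimensions so that changes in relations are easier to see. It does not claim that words really live in 3D; it opens a visualization window onto a high-dimensional training process.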

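Q, K, V: what each of the three matrices is asking

1. Standard form: Q = XWq, K = XWk, V = XWv

More precisely, the original token embedding X is not used directly as Query, Key, and Value at once; it first passes through three different linear projections. cloud is added here as well, to see how the score and attention of a distant token change.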

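import numpy as np

X = np.array([
    [1.0, 0.0],    # cat
    [0.5, 0.5],    # likes
    [0.0, 1.0],    # fish
    [-0.8, 0.9],   # cloud
])

Wq = np.array([[1.0, 0.0], [0.0, 1.0]])
Wk = np.array([[1.0, 0.2], [0.2, 1.0]])
Wv = np.array([[0.9, 0.1], [0.1, 0.9]])

Q = X @ Wq  # query points, shape (4, 2)
K = X @ Wk  # key points, shape (4, 2)
V = X @ Wv  # value points, shape (4, 2)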

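2. The three projection spaces answer different questions

matrix	question	role in attention
`Q`	What am I looking for right now?	query point emitted by the source token
`K`	How should I be matched?	matching point offered by the target token
`V`	If others attend to me, what information do I contribute?	information point that is finally weighted and summed

An everyday analogy is hiring: the Query is like "someone who knows Python" in a job requirement, the Key is like the skill tags on a candidate's resume, and the Value is the project experience and ability the candidate actually brings. Job requirement, matching tags, and real information are not the same thing, so the model needs three spaces. The dot product that follows is likewise not taken in the original X space; it compares whether directions agree in the projected Q/K spaces.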

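3. Why not use Q = K = V = X

If Q = K = V = X, every token has to use one and the same representation to ask, to be matched, and to contribute information, which limits expressiveness. With Wq/Wk/Wv, the same word gets three views: in Query space it expresses what it is looking for, in Key space how others can find it, and in Value space what it actually passes on.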
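One concrete consequence can be checked by hand: when Q = K = X, the raw score matrix X @ X.T is always symmetric, so token i scores token j exactly as j scores i. The sketch below contrasts that with a hypothetical non-symmetric key projection `Wk_asym`, which is an assumption made for illustration and does not appear in the lesson. (The lesson's own Wq and Wk, an identity and a symmetric matrix, also happen to give a symmetric raw score table.)

```python
def project(X, W):
    """Row-vector projection X @ W."""
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*W)] for row in X]

def scores(Q, K):
    """Raw dot-product scores: entry [i][j] is q_i . k_j."""
    return [[sum(a * b for a, b in zip(q, k)) for k in K] for q in Q]

X = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [-0.8, 0.9]]   # cat, likes, fish, cloud

# Case 1: Q = K = X, so the score matrix is symmetric.
S_shared = scores(X, X)

# Case 2: keep Q = X but project the keys with a hypothetical non-symmetric matrix.
Wk_asym = [[1.0, 0.5], [0.0, 1.0]]   # illustrative only
S_proj = scores(X, project(X, Wk_asym))

print(S_shared[0][2], S_shared[2][0])   # cat->fish, fish->cat: 0.0 0.0
print(S_proj[0][2], S_proj[2][0])       # cat->fish, fish->cat: 0.0 0.5
```

With separate projections the edge fish -> cat can score differently from cat -> fish, which is what makes the attention graph genuinely directed.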

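4. A linear transformation reorganizes the semantic space

transform	geometric effect	attention intuition
rotation	changes direction	changes which relations align more easily
stretch	strengthens some dimensions	amplifies important features
compression	shrinks some dimensions	weakens irrelevant features
shear	changes structural relations	rearranges the relative positions of tokens

So Wq/Wk/Wv are not just "three extra matrix multiplications"; they learn which spatial structure is most useful for prediction.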
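Each row of the table corresponds to a simple 2x2 matrix. The sketch below applies four such matrices to the likes point (0.5, 0.5), using the same row-vector convention as Q = XWq. The specific matrices and the helper name `apply` are illustrative assumptions, not values used by the lesson.

```python
import math

def apply(point, W):
    """Row-vector convention: new_point = point @ W."""
    x, y = point
    return (x * W[0][0] + y * W[1][0], x * W[0][1] + y * W[1][1])

theta = math.pi / 2   # 90 degrees, counterclockwise under the row-vector convention
transforms = {
    "rotation": [[math.cos(theta), math.sin(theta)],
                 [-math.sin(theta), math.cos(theta)]],
    "stretch": [[2.0, 0.0], [0.0, 1.0]],       # doubles the first coordinate
    "compression": [[1.0, 0.0], [0.0, 0.5]],   # halves the second coordinate
    "shear": [[1.0, 0.0], [0.5, 1.0]],         # adds half of y to x
}

likes = (0.5, 0.5)
for name, W in transforms.items():
    new_point = apply(likes, W)
    print(name, tuple(round(c, 3) for c in new_point))
```

A learned projection is generally a combination of these effects rather than any single one of them.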

5. Compute the Q/K/V points directly for the same example

token	X point	Q = XWq	K = XWk	V = XWv
`cat`	`(1.0, 0.0)`	`(1.0, 0.0)`	`(1.0, 0.2)`	`(0.9, 0.1)`
`likes`	`(0.5, 0.5)`	`(0.5, 0.5)`	`(0.6, 0.6)`	`(0.5, 0.5)`
`fish`	`(0.0, 1.0)`	`(0.0, 1.0)`	`(0.2, 1.0)`	`(0.1, 0.9)`
`cloud`	`(-0.8, 0.9)`	`(-0.8, 0.9)`	`(-0.62, 0.74)`	`(-0.63, 0.73)`

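Note: Wq is set to the identity here so that the Query points coincide with the original X points; Wk and Wv move the Key and Value points slightly, which shows that the same token can sit at different positions in the three spaces.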

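Figure

2D observation deck: token positions, distances, vectors, and Attention

Observe the token points directly in 2D coordinates: X is the original space, Q the asking space, K the matching space, V the information space, and O the output space after attention @ V. Attention is not computed in the original X space; it is computed between Q and K after projection. Edge width in Scores mode is dot-product similarity, not Euclidean distance. Chart text is in English, and the matrices and coordinate tables correspond one to one with the cat / likes / fish / cloud example on cards 1 to 6.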

Raw scores: Q @ K.T

from \ to	cat	likes	fish	cloud
cat	1.000	0.600	0.200	-0.620
likes	0.600	0.600	0.600	0.060
fish	0.200	0.600	1.000	0.740
cloud	-0.620	0.060	0.740	1.162

Attention: softmax(scores / sqrt(d_k)), with d_k = 2

from \ to	cat	likes	fish	cloud
cat	0.379	0.286	0.215	0.120
likes	0.272	0.272	0.272	0.185
fish	0.180	0.239	0.317	0.264
cloud	0.114	0.185	0.299	0.403

Vector points and updated points

token	X point	Q point	K point	V point	O point
cat	(1.000, 0.000)	(1.000, 0.000)	(1.000, 0.200)	(0.900, 0.100)	(0.429, 0.462)
likes	(0.500, 0.500)	(0.500, 0.500)	(0.600, 0.600)	(0.500, 0.500)	(0.291, 0.543)
fish	(0.000, 1.000)	(0.000, 1.000)	(0.200, 1.000)	(0.100, 0.900)	(0.147, 0.615)
cloud	(-0.800, 0.900)	(-0.800, 0.900)	(-0.620, 0.740)	(-0.630, 0.730)	(-0.029, 0.666)

Distance, dot product, and attention

pair	X distance	q dot k	attention
cat -> cat	0.000	1.000	0.379
cat -> likes	0.707	0.600	0.286
cat -> fish	1.414	0.200	0.215
cat -> cloud	2.012	-0.620	0.120
likes -> cat	0.707	0.600	0.272
likes -> likes	0.000	0.600	0.272
likes -> fish	0.707	0.600	0.272
likes -> cloud	1.360	0.060	0.185
fish -> cat	1.414	0.200	0.180
fish -> likes	0.707	0.600	0.239
fish -> fish	0.000	1.000	0.317
fish -> cloud	0.806	0.740	0.264
cloud -> cat	2.012	-0.620	0.114
cloud -> likes	1.360	0.060	0.185
cloud -> fish	0.806	0.740	0.299
cloud -> cloud	0.000	1.162	0.403
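Two rows of the table above show clearly that score is not distance. The sketch below recomputes them in plain Python; `x_distance` and `score` are helper names introduced for this illustration.

```python
import math

X = {"cat": (1.0, 0.0), "likes": (0.5, 0.5), "fish": (0.0, 1.0), "cloud": (-0.8, 0.9)}
K = {"cat": (1.0, 0.2), "likes": (0.6, 0.6), "fish": (0.2, 1.0), "cloud": (-0.62, 0.74)}

def x_distance(a, b):
    """Euclidean distance between two tokens in the original X space."""
    return math.dist(X[a], X[b])

def score(src, dst):
    """Raw dot product q_src . k_dst; Q equals X here because Wq is the identity."""
    return sum(q * k for q, k in zip(X[src], K[dst]))

for dst in ("likes", "cloud"):
    print("fish ->", dst, round(x_distance("fish", dst), 3), round(score("fish", dst), 3))
# fish -> likes 0.707 0.6
# fish -> cloud 0.806 0.74
```

cloud is farther from fish than likes is, yet it receives the higher score, because its Key points more nearly in the direction of fish's Query.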

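attention @ V: propagating information along the relation graph

1. The scores matrix comes from dot products of Query points and Key points

scores = Q @ K.T means: take each Query point, dot it with every Key point, and obtain the raw attention score graph. With cloud added, the matrix grows from 3x3 to 4x4 but can still be computed by hand.

The full chain is: Q · K -> dot-product similarity, softmax -> attention weights, and finally attention @ V -> information mixing. So the matrix multiplication is not computing "distance"; it computes the dot-product similarity between all pairs of tokens in one shot.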

Q = [
    [1.0, 0.0],    # q_cat
    [0.5, 0.5],    # q_likes
    [0.0, 1.0],    # q_fish
    [-0.8, 0.9],   # q_cloud
]

K.T = [
    [1.0, 0.6, 0.2, -0.62],
    [0.2, 0.6, 1.0,  0.74],
]

Q      @     K.T
[4,2]        [2,4]
   ->        [4,4]

For example, the directed edge cat -> cloud:

q_cat   = (1.0, 0.0)
k_cloud = (-0.62, 0.74)
raw_score(cat -> cloud) = q_cat dot k_cloud
                         = 1.0 * -0.62 + 0.0 * 0.74
                         = -0.62

2. raw scores: matrix cells and graph edges correspond one to one

current token	to cat	to likes	to fish	to cloud
`cat`	1.0	0.6	0.2	-0.62
`likes`	0.6	0.6	0.6	0.06
`fish`	0.2	0.6	1.0	0.74
`cloud`	-0.62	0.06	0.74	1.162

row = source token / Query, column = target token / Key. For example, the cell at row cat, column cloud is the edge q_cat -> k_cloud in the graph. This negative score makes cat's attention to cloud the smallest in its row.

3. Scaled scores then pass through row-wise softmax

The standard form is S = (QK.T) / sqrt(d_k). In the toy example d_k = 2, so divide by sqrt(2) first and then apply softmax to each row:

attention = softmax((Q @ K.T) / sqrt(2))
attention = [
    [0.379, 0.286, 0.215, 0.120],
    [0.272, 0.272, 0.272, 0.185],
    [0.180, 0.239, 0.317, 0.264],
    [0.114, 0.185, 0.299, 0.403],
]

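This matrix shows how each source token distributes 100% of its attention over all target tokens. Note that in the first row cat -> cloud is only 0.120, lower than the 0.379 of cat -> cat.

4. attention @ V: update the position of each point

The standard mathematical form is $o_i = \sum_{j=1}^{n} a_{ij}v_j$. Spatially it can be read as: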

output point of token i = weighted average of value points

Take cat as an example; note that what gets mixed are the V points, not the original X points:

v_cat   = (0.9, 0.1)
v_likes = (0.5, 0.5)
v_fish  = (0.1, 0.9)
v_cloud = (-0.63, 0.73)

out_cat
= 0.379 * (0.9, 0.1) + 0.286 * (0.5, 0.5)
+ 0.215 * (0.1, 0.9) + 0.120 * (-0.63, 0.73)
≈ (0.429, 0.462)    (using unrounded weights; the 3-decimal weights shown here give (0.430, 0.462))

Take likes as an example:

out_likes
= 0.272 * (0.9, 0.1) + 0.272 * (0.5, 0.5)
+ 0.272 * (0.1, 0.9) + 0.185 * (-0.63, 0.73)
≈ (0.291, 0.543)

token	V point	attention row	output point
`cat`	`(0.900, 0.100)`	`[0.379, 0.286, 0.215, 0.120]`	`(0.429, 0.462)`
`likes`	`(0.500, 0.500)`	`[0.272, 0.272, 0.272, 0.185]`	`(0.291, 0.543)`
`fish`	`(0.100, 0.900)`	`[0.180, 0.239, 0.317, 0.264]`	`(0.147, 0.615)`
`cloud`	`(-0.630, 0.730)`	`[0.114, 0.185, 0.299, 0.403]`	`(-0.029, 0.666)`
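The output points are sensitive to rounding in the third decimal. The sketch below, in plain Python, computes out_cat twice: once with unrounded attention weights and once with the weights rounded to three decimals as displayed. The helper names `softmax` and `mix` are assumptions made for this illustration.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def mix(weights, V):
    """Weighted sum of value points: o = sum_j a_j * v_j."""
    return tuple(sum(w * v[d] for w, v in zip(weights, V)) for d in range(len(V[0])))

q_cat = (1.0, 0.0)
K = [(1.0, 0.2), (0.6, 0.6), (0.2, 1.0), (-0.62, 0.74)]
V = [(0.9, 0.1), (0.5, 0.5), (0.1, 0.9), (-0.63, 0.73)]

raw = [sum(q * k for q, k in zip(q_cat, key)) for key in K]
exact = softmax([s / math.sqrt(2) for s in raw])   # unrounded attention row for cat
rounded = [round(a, 3) for a in exact]             # the row as displayed in the table

print(tuple(round(c, 3) for c in mix(exact, V)))     # (0.429, 0.462)
print(tuple(round(c, 3) for c in mix(rounded, V)))   # (0.43, 0.462)
```

Both results describe the same point; the difference of 0.001 in the first coordinate comes only from rounding the weights before mixing.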

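5. From spatial structure to relation propagation

spatial view	Transformer view
spatial coordinates	embedding / token point
spatial transformation	`Wq/Wk/Wv`
neighborhood relation	attention matrix `A`
structure propagation	`attention @ V`

In one sentence: Self-Attention first uses Query and Key to decide the relation weights, then uses those weights to mix the Values; a token whose Key scores low against a given Query, as cloud does for cat, contributes less to that output. What the Transformer learns is a representation of relational structure that adapts during training.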

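From Self-Attention to Tiny Transformer

1. Self-Attention is not yet a complete Transformer

Cards 1 to 6 covered Q/K/V -> S -> A -> O: Self-Attention answers whom each token should listen to and how much, and mixes the Value information back in. A small trainable Transformer additionally needs to turn tokens into vectors, add position information, stack a feed-forward network, and output predictions.

2. Basic flow of the Tiny Transformer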

token ids
-> embedding + position encoding
-> multi-head self-attention
-> add & norm
-> feed forward network
-> output logits

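The model in the lab learns a very small next-token task, for example cat likes -> fish. The goal is not large-model quality but to let you watch how loss, predicted probabilities, attention, Q/K/V, and the word graph change together during training.

3. Why Position Encoding is needed

Self-Attention by itself mainly computes relations between tokens. Without position information, the model has a hard time telling apart cat likes fish and fish likes cat, which contain the same words in a different order. The Use Position switch in the lab lets you observe whether position encoding takes part in the representation.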
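The order blindness can be demonstrated with the toy matrices from the earlier cards. The sketch below runs unmasked self-attention, without any position encoding, on "cat likes fish" and on "fish likes cat"; the function names are assumptions made for this illustration.

```python
import math

def project(X, W):
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*W)] for row in X]

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X, Wq, Wk, Wv):
    """Unmasked single-head self-attention with no position information."""
    Q, K, V = project(X, Wq), project(X, Wk), project(X, Wv)
    d_k = len(K[0])
    A = [softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]) for q in Q]
    return [[sum(a * v[d] for a, v in zip(row, V)) for d in range(len(V[0]))] for row in A]

Wq = [[1.0, 0.0], [0.0, 1.0]]
Wk = [[1.0, 0.2], [0.2, 1.0]]
Wv = [[0.9, 0.1], [0.1, 0.9]]
cat, likes, fish = [1.0, 0.0], [0.5, 0.5], [0.0, 1.0]

out_a = self_attention([cat, likes, fish], Wq, Wk, Wv)   # "cat likes fish"
out_b = self_attention([fish, likes, cat], Wq, Wk, Wv)   # "fish likes cat"

# The output for cat is the same in both sentences; only its row index changes.
print([round(c, 3) for c in out_a[0]], [round(c, 3) for c in out_b[2]])
```

Reordering the input only reorders the output rows, so any layer that reads these vectors sees the same representation for each word in both sentences. Position encoding breaks this symmetry by making the input vectors themselves depend on position.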

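4. Why Multi-Head is needed

One head can be seen as one relation space; multiple heads are several sets of Wq/Wk/Wv working in parallel, each observing different relations in a different subspace. For example, one head may focus more on subject and object, while another focuses more on local order or semantic similarity. The "spatial transformation" discussed earlier becomes, in multi-head attention, several geometric views running in parallel.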
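A small sketch of the splitting step, in plain Python. It assumes the common implementation in which Q and K are first projected to d_model dimensions and then cut into equal chunks, one per head. The Q and K values below are hypothetical, chosen so that the two heads disagree; they are not produced by the lesson's model.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def head_attention(Q, K):
    """Attention weights for one head."""
    d_k = len(K[0])
    return [softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]) for q in Q]

def split_heads(M, num_heads):
    """Cut each row of length d_model into num_heads chunks of d_model // num_heads."""
    d_head = len(M[0]) // num_heads
    return [[row[h * d_head:(h + 1) * d_head] for row in M] for h in range(num_heads)]

# Hypothetical projected queries and keys for two tokens, d_model = 4.
Q = [[1.0, 0.0, 0.0, 1.0],   # cat
     [0.0, 1.0, 1.0, 0.0]]   # likes
K = [[1.0, 0.0, 1.0, 0.0],   # cat
     [0.0, 1.0, 0.0, 1.0]]   # likes

heads = [head_attention(Qh, Kh) for Qh, Kh in zip(split_heads(Q, 2), split_heads(K, 2))]
for h, A in enumerate(heads, start=1):
    print("head", h, [[round(a, 3) for a in row] for row in A])
```

With these values, head 1 makes each token attend mostly to itself, while head 2 makes each token attend mostly to the other one: the same tokens yield two different relation graphs.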

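5. How the lab parameters map to the model

parameter	meaning
`D Model`	dimension of each token embedding
`Heads`	number of parallel attention heads
`Learning Rate`	step size of each gradient-descent parameter update
`Epochs`	number of training epochs
`Use Position`	whether to add position encoding
`Edge Threshold`	threshold for showing attention edges in the 3D word graph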

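The training lab that follows turns the static, hand-computed Self-Attention above into a Tiny Transformer whose parameters update and whose training can be observed.

Tiny Transformer training lab

This lab trains a very small Tiny Transformer on 6 three-word sentences for next-token prediction. The input is the first two tokens and the target is the third.

Training corpus:
1. cat likes -> fish
2. dog likes -> bone
3. bird likes -> worm
4. cat eats -> fish
5. dog eats -> bone
6. bird eats -> worm

Vocabulary: bird, bone, cat, dog, eats, fish, likes, worm.

The model flow is embedding + optional position encoding -> multi-head self-attention -> feed forward -> output logits.

Training parameters affect the next run: D Model sets the embedding dimension, Heads the number of parallel attention heads, Learning Rate the gradient-descent step size, Epochs the number of training epochs, Watch Every the interval between saved observation frames, Use Position whether position encoding is added, and Seed the random initialization. Observation parameters are enabled after a run: View Epoch switches between saved training frames, and Edge Threshold filters the edges of the 3D attention graph. Dragging an observation parameter only updates the charts and values; it does not retrain the model.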
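Before training, the corpus has to become integer ids. The sketch below shows one plausible encoding in plain Python. Assigning ids by alphabetical order is an assumption made here (it happens to match the order in which the vocabulary is listed); the platform's actual id mapping is not stated.

```python
corpus = [
    ("cat", "likes", "fish"), ("dog", "likes", "bone"), ("bird", "likes", "worm"),
    ("cat", "eats", "fish"), ("dog", "eats", "bone"), ("bird", "eats", "worm"),
]

# Assumed mapping: ids follow the alphabetical order of the vocabulary.
vocab = sorted({word for sentence in corpus for word in sentence})
token_id = {word: i for i, word in enumerate(vocab)}

# Inputs are the first two token ids; the target is the id of the third token.
inputs = [[token_id[a], token_id[b]] for a, b, _ in corpus]
targets = [token_id[c] for _, _, c in corpus]

print(vocab)
print(inputs[0], "->", targets[0])   # cat likes -> fish, as ids: [2, 6] -> 5
```

In this corpus the target depends only on the first token (cat -> fish, dog -> bone, bird -> worm), so the second position has to learn to attend back to the first.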

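Parameter Panel (9 Params)

parameter	value
D Model	16
Heads	2
Learning Rate	0.03
Epochs	50
View Epoch	25
Watch Every	20
Edge Threshold	0.16
Use Position	toggle (usePositionEncoding)
Seed	7

Run the Tiny Transformer first. Once the run completes, View Epoch and Edge Threshold are enabled; dragging them only switches between saved training frames and edge thresholds and does not retrain the model.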
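The training parameters in the panel can be gathered into one configuration object. This is a hypothetical sketch: the class name, field names, and the divisibility check are assumptions modeled on common multi-head implementations, not the platform's code. The state of the Use Position toggle is not shown in the panel, so it must be passed explicitly.

```python
from dataclasses import dataclass

@dataclass
class TinyTransformerConfig:
    use_position: bool          # toggle state is not shown in the panel
    d_model: int = 16
    heads: int = 2
    learning_rate: float = 0.03
    epochs: int = 50
    watch_every: int = 20
    seed: int = 7

    @property
    def d_head(self) -> int:
        """Per-head dimension when d_model is split evenly across heads."""
        if self.d_model % self.heads != 0:
            raise ValueError("d_model must be divisible by heads")
        return self.d_model // self.heads

config = TinyTransformerConfig(use_position=True)   # True chosen only for illustration
print(config.d_head)   # 8
```

With D Model = 16 and Heads = 2, each head would work in an 8-dimensional subspace under this assumption.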

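Self-Attention structure, matrices, and projection-space observation

This observation card reads the transformer_trace saved after the Tiny Transformer training lab has run. Before a run it shows an example frame, so the epoch slider is greyed out; after a run you can drag the epoch, switch the input -> target pair and the attention head, and observe the current sample's Q/K/V projection spaces, attention flow, attention heatmap, scores/Q/K/V matrices, and predicted probabilities.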

field	value
epoch	0
loss	1.9200
input -> target	cat likes -> fish
prediction	bone

Q/K/V Projection Geometry + Attention Flow

The same input tokens are projected into Q, K, and V spaces. Edge width shows attention weight; ring points are mixed outputs from attention @ V.

Panels: Q Space: what each token asks for | K Space: how each token can be matched | Attention Flow: Q points look at K points | V Space and mixed outputs

Controls: view epoch | input -> target | attention head

Predicted probabilities

token	probability
fish	0.20
bone	0.24
worm	0.18
likes	0.13

Self-Attention Heatmap | Head 1

from \ to	cat	likes
cat	0.55	0.45
likes	0.36	0.64

Average Attention Across Heads

from \ to	cat	likes
cat	0.52	0.48
likes	0.43	0.56

scores = QK^T / sqrt(d)

[0.170, -0.040]
[-0.210, 0.360]

[0.340, -0.180, 0.280, 0.110]
[0.080, 0.310, -0.130, 0.240]

[0.160, -0.270, 0.180, 0.220]
[0.250, 0.100, -0.160, 0.300]

[0.220, 0.140, -0.100, 0.270]
[0.040, 0.300, 0.120, -0.200]

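What is shown here is the example frame. After you run the Tiny Transformer experiment above, this card switches to the attention, word vectors, and predictions produced by actual training.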

Extensions

Follow-up questions