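04D-Layer4-Example-AttentionLayer to return
Scope
This document answers a single question:
How does x become Q/K/V inside AttentionLayer.forward(...), how do they enter FullAttention.forward(...), and how does the result get back to return self.out_projection(out), attn?
0. Entry interface
In the PatchTST Encoder, this layer actually runs as self-attention: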
python
new_x, attn = self.attention(x, x, x, attn_mask=attn_mask, tau=tau, delta=delta)
So on entry to AttentionLayer.forward(...):
text
queries = x
keys = x
values = x
The toy example in this document fixes:
text
B = 1
L = 2
S = 2
d_model = 4
n_heads = 2
head_dim = d_model // n_heads = 2
dropout = 0
mask_flag = False
The input x has 2 tokens, each with 4 dimensions:
text
x = queries = keys = values
shape = [B, L, d_model] = [1, 2, 4]
token1 = [1, 0, 1, 0]
token2 = [0, 1, 0, 1]
x =
[
[
[1, 0, 1, 0],
[0, 1, 0, 1]
]
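]
How these symbols map onto PatchTST:
text
B = number of samples in the batch (the toy has a single sample)
L / S = number of patch tokens
d_model = hidden representation dimension of each patch token
n_heads = number of attention heads that d_model is split into
head_dim = dimension each head uses when it computes attention on its own
1. Overall flow
Abstract tree: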
text
AttentionLayer.forward
├─ 1. read B L S H
├─ 2. three Linear layers produce Q K V
│ ├─ query_projection
│ ├─ key_projection
│ └─ value_projection
├─ 3. view splits into heads
├─ 4. inner_attention computes attention
│ ├─ scores = QK^T
│ ├─ A = softmax(scale * scores)
│ └─ V_attn = A V
├─ 5. view merges heads
└─ 6. out_projection, then return
2. Actual code
Location:
text
ts_benchmark/baselines/time_series_library/layers/SelfAttention_Family.py
class AttentionLayer.forward
python
def forward(self, queries, keys, values, attn_mask, tau=None, delta=None):
    B, L, _ = queries.shape
    _, S, _ = keys.shape
    H = self.n_heads
    queries = self.query_projection(queries).view(B, L, H, -1)
    keys = self.key_projection(keys).view(B, S, H, -1)
    values = self.value_projection(values).view(B, S, H, -1)
    out, attn = self.inner_attention(
        queries, keys, values, attn_mask, tau=tau, delta=delta
    )
    out = out.view(B, L, -1)
    return self.out_projection(out), attn
Annotated version:
python
def forward(self, queries, keys, values, attn_mask, tau=None, delta=None):
    # queries: [B, L, d_model]
    # keys: [B, S, d_model]
    # values: [B, S, d_model]
    B, L, _ = queries.shape
    _, S, _ = keys.shape
    # H is the number of heads, set in AttentionLayer.__init__(..., n_heads)
    H = self.n_heads
    # Each Linear first maps d_model to n_heads * head_dim,
    # then view reshapes the result to [B, num_tokens, n_heads, head_dim]
    queries = self.query_projection(queries).view(B, L, H, -1)
    keys = self.key_projection(keys).view(B, S, H, -1)
    values = self.value_projection(values).view(B, S, H, -1)
    # The actual attention computation is delegated to inner_attention
    # In PatchTST this is usually FullAttention
    out, attn = self.inner_attention(
        queries, keys, values, attn_mask, tau=tau, delta=delta
    )
    # FullAttention returns [B, L, H, head_dim]
    # Merge H and head_dim back into d_model (n_heads * head_dim)
    out = out.view(B, L, -1)
    # A final output Linear maps back to [B, L, d_model]
    return self.out_projection(out), attn
3. Step 1: read the shapes and n_heads
Corresponding code:
python
B, L, _ = queries.shape
_, S, _ = keys.shape
H = self.n_heads
Substituting the toy values:
text
queries.shape = [1, 2, 4]
keys.shape = [1, 2, 4]
B = 1
L = 2
S = 2
H = 2
H = self.n_heads comes from initialization:
python
self.n_heads = n_heads
So n_heads is not passed in again as an argument inside forward; it is already stored on the current AttentionLayer object.
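As a minimal sketch of that point, the hypothetical class below (ToyAttentionLayer is an illustrative name, not the library class) stores n_heads once in __init__ and only reads it back in forward. Nested lists stand in for tensors.

```python
class ToyAttentionLayer:
    """Hypothetical stand-in for AttentionLayer; only the n_heads bookkeeping is modeled."""

    def __init__(self, d_model, n_heads):
        self.d_model = d_model
        self.n_heads = n_heads  # saved once, at construction time

    def forward(self, queries):
        # queries is a nested list shaped [B, L, d_model]
        B, L = len(queries), len(queries[0])
        H = self.n_heads  # read from the object, not from the call arguments
        return B, L, H


layer = ToyAttentionLayer(d_model=4, n_heads=2)
x = [[[1, 0, 1, 0], [0, 1, 0, 1]]]
print(layer.forward(x))  # (1, 2, 2)
```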
4. Step 2: Linear layers produce Q K V
Corresponding code:
python
queries = self.query_projection(queries).view(B, L, H, -1)
keys = self.key_projection(keys).view(B, S, H, -1)
values = self.value_projection(values).view(B, S, H, -1)
To make the numbers computable by hand, this document sets all three Linear layers to the identity map:
text
Wq = I4
Wk = I4
Wv = I4
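bq = bk = bv = 0
In real training these weights are learnable parameters and will not be identity matrices. The identity is used here only to make the dimensions and the data flow easy to follow.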
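A plain-Python sketch of why the identity choice leaves the tokens untouched. The helper linear is a hypothetical stand-in that follows the nn.Linear convention y = x W^T + b for a single token, and W_swap is an invented weight used only for contrast:

```python
def linear(x, W, b):
    """y = x @ W^T + b for one token x (each row of W produces one output feature)."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]


I4 = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
zero_bias = [0, 0, 0, 0]

tokens = [[1, 0, 1, 0], [0, 1, 0, 1]]
projected = [linear(t, I4, zero_bias) for t in tokens]
print(projected)  # [[1, 0, 1, 0], [0, 1, 0, 1]]

# A non-identity weight (made-up values) does change the token.
W_swap = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
swapped = linear(tokens[0], W_swap, zero_bias)
print(swapped)  # [0, 1, 0, 1]
```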
4.1 query_projection
Input:
text
queries before Linear
shape = [1, 2, 4]
[
[
[1, 0, 1, 0],
[0, 1, 0, 1]
]
]
Linear layer:
text
query_projection: Linear(d_model, d_keys * n_heads)
= Linear(4, 2 * 2)
= Linear(4, 4)
Because Wq = I4:
text
queries after Linear
shape = [1, 2, 4]
[
[
[1, 0, 1, 0],
[0, 1, 0, 1]
]
]
Then:
python
.view(B, L, H, -1)
turns it into:
text
queries after view
shape = [1, 2, 2, 2]
batch index 0 (the only sample):
token1:
head0 = [1, 0]
head1 = [1, 0]
token2:
head0 = [0, 1]
head1 = [0, 1]
Diagram:
text
token1 [1, 0, 1, 0]
├─ head0 [1, 0]
└─ head1 [1, 0]
token2 [0, 1, 0, 1]
├─ head0 [0, 1]
└─ head1 [0, 1]
4.2 key_projection
Because keys = x and Wk = I4:
text
keys after view
shape = [1, 2, 2, 2]
token1:
head0 = [1, 0]
head1 = [1, 0]
token2:
head0 = [0, 1]
head1 = [0, 1]
4.3 value_projection
Because values = x and Wv = I4:
text
values after view
shape = [1, 2, 2, 2]
token1:
head0 = [1, 0]
head1 = [1, 0]
token2:
head0 = [0, 1]
head1 = [0, 1]
At this point AttentionLayer has finished its first core job:
text
Split each token's d_model=4 into n_heads=2 heads, each with head_dim=2.
5. Step 3: enter FullAttention
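What gets handed to the inner attention is the per-head layout built in section 4. As a plain-Python sketch (split_heads is a hypothetical helper, not part of the library), .view(B, L, H, -1) on a contiguous tensor amounts to cutting each token vector into consecutive chunks of head_dim:

```python
def split_heads(token, n_heads):
    """Cut one d_model-long vector into n_heads consecutive chunks of head_dim."""
    head_dim = len(token) // n_heads
    return [token[h * head_dim:(h + 1) * head_dim] for h in range(n_heads)]


x = [[1, 0, 1, 0], [0, 1, 0, 1]]                 # [L, d_model] for the single sample
heads = [split_heads(t, n_heads=2) for t in x]   # [L, H, head_dim]
print(heads)  # [[[1, 0], [1, 0]], [[0, 1], [0, 1]]]
```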
Corresponding code:
python
out, attn = self.inner_attention(
queries, keys, values, attn_mask, tau=tau, delta=delta
)
In PatchTST:
text
self.inner_attention = FullAttention(...)
So this call actually enters:
python
FullAttention.forward(queries, keys, values, attn_mask, tau=None, delta=None)
Key code of FullAttention:
python
def forward(self, queries, keys, values, attn_mask, tau=None, delta=None):
    B, L, H, E = queries.shape
    _, S, _, D = values.shape
    scale = self.scale or 1.0 / sqrt(E)
    scores = torch.einsum("blhe,bshe->bhls", queries, keys)
    if self.mask_flag:
        if attn_mask is None:
            attn_mask = TriangularCausalMask(B, L, device=queries.device)
        scores.masked_fill_(attn_mask.mask, -np.inf)
    A = self.dropout(torch.softmax(scale * scores, dim=-1))
    V = torch.einsum("bhls,bshd->blhd", A, values)
    if self.output_attention:
        return V.contiguous(), A
    else:
        return V.contiguous(), None
Annotated version:
python
def forward(self, queries, keys, values, attn_mask, tau=None, delta=None):
    # queries: [B, L, H, E]
    # keys: [B, S, H, E]
    # values: [B, S, H, D]
    B, L, H, E = queries.shape
    _, S, _, D = values.shape
    # E is the per-head dimension; E=2 in the toy
    scale = self.scale or 1.0 / sqrt(E)
    # For every batch item and every head, take the dot product of each query token with each key token
    # Output scores: [B, H, L, S]
    scores = torch.einsum("blhe,bshe->bhls", queries, keys)
    # In the PatchTST Encoder mask_flag=False, so the toy applies no causal mask
    if self.mask_flag:
        if attn_mask is None:
            attn_mask = TriangularCausalMask(B, L, device=queries.device)
        scores.masked_fill_(attn_mask.mask, -np.inf)
    # For each query token, softmax over the key-token axis
    A = self.dropout(torch.softmax(scale * scores, dim=-1))
    # Weighted sum of values using the attention weights A
    # Output V: [B, L, H, D]
    V = torch.einsum("bhls,bshd->blhd", A, values)
    if self.output_attention:
        return V.contiguous(), A
    else:
        return V.contiguous(), None
6. Step 4: compute scores = QK^T
Corresponding code:
python
scores = torch.einsum("blhe,bshe->bhls", queries, keys)
Dimensions:
text
queries: [B, L, H, E]
keys: [B, S, H, E]
scores: [B, H, L, S]
In the toy:
text
B = 1
L = 2
S = 2
H = 2
E = 2
So:
text
scores.shape = [1, 2, 2, 2]
6.1 Q and K of head0
text
Q_head0 =
[
[1, 0], # token1
[0, 1] # token2
]
K_head0 =
[
[1, 0], # token1
[0, 1] # token2
]
Dot-product matrix:
text
scores_head0 = Q_head0 @ K_head0.T
key1 key2
query1 1 0
query2 0 1
Worked out entry by entry:
text
query1 with key1: [1,0] · [1,0] = 1
query1 with key2: [1,0] · [0,1] = 0
query2 with key1: [0,1] · [1,0] = 0
query2 with key2: [0,1] · [0,1] = 1
6.2 Q and K of head1
Because head1 and head0 hold the same values in the toy:
text
scores_head1 =
key1 key2
query1 1 0
query2 0 1
Putting both heads together:
text
scores =
[
head0:
[
[1, 0],
[0, 1]
],
head1:
[
[1, 0],
[0, 1]
]
]
7. Step 5: scale and softmax to get the attention weights A
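The scores that this step starts from can be reproduced with plain Python lists. The sketch below is a stand-in for the einsum "blhe,bshe->bhls" on the toy tensors with the batch dimension dropped; attention_scores is a hypothetical helper, not library code:

```python
def attention_scores(queries, keys):
    """queries: [L][H][E], keys: [S][H][E] -> scores: [H][L][S]."""
    L, S, H = len(queries), len(keys), len(queries[0])
    return [[[sum(q * k for q, k in zip(queries[l][h], keys[s][h]))
              for s in range(S)]
             for l in range(L)]
            for h in range(H)]


q = [[[1, 0], [1, 0]], [[0, 1], [0, 1]]]   # [L, H, E], same layout as after view
k = q                                      # self-attention with identity projections
scores = attention_scores(q, k)
print(scores)  # [[[1, 0], [0, 1]], [[1, 0], [0, 1]]]
```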
Corresponding code:
python
scale = self.scale or 1.0 / sqrt(E)
A = self.dropout(torch.softmax(scale * scores, dim=-1))
In the toy:
text
E = head_dim = 2
scale = 1 / sqrt(2) ≈ 0.707
For head0:
text
scale * scores_head0 =
key1 key2
query1 0.707 0
query2 0 0.707
Apply softmax to each row:
text
softmax([0.707, 0]) ≈ [0.669, 0.331]
softmax([0, 0.707]) ≈ [0.331, 0.669]
So:
text
A_head0 =
key1 key2
query1 0.669 0.331
query2 0.331 0.669
head1 works the same way:
text
A_head1 =
key1 key2
query1 0.669 0.331
query2 0.331 0.669
Meaning of the attention weights:
text
A[head, query, key]
= within the current head, how much information a given query token takes from each key/value token.
8. Step 6: V = A @ values
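The weights A that this step consumes can be recomputed with the standard library alone. A minimal sketch, assuming scale = 1 / sqrt(E) (that is, self.scale is None) and dropout = 0 as in the toy; softmax_row is a hypothetical helper:

```python
import math


def softmax_row(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


E = 2
scale = 1.0 / math.sqrt(E)
scores_head0 = [[1, 0], [0, 1]]
A_head0 = [softmax_row([scale * s for s in row]) for row in scores_head0]
for row in A_head0:
    print([round(a, 2) for a in row])
# [0.67, 0.33]
# [0.33, 0.67]
```

Each row sums to 1, and the larger weight sits on the key that matches the query.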
Corresponding code:
python
V = torch.einsum("bhls,bshd->blhd", A, values)
Dimensions:
text
A: [B, H, L, S]
values: [B, S, H, D]
V: [B, L, H, D]
8.1 values of head0
text
V_input_head0 =
[
[1, 0], # value token1
[0, 1] # value token2
]
Attention weights of head0:
text
A_head0 =
[
[0.669, 0.331], # weights of query1 on value1/value2
[0.331, 0.669] # weights of query2 on value1/value2
]
Weighted sum:
text
output query1 head0
= 0.669 * [1, 0] + 0.331 * [0, 1]
= [0.669, 0.331]
output query2 head0
= 0.331 * [1, 0] + 0.669 * [0, 1]
= [0.331, 0.669]
So:
text
V_output_head0 =
[
[0.669, 0.331],
[0.331, 0.669]
]
8.2 values of head1
The values of head1:
text
V_input_head1 =
[
[1, 0],
[0, 1]
]
which likewise gives:
text
V_output_head1 =
[
[0.669, 0.331],
[0.331, 0.669]
]
8.3 The out that FullAttention returns to AttentionLayer
FullAttention.forward(...) returns:
text
out = V.contiguous()
shape = [B, L, H, D] = [1, 2, 2, 2]
Values:
text
out =
[
token1:
head0 [0.669, 0.331]
head1 [0.669, 0.331]
token2:
head0 [0.331, 0.669]
head1 [0.331, 0.669]
]
If output_attention=False:
text
attn = None
If output_attention=True:
text
attn = A
shape = [B, H, L, S]
9. Step 7: merge heads
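The out tensor that arrives at this step can be rebuilt from A and values. Below is a plain-Python sketch of the einsum "bhls,bshd->blhd" for the toy, with the batch dimension dropped; apply_attention is a hypothetical helper and A uses the rounded weights from this document:

```python
def apply_attention(A, values):
    """A: [H][L][S], values: [S][H][D] -> out: [L][H][D]."""
    H, L, S = len(A), len(A[0]), len(values)
    D = len(values[0][0])
    return [[[sum(A[h][l][s] * values[s][h][d] for s in range(S))
              for d in range(D)]
             for h in range(H)]
            for l in range(L)]


A = [[[0.669, 0.331], [0.331, 0.669]] for _ in range(2)]   # same weights in both heads
values = [[[1, 0], [1, 0]], [[0, 1], [0, 1]]]               # [S, H, D]
out = apply_attention(A, values)
print(out[0])  # token1: [[0.669, 0.331], [0.669, 0.331]]
print(out[1])  # token2: [[0.331, 0.669], [0.331, 0.669]]
```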
Back in AttentionLayer.forward(...).
Corresponding code:
python
out = out.view(B, L, -1)
Input:
text
out before view
shape = [1, 2, 2, 2]
token1:
head0 [0.669, 0.331]
head1 [0.669, 0.331]
token2:
head0 [0.331, 0.669]
head1 [0.331, 0.669]
Merge the heads:
text
out after view
shape = [1, 2, 4]
token1 = [0.669, 0.331, 0.669, 0.331]
token2 = [0.331, 0.669, 0.331, 0.669]
Diagram:
text
token1:
head0 [0.669, 0.331] + head1 [0.669, 0.331]
-> [0.669, 0.331, 0.669, 0.331]
token2:
head0 [0.331, 0.669] + head1 [0.331, 0.669]
-> [0.331, 0.669, 0.331, 0.669]
10. Step 8: out_projection, then return
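The merged tensor that out_projection receives can be sketched in plain Python: on a contiguous [B, L, H, D] tensor, out.view(B, L, -1) amounts to concatenating each token's heads in order. merge_heads is a hypothetical helper, not library code:

```python
def merge_heads(token_heads):
    """Concatenate one token's [H][D] head outputs into a single H*D vector."""
    return [v for head in token_heads for v in head]


out = [
    [[0.669, 0.331], [0.669, 0.331]],  # token1: head0, head1
    [[0.331, 0.669], [0.331, 0.669]],  # token2: head0, head1
]
merged = [merge_heads(t) for t in out]
print(merged)
# [[0.669, 0.331, 0.669, 0.331], [0.331, 0.669, 0.331, 0.669]]
```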
Corresponding code:
python
return self.out_projection(out), attn
Definition of out_projection:
python
self.out_projection = nn.Linear(d_values * n_heads, d_model)
In the toy:
text
d_values = 2
n_heads = 2
d_values * n_heads = 4
d_model = 4
out_projection = Linear(4, 4)
For hand calculation, set:
text
Wo = I4
bo = 0
So:
text
output = out_projection(out)
shape = [1, 2, 4]
token1 = [0.669, 0.331, 0.669, 0.331]
token2 = [0.331, 0.669, 0.331, 0.669]
Final return:
text
return output, attn
where:
text
output.shape = [B, L, d_model] = [1, 2, 4]
attn = None or [B, H, L, S]
PatchTST then uses this output as the new_x of EncoderLayer:
python
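x = x + self.dropout(new_x)
In other words, AttentionLayer does not carry out the whole EncoderLayer by itself; it is only responsible for:
text
input patch token representations
-> compute the attention-weighted mixing between tokens
-> output new token representations of the same shape
11. One-sentence summary
n_heads does not change the number of tokens; it cuts each token's d_model-dimensional hidden vector into several smaller subspaces, each of which runs attention separately.
In the toy: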
text
[1, 2, 4]
-> Q/K/V Linear + view
-> [1, 2, 2, 2]
-> each head computes scores and A V on its own
-> [1, 2, 2, 2]
-> merge heads
-> [1, 2, 4]
-> out_projection
-> [1, 2, 4]
So the input and output shapes of AttentionLayer.forward(...) are the same:
text
Input: [B, L, d_model]
Output: [B, L, d_model]
But in between, the split
text
d_model -> n_heads * head_dim
lets different heads learn the dependencies between tokens separately, each in its own subspace.
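The whole toy walk-through can be replayed end to end in plain Python. This is a sketch under the toy's assumptions only (a single sample, all four Linear layers set to the identity, no mask, no dropout); toy_attention_layer is a hypothetical function, not the library's AttentionLayer:

```python
import math


def toy_attention_layer(x, n_heads):
    """x: [L][d_model] for one sample; every Linear layer is assumed to be the identity."""
    L, d_model = len(x), len(x[0])
    head_dim = d_model // n_heads
    # Q = K = V = x under identity projections; split each token into heads.
    split = [[tok[h * head_dim:(h + 1) * head_dim] for h in range(n_heads)]
             for tok in x]
    scale = 1.0 / math.sqrt(head_dim)
    output = []
    for l in range(L):
        merged = []
        for h in range(n_heads):
            # Scaled dot-product scores of query l against every key in head h.
            scores = [scale * sum(q * k for q, k in zip(split[l][h], split[s][h]))
                      for s in range(L)]
            top = max(scores)
            exps = [math.exp(v - top) for v in scores]
            weights = [e / sum(exps) for e in exps]
            # Weighted sum of values; appending per head merges the heads in order.
            merged += [sum(weights[s] * split[s][h][d] for s in range(L))
                       for d in range(head_dim)]
        output.append(merged)  # identity out_projection leaves merged unchanged
    return output


x = [[1, 0, 1, 0], [0, 1, 0, 1]]
y = toy_attention_layer(x, n_heads=2)
print(len(y), len(y[0]))            # 2 4
print([round(v, 2) for v in y[0]])  # [0.67, 0.33, 0.67, 0.33]
print([round(v, 2) for v in y[1]])  # [0.33, 0.67, 0.33, 0.67]
```

The output has the same [L, d_model] layout as the input, which is the shape-preserving property summarized above.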