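04D-Layer4-Example-AttentionLayer to return

Location

Parent node: 04B-Layer4-AttentionLayer

Sibling node: 04C-Layer5-FullAttention

This document answers exactly one question: inside AttentionLayer.forward(...), how does x become Q/K/V, how does it enter FullAttention.forward(...), and how does it finally arrive at return self.out_projection(out), attn?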


0. Entry interface

In the PatchTST Encoder, this layer actually runs as self-attention:

python
new_x, attn = self.attention(x, x, x, attn_mask=attn_mask, tau=tau, delta=delta)

So on entry to AttentionLayer.forward(...):

text
queries = x
keys    = x
values  = x

The toy example in this document fixes:

text
B = 1
L = 2
S = 2
d_model = 4
n_heads = 2
head_dim = d_model // n_heads = 2
dropout = 0
mask_flag = False

The input x holds 2 tokens, each with 4 dimensions:

text
x = queries = keys = values
shape = [B, L, d_model] = [1, 2, 4]

token1 = [1, 0, 1, 0]
token2 = [0, 1, 0, 1]

x =
[
  [
    [1, 0, 1, 0],
    [0, 1, 0, 1]
  ]
]
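The toy setup above can be written down as tensors. This is only a minimal sketch, assuming PyTorch is available; the variable names are illustrative and are not taken from the PatchTST code.

```python
import torch

# Toy sizes from the text.
B, L, d_model, n_heads = 1, 2, 4, 2
head_dim = d_model // n_heads

# Two tokens with four dimensions each, shape [B, L, d_model].
x = torch.tensor([[[1.0, 0.0, 1.0, 0.0],
                   [0.0, 1.0, 0.0, 1.0]]])

# Self-attention: the same tensor plays all three roles.
queries, keys, values = x, x, x
S = keys.shape[1]

print(x.shape)    # torch.Size([1, 2, 4])
print(head_dim)   # 2
```

Because queries and keys are the same tensor, L and S are necessarily equal here.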

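How the symbols map onto PatchTST:

text
B        = batch axis (one index per sample in the batch)
L / S    = number of patch tokens (L on the query side, S on the key/value side)
d_model  = hidden dimension of each patch token
n_heads  = number of attention heads that d_model is split into
head_dim = dimension each head uses when it computes attention on its own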

1. Overall flow

Abstract tree:

text
AttentionLayer.forward
├─ 1. read B L S H
├─ 2. three Linear layers produce Q K V
│  ├─ query_projection
│  ├─ key_projection
│  └─ value_projection
├─ 3. view splits into heads
├─ 4. inner_attention computes attention
│  ├─ scores = QK^T
│  ├─ A = softmax(scale * scores)
│  └─ V_attn = A V
├─ 5. view merges heads
└─ 6. out_projection, then return

2. Real code

Location:

text
ts_benchmark/baselines/time_series_library/layers/SelfAttention_Family.py
class AttentionLayer.forward

python
def forward(self, queries, keys, values, attn_mask, tau=None, delta=None):
    B, L, _ = queries.shape
    _, S, _ = keys.shape
    H = self.n_heads

    queries = self.query_projection(queries).view(B, L, H, -1)
    keys = self.key_projection(keys).view(B, S, H, -1)
    values = self.value_projection(values).view(B, S, H, -1)

    out, attn = self.inner_attention(
        queries, keys, values, attn_mask, tau=tau, delta=delta
    )
    out = out.view(B, L, -1)

    return self.out_projection(out), attn

Annotated version:

python
def forward(self, queries, keys, values, attn_mask, tau=None, delta=None):
    # queries: [B, L, d_model]
    # keys:    [B, S, d_model]
    # values:  [B, S, d_model]
    B, L, _ = queries.shape
    _, S, _ = keys.shape

    # H is the number of heads, set in AttentionLayer.__init__(..., n_heads)
    H = self.n_heads

    # Each Linear first maps d_model to n_heads * head_dim,
    # then view reshapes the result to [B, num_tokens, n_heads, head_dim]
    queries = self.query_projection(queries).view(B, L, H, -1)
    keys = self.key_projection(keys).view(B, S, H, -1)
    values = self.value_projection(values).view(B, S, H, -1)

    # The actual attention computation is delegated to inner_attention
    # In PatchTST this is usually FullAttention
    out, attn = self.inner_attention(
        queries, keys, values, attn_mask, tau=tau, delta=delta
    )

    # FullAttention returns [B, L, H, head_dim]
    # Merge H and head_dim back into one axis of size H * head_dim (= d_model here)
    out = out.view(B, L, -1)

    # Final output linear layer, back to [B, L, d_model]
    return self.out_projection(out), attn

3. Step 1: read the shape and n_heads

Corresponding code:

python
B, L, _ = queries.shape
_, S, _ = keys.shape
H = self.n_heads

Plugging in the toy values:

text
queries.shape = [1, 2, 4]
keys.shape    = [1, 2, 4]

B = 1
L = 2
S = 2
H = 2

H = self.n_heads comes from initialization:

python
self.n_heads = n_heads

So n_heads is not passed again to some function inside forward; it is already stored on the current AttentionLayer object.
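The store-once, read-later pattern can be shown in isolation. The class below is a hypothetical stand-in written in plain Python, not the real AttentionLayer; it only mirrors how n_heads travels from the constructor into forward.

```python
class ToyAttentionLayer:
    """Hypothetical stand-in that only mirrors how n_heads is stored."""

    def __init__(self, d_model, n_heads):
        self.d_model = d_model
        # Saved once at construction time; forward() reads it back later.
        self.n_heads = n_heads

    def forward(self, queries_shape):
        B, L, _ = queries_shape
        H = self.n_heads  # no n_heads argument is needed here
        return B, L, H


layer = ToyAttentionLayer(d_model=4, n_heads=2)
print(layer.forward((1, 2, 4)))  # (1, 2, 2)
```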


4. Step 2: Linear layers produce Q K V

Corresponding code:

python
queries = self.query_projection(queries).view(B, L, H, -1)
keys = self.key_projection(keys).view(B, S, H, -1)
values = self.value_projection(values).view(B, S, H, -1)

To keep the arithmetic doable by hand, this document sets all three Linear layers to the identity map:

text
Wq = I4
Wk = I4
Wv = I4
bq = bk = bv = 0

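In real training these weights are learnable parameters and will not be identity matrices. The identity is used here only to make the shapes and the data flow easy to follow.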
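One way to reproduce the identity assumption in code is to overwrite a freshly created nn.Linear. This is a sketch under the toy assumptions (Wq = I4, bq = 0); the local variable query_projection only borrows the attribute name from the text.

```python
import torch
import torch.nn as nn

d_model, n_heads = 4, 2
d_keys = d_model // n_heads

# Same layer shape as in the text: Linear(4, 2 * 2).
query_projection = nn.Linear(d_model, d_keys * n_heads)

# Overwrite the random initialization with Wq = I4 and bq = 0.
with torch.no_grad():
    query_projection.weight.copy_(torch.eye(d_model))
    query_projection.bias.zero_()

x = torch.tensor([[[1.0, 0.0, 1.0, 0.0],
                   [0.0, 1.0, 0.0, 1.0]]])
q = query_projection(x)

print(q.shape)               # torch.Size([1, 2, 4])
print(torch.allclose(q, x))  # True
```

With the identity weights the projection returns its input unchanged, which is what lets the rest of the walkthrough be checked by hand.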

4.1 query_projection

Input:

text
queries before Linear
shape = [1, 2, 4]

[
  [
    [1, 0, 1, 0],
    [0, 1, 0, 1]
  ]
]

Linear layer:

text
query_projection: Linear(d_model, d_keys * n_heads)
                 = Linear(4, 2 * 2)
                 = Linear(4, 4)

Because Wq = I4:

text
queries after Linear
shape = [1, 2, 4]

[
  [
    [1, 0, 1, 0],
    [0, 1, 0, 1]
  ]
]

Then:

python
.view(B, L, H, -1)

turns it into:

text
queries after view
shape = [1, 2, 2, 2]

batch index 0:
token1:
  head0 = [1, 0]
  head1 = [1, 0]

token2:
  head0 = [0, 1]
  head1 = [0, 1]

Diagram:

text
token1 [1, 0, 1, 0]
        ├─ head0 [1, 0]
        └─ head1 [1, 0]

token2 [0, 1, 0, 1]
        ├─ head0 [0, 1]
        └─ head1 [0, 1]
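In the toy input both heads happen to hold the same numbers, which hides which elements land in which head. The sketch below uses distinct values 0..7 (an illustrative choice, not from the source) to make the split visible: view assigns the first head_dim entries of each token to head0 and the next head_dim entries to head1.

```python
import torch

B, L, H = 1, 2, 2

# Distinct values 0..7 so each element can be traced after the reshape.
projected = torch.arange(8.0).view(B, L, 4)
# projected[0] = [[0, 1, 2, 3],
#                 [4, 5, 6, 7]]

heads = projected.view(B, L, H, -1)

print(heads.shape)      # torch.Size([1, 2, 2, 2])
print(heads[0, 0, 0])   # tensor([0., 1.])  token1, head0
print(heads[0, 0, 1])   # tensor([2., 3.])  token1, head1
print(heads[0, 1, 0])   # tensor([4., 5.])  token2, head0
```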

4.2 key_projection

Because keys = x and Wk = I4:

text
keys after view
shape = [1, 2, 2, 2]

token1:
  head0 = [1, 0]
  head1 = [1, 0]

token2:
  head0 = [0, 1]
  head1 = [0, 1]

4.3 value_projection

Because values = x and Wv = I4:

text
values after view
shape = [1, 2, 2, 2]

token1:
  head0 = [1, 0]
  head1 = [1, 0]

token2:
  head0 = [0, 1]
  head1 = [0, 1]

At this point AttentionLayer has finished its first core job:

text
Split each token's d_model=4 into n_heads=2 heads, each with head_dim=2.

5. Step 3: enter FullAttention

Corresponding code:

python
out, attn = self.inner_attention(
    queries, keys, values, attn_mask, tau=tau, delta=delta
)

In PatchTST:

text
self.inner_attention = FullAttention(...)

So execution actually enters:

python
FullAttention.forward(queries, keys, values, attn_mask, tau=None, delta=None)

Key code of FullAttention:

python
def forward(self, queries, keys, values, attn_mask, tau=None, delta=None):
    B, L, H, E = queries.shape
    _, S, _, D = values.shape
    scale = self.scale or 1.0 / sqrt(E)

    scores = torch.einsum("blhe,bshe->bhls", queries, keys)

    if self.mask_flag:
        if attn_mask is None:
            attn_mask = TriangularCausalMask(B, L, device=queries.device)

        scores.masked_fill_(attn_mask.mask, -np.inf)

    A = self.dropout(torch.softmax(scale * scores, dim=-1))
    V = torch.einsum("bhls,bshd->blhd", A, values)

    if self.output_attention:
        return V.contiguous(), A
    else:
        return V.contiguous(), None

Annotated version:

python
def forward(self, queries, keys, values, attn_mask, tau=None, delta=None):
    # queries: [B, L, H, E]
    # keys:    [B, S, H, E]
    # values:  [B, S, H, D]
    B, L, H, E = queries.shape
    _, S, _, D = values.shape

    # E is the per-head dimension; E=2 in the toy
    scale = self.scale or 1.0 / sqrt(E)

    # For every batch and every head, take the dot product of each query token with each key token
    # Output scores: [B, H, L, S]
    scores = torch.einsum("blhe,bshe->bhls", queries, keys)

    # In the PatchTST Encoder mask_flag=False, so the toy applies no causal mask
    if self.mask_flag:
        if attn_mask is None:
            attn_mask = TriangularCausalMask(B, L, device=queries.device)

        scores.masked_fill_(attn_mask.mask, -np.inf)

    # For each query token, softmax over the key-token axis
    A = self.dropout(torch.softmax(scale * scores, dim=-1))

    # Weighted sum of values using the attention weights A
    # Output V: [B, L, H, D]
    V = torch.einsum("bhls,bshd->blhd", A, values)

    if self.output_attention:
        return V.contiguous(), A
    else:
        return V.contiguous(), None

6. Step 4: compute scores = QK^T

Corresponding code:

python
scores = torch.einsum("blhe,bshe->bhls", queries, keys)

Dimensions:

text
queries: [B, L, H, E]
keys:    [B, S, H, E]
scores:  [B, H, L, S]

In the toy:

text
B = 1
L = 2
S = 2
H = 2
E = 2

So:

text
scores.shape = [1, 2, 2, 2]

6.1 Q and K of head0

text
Q_head0 =
[
  [1, 0],   # token1
  [0, 1]    # token2
]

K_head0 =
[
  [1, 0],   # token1
  [0, 1]    # token2
]

Dot-product matrix:

text
scores_head0 = Q_head0 @ K_head0.T

             key1  key2
query1        1     0
query2        0     1

Worked out entry by entry:

text
query1 with key1: [1,0] · [1,0] = 1
query1 with key2: [1,0] · [0,1] = 0
query2 with key1: [0,1] · [1,0] = 0
query2 with key2: [0,1] · [0,1] = 1

6.2 Q and K of head1

Because head1 holds the same values as head0 in the toy:

text
scores_head1 =

             key1  key2
query1        1     0
query2        0     1

Combined:

text
scores =
[
  head0:
    [
      [1, 0],
      [0, 1]
    ],

  head1:
    [
      [1, 0],
      [0, 1]
    ]
]
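The hand-computed score matrices can be checked against the einsum from the source. This sketch assumes the identity projections of the toy, so Q and K are simply x reshaped into heads; the per-head matmul is added only as a cross-check.

```python
import torch

x = torch.tensor([[[1.0, 0.0, 1.0, 0.0],
                   [0.0, 1.0, 0.0, 1.0]]])
B, L, H = 1, 2, 2

# Identity projections, so Q and K are just x split into heads.
queries = x.view(B, L, H, -1)   # [B, L, H, E]
keys = x.view(B, L, H, -1)      # [B, S, H, E]

scores = torch.einsum("blhe,bshe->bhls", queries, keys)

# Per-head cross-check: Q_head0 @ K_head0.T
q_head0 = queries[0, :, 0, :]   # [L, E]
k_head0 = keys[0, :, 0, :]      # [S, E]
manual_head0 = q_head0 @ k_head0.T

print(scores.shape)                              # torch.Size([1, 2, 2, 2])
print(scores[0, 0].tolist())                     # [[1.0, 0.0], [0.0, 1.0]]
print(torch.allclose(scores[0, 0], manual_head0))  # True
```

The subscript string reads as: contract over e (the per-head feature axis) while keeping b and h, which is a batched Q @ K^T per head.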

7. Step 5: scale and softmax produce the attention weights A

Corresponding code:

python
scale = self.scale or 1.0 / sqrt(E)
A = self.dropout(torch.softmax(scale * scores, dim=-1))

In the toy:

text
E = head_dim = 2
scale = 1 / sqrt(2) ≈ 0.707
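The expression self.scale or 1.0 / sqrt(E) relies on Python's `or` returning its right operand when the left one is falsy. The helper below is a hypothetical mirror of that expression, assuming self.scale is left at None in the toy; it also shows a side effect of the idiom, namely that an explicit scale of 0.0 would be treated like None.

```python
from math import sqrt


def resolve_scale(scale, E):
    """Mirror of the expression `self.scale or 1.0 / sqrt(E)`."""
    return scale or 1.0 / sqrt(E)


print(round(resolve_scale(None, 2), 3))  # 0.707
print(resolve_scale(0.5, 2))             # 0.5
print(round(resolve_scale(0.0, 2), 3))   # 0.707, because 0.0 is falsy
```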

For head0:

text
scale * scores_head0 =

             key1   key2
query1      0.707  0
query2      0      0.707

Apply softmax to each row:

text
softmax([0.707, 0]) ≈ [0.669, 0.331]
softmax([0, 0.707]) ≈ [0.331, 0.669]

So:

text
A_head0 =

             key1   key2
query1      0.669  0.331
query2      0.331  0.669

head1 works the same way:

text
A_head1 =

             key1   key2
query1      0.669  0.331
query2      0.331  0.669

Meaning of the attention weights:

text
A[head, query, key]
= within the current head, how much information a given query token takes from each key/value token.
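The row-wise softmax can be reproduced with the standard library alone. This is a sketch of the arithmetic for head0 only; the softmax helper is written here for illustration and is not the torch.softmax call used in FullAttention. The results agree with the three-decimal figures in the text to within about 0.001.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


E = 2
scale = 1.0 / math.sqrt(E)

scores_head0 = [[1.0, 0.0],
                [0.0, 1.0]]

# Scale every score, then normalize each query row over the key axis.
A_head0 = [softmax([scale * s for s in row]) for row in scores_head0]

for row in A_head0:
    print([round(v, 2) for v in row])
# [0.67, 0.33]
# [0.33, 0.67]
```

Each row sums to 1, so a row can be read as a distribution over key/value tokens for one query token.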

8. Step 6: V = A @ values

Corresponding code:

python
V = torch.einsum("bhls,bshd->blhd", A, values)

Dimensions:

text
A:      [B, H, L, S]
values: [B, S, H, D]
V:      [B, L, H, D]

8.1 values of head0

text
V_input_head0 =
[
  [1, 0],   # value token1
  [0, 1]    # value token2
]

Attention weights of head0:

text
A_head0 =
[
  [0.669, 0.331],  # weights of query1 over value1/value2
  [0.331, 0.669]   # weights of query2 over value1/value2
]

Weighted sum:

text
output query1 head0
= 0.669 * [1, 0] + 0.331 * [0, 1]
= [0.669, 0.331]

output query2 head0
= 0.331 * [1, 0] + 0.669 * [0, 1]
= [0.331, 0.669]

So:

text
V_output_head0 =
[
  [0.669, 0.331],
  [0.331, 0.669]
]

8.2 values of head1

values of head1:

text
V_input_head1 =
[
  [1, 0],
  [0, 1]
]

This gives the same result:

text
V_output_head1 =
[
  [0.669, 0.331],
  [0.331, 0.669]
]

8.3 The out that FullAttention returns to AttentionLayer

FullAttention.forward(...) returns:

text
out = V.contiguous()
shape = [B, L, H, D] = [1, 2, 2, 2]

Values:

text
out =
[
  token1:
    head0 [0.669, 0.331]
    head1 [0.669, 0.331]

  token2:
    head0 [0.331, 0.669]
    head1 [0.331, 0.669]
]
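The returned out can be reproduced with the two einsum calls from the source. This is a sketch under the toy assumptions: identity projections, no mask, dropout of 0, and Q = K = V, which is why the values tensor is reused for queries and keys below.

```python
import torch

x = torch.tensor([[[1.0, 0.0, 1.0, 0.0],
                   [0.0, 1.0, 0.0, 1.0]]])
B, L, H = 1, 2, 2
values = x.view(B, L, H, -1)                               # [B, S, H, D]

# Toy shortcut: queries and keys equal values.
scores = torch.einsum("blhe,bshe->bhls", values, values)   # [B, H, L, S]

E = values.shape[-1]
A = torch.softmax(scores / E ** 0.5, dim=-1)               # [B, H, L, S]

# Weighted sum over the key/value axis s, output laid out as [B, L, H, D].
V = torch.einsum("bhls,bshd->blhd", A, values).contiguous()

print(V.shape)      # torch.Size([1, 2, 2, 2])
print(V[0, 0, 0])   # token1, head0
print(V[0, 1, 1])   # token2, head1
```

Because the toy value rows are unit vectors, each output row equals the corresponding row of A.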

If output_attention=False:

text
attn = None

If output_attention=True:

text
attn = A
shape = [B, H, L, S]

9. Step 7: merge heads

Back in AttentionLayer.forward(...):

Corresponding code:

python
out = out.view(B, L, -1)

Input:

text
out before view
shape = [1, 2, 2, 2]

token1:
  head0 [0.669, 0.331]
  head1 [0.669, 0.331]

token2:
  head0 [0.331, 0.669]
  head1 [0.331, 0.669]

Merge the heads:

text
out after view
shape = [1, 2, 4]

token1 = [0.669, 0.331, 0.669, 0.331]
token2 = [0.331, 0.669, 0.331, 0.669]

Diagram:

text
token1:
  concat(head0 [0.669, 0.331], head1 [0.669, 0.331])
  -> [0.669, 0.331, 0.669, 0.331]

token2:
  concat(head0 [0.331, 0.669], head1 [0.331, 0.669])
  -> [0.331, 0.669, 0.331, 0.669]
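The merge is a pure reshape: for a contiguous [B, L, H, D] tensor, view(B, L, -1) lays the heads side by side, which is the same as concatenating them along the last axis. The sketch below uses distinct values 0..7 (illustrative, not from the source) so the ordering is visible.

```python
import torch

B, L, H, D = 1, 2, 2, 2

# Distinct values so the ordering after the merge is visible.
out = torch.arange(8.0).view(B, L, H, D)

merged = out.view(B, L, -1)

# Equivalent formulation: concatenate the heads along the last axis.
concatenated = torch.cat([out[:, :, h, :] for h in range(H)], dim=-1)

print(merged.shape)                       # torch.Size([1, 2, 4])
print(merged[0, 0])                       # tensor([0., 1., 2., 3.])
print(torch.equal(merged, concatenated))  # True
```

view needs a compatible memory layout, which is presumably why FullAttention calls .contiguous() on V before returning it.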

10. Step 8: out_projection, then return

Corresponding code:

python
return self.out_projection(out), attn

Definition of out_projection:

python
self.out_projection = nn.Linear(d_values * n_heads, d_model)

In the toy:

text
d_values = 2
n_heads = 2
d_values * n_heads = 4
d_model = 4

out_projection = Linear(4, 4)

To keep the hand calculation simple, set:

text
Wo = I4
bo = 0

So:

text
output = out_projection(out)
shape = [1, 2, 4]

token1 = [0.669, 0.331, 0.669, 0.331]
token2 = [0.331, 0.669, 0.331, 0.669]

Final return:

text
return output, attn

Where:

text
output.shape = [B, L, d_model] = [1, 2, 4]
attn = None, or a tensor of shape [B, H, L, S]

PatchTST then uses this output as new_x in EncoderLayer:

python
x = x + self.dropout(new_x)

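In other words, AttentionLayer does not make up the whole EncoderLayer by itself; it is only responsible for:

text
take the patch token representations as input
-> compute the attention-weighted mixing between tokens
-> output new token representations with the same shape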
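The residual step works only because the attention output keeps the input shape. A minimal sketch, assuming the toy setting dropout = 0 also applies to the EncoderLayer dropout and copying new_x from the hand calculation:

```python
import torch
import torch.nn as nn

x = torch.tensor([[[1.0, 0.0, 1.0, 0.0],
                   [0.0, 1.0, 0.0, 1.0]]])

# Output of the toy AttentionLayer, copied from the hand calculation.
new_x = torch.tensor([[[0.669, 0.331, 0.669, 0.331],
                       [0.331, 0.669, 0.331, 0.669]]])

dropout = nn.Dropout(p=0.0)  # assumption: toy setting dropout = 0

# Residual connection: element-wise add, so both shapes must match.
x_next = x + dropout(new_x)

print(x_next.shape)        # torch.Size([1, 2, 4])
print(x_next[0, 0].tolist())
```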

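11. One-sentence summary

n_heads does not change the number of tokens; it slices each token's d_model-dimensional hidden vector into several smaller subspaces, and attention runs separately in each one.

In the toy: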

text
[1, 2, 4]
-> Q/K/V Linear
-> [1, 2, 2, 2]
-> each head computes scores and A V on its own
-> [1, 2, 2, 2]
-> merge heads
-> [1, 2, 4]
-> out_projection
-> [1, 2, 4]

So AttentionLayer.forward(...) leaves the shape unchanged between input and output:

text
Input:  [B, L, d_model]
Output: [B, L, d_model]

In between, the split

text
d_model -> n_heads * head_dim

lets each head learn token-to-token dependencies in its own subspace.
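The whole shape trace can be replayed in one function. This is a hypothetical functional re-creation of the walkthrough under the toy assumptions (identity weights, no mask, no dropout); identity_linear and toy_attention_layer are names invented here, not part of the PatchTST code.

```python
import math
import torch
import torch.nn as nn


def identity_linear(dim):
    """Linear(dim, dim) with W = I and b = 0, as assumed in the toy."""
    layer = nn.Linear(dim, dim)
    with torch.no_grad():
        layer.weight.copy_(torch.eye(dim))
        layer.bias.zero_()
    return layer


def toy_attention_layer(x, n_heads):
    """Replay the walkthrough and record the shape after each stage."""
    B, L, d_model = x.shape
    S, H = L, n_heads
    shapes = [tuple(x.shape)]

    q_proj, k_proj, v_proj, o_proj = (identity_linear(d_model) for _ in range(4))

    # Project, then split d_model into H heads.
    q = q_proj(x).view(B, L, H, -1)
    k = k_proj(x).view(B, S, H, -1)
    v = v_proj(x).view(B, S, H, -1)
    shapes.append(tuple(q.shape))

    # Scaled dot-product attention per head.
    scale = 1.0 / math.sqrt(q.shape[-1])
    scores = torch.einsum("blhe,bshe->bhls", q, k)
    A = torch.softmax(scale * scores, dim=-1)
    out = torch.einsum("bhls,bshd->blhd", A, v).contiguous()
    shapes.append(tuple(out.shape))

    # Merge heads, then apply the output projection.
    out = out.view(B, L, -1)
    shapes.append(tuple(out.shape))
    out = o_proj(out)
    shapes.append(tuple(out.shape))
    return out, A, shapes


x = torch.tensor([[[1.0, 0.0, 1.0, 0.0],
                   [0.0, 1.0, 0.0, 1.0]]])
output, attn, shapes = toy_attention_layer(x, n_heads=2)

for s in shapes:
    print(s)
# (1, 2, 4)
# (1, 2, 2, 2)
# (1, 2, 2, 2)
# (1, 2, 4)
# (1, 2, 4)
```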
