04D-Layer4-Example-AttentionLayer to return

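Location

Parent node: 04B-Layer4-AttentionLayer
Sibling node: 04C-Layer5-FullAttention
This document answers a single question: inside AttentionLayer.forward(...), how x becomes Q/K/V, how it enters FullAttention.forward(...), and how it finally comes back to return self.out_projection(out), attn.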

0. Entry interface

In PatchTST's Encoder, this layer is called as self-attention:

python

new_x, attn = self.attention(x, x, x, attn_mask=attn_mask, tau=tau, delta=delta)

So on entry to AttentionLayer.forward(...):

text

queries = x
keys    = x
values  = x

The toy example in this document fixes:

text

B = 1
L = 2
S = 2
d_model = 4
n_heads = 2
head_dim = d_model // n_heads = 2
dropout = 0
mask_flag = False

The input x has 2 tokens, each with 4 dimensions:

text

x = queries = keys = values
shape = [B, L, d_model] = [1, 2, 4]

token1 = [1, 0, 1, 0]
token2 = [0, 1, 0, 1]

x =
[
  [
    [1, 0, 1, 0],
    [0, 1, 0, 1]
  ]
]

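How these names map onto PatchTST:

text

B       = batch dimension (one index per sample in the batch)
L / S   = number of patch tokens
d_model = hidden-representation dimension of each patch token
n_heads = number of attention heads that d_model is split into
head_dim = dimension each head uses when it computes attention on its own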

1. Overall flow

Abstract tree:

text

AttentionLayer.forward
├─ 1. Read B L S H
├─ 2. Three Linear layers produce Q K V
│  ├─ query_projection
│  ├─ key_projection
│  └─ value_projection
├─ 3. view splits into heads
├─ 4. inner_attention computes attention
│  ├─ scores = QK^T
│  ├─ A = softmax(scale * scores)
│  └─ V_attn = A V
├─ 5. view merges the heads
└─ 6. out_projection, then return

2. Actual code

Location:

text

ts_benchmark/baselines/time_series_library/layers/SelfAttention_Family.py
class AttentionLayer.forward

python

def forward(self, queries, keys, values, attn_mask, tau=None, delta=None):
    B, L, _ = queries.shape
    _, S, _ = keys.shape
    H = self.n_heads

    queries = self.query_projection(queries).view(B, L, H, -1)
    keys = self.key_projection(keys).view(B, S, H, -1)
    values = self.value_projection(values).view(B, S, H, -1)

    out, attn = self.inner_attention(
        queries, keys, values, attn_mask, tau=tau, delta=delta
    )
    out = out.view(B, L, -1)

    return self.out_projection(out), attn

Commented version:

python

def forward(self, queries, keys, values, attn_mask, tau=None, delta=None):
    # queries: [B, L, d_model]
    # keys:    [B, S, d_model]
    # values:  [B, S, d_model]
    B, L, _ = queries.shape
    _, S, _ = keys.shape

    # H is the number of heads, taken from AttentionLayer.__init__(..., n_heads)
    H = self.n_heads

    # Each Linear first maps d_model to n_heads * head_dim,
    # then view reshapes the result to [B, num_tokens, n_heads, head_dim]
    queries = self.query_projection(queries).view(B, L, H, -1)
    keys = self.key_projection(keys).view(B, S, H, -1)
    values = self.value_projection(values).view(B, S, H, -1)

    # The attention computation itself is delegated to inner_attention
    # In PatchTST this is usually FullAttention
    out, attn = self.inner_attention(
        queries, keys, values, attn_mask, tau=tau, delta=delta
    )

    # FullAttention returns [B, L, H, head_dim]
    # Here H and head_dim are merged back into d_model
    out = out.view(B, L, -1)

    # Finally an output linear layer maps back to [B, L, d_model]
    return self.out_projection(out), attn

3. Step 1: read the shape and n_heads

Corresponding code:

python

B, L, _ = queries.shape
_, S, _ = keys.shape
H = self.n_heads

Plugging in the toy values:

text

queries.shape = [1, 2, 4]
keys.shape    = [1, 2, 4]

B = 1
L = 2
S = 2
H = 2

Here H = self.n_heads comes from initialization:

python

self.n_heads = n_heads

So n_heads is not passed in again as an argument inside forward; it is already stored on the current AttentionLayer object.

4. Step 2: Linear layers produce Q K V

Corresponding code:

python

queries = self.query_projection(queries).view(B, L, H, -1)
keys = self.key_projection(keys).view(B, S, H, -1)
values = self.value_projection(values).view(B, S, H, -1)

To keep the arithmetic doable by hand, this document sets all three Linear layers to the identity map:

text

Wq = I4
Wk = I4
Wv = I4
bq = bk = bv = 0

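In real training these weights are learnable parameters and will not be identity matrices. The identity matrix is used here only to make the dimensions and the data flow easy to follow.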
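The identity setup can be reproduced in PyTorch. The sketch below is only an illustration: it builds a standalone nn.Linear with the same in/out sizes as query_projection and overwrites its randomly initialised parameters with Wq = I4 and bq = 0. The standalone variable names are assumptions for the toy, not part of the library code.

```python
import torch
import torch.nn as nn

d_model, n_heads = 4, 2
d_keys = d_model // n_heads  # head_dim = 2

# Same in/out sizes as query_projection: Linear(d_model, d_keys * n_heads)
query_projection = nn.Linear(d_model, d_keys * n_heads)

# Toy assumption only: replace the random initial parameters with Wq = I4, bq = 0
with torch.no_grad():
    query_projection.weight.copy_(torch.eye(4))
    query_projection.bias.zero_()

x = torch.tensor([[[1., 0., 1., 0.],
                   [0., 1., 0., 1.]]])  # [B, L, d_model] = [1, 2, 4]

q = query_projection(x)
print(q.shape)            # torch.Size([1, 2, 4])
print(torch.equal(q, x))  # True: an identity Linear leaves x unchanged
```

With a trained layer the output would differ from x, but the shape [1, 2, 4] would be the same.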

4.1 query_projection

Input:

text

queries before Linear
shape = [1, 2, 4]

[
  [
    [1, 0, 1, 0],
    [0, 1, 0, 1]
  ]
]

Linear layer:

text

query_projection: Linear(d_model, d_keys * n_heads)
                 = Linear(4, 2 * 2)
                 = Linear(4, 4)

Because Wq = I4:

text

queries after Linear
shape = [1, 2, 4]

[
  [
    [1, 0, 1, 0],
    [0, 1, 0, 1]
  ]
]

Then:

python

.view(B, L, H, -1)

which gives:

text

queries after view
shape = [1, 2, 2, 2]

B=0:
token1:
  head0 = [1, 0]
  head1 = [1, 0]

token2:
  head0 = [0, 1]
  head1 = [0, 1]

Diagram:

text

token1 [1, 0, 1, 0]
        ├─ head0 [1, 0]
        └─ head1 [1, 0]

token2 [0, 1, 0, 1]
        ├─ head0 [0, 1]
        └─ head1 [0, 1]
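The split shown in the diagram can be checked directly. This is a small sketch using the toy tensor; the variable names q_flat and q_heads are illustrative only.

```python
import torch

B, L, H = 1, 2, 2
q_flat = torch.tensor([[[1., 0., 1., 0.],
                        [0., 1., 0., 1.]]])  # [B, L, d_model] = [1, 2, 4]

# -1 is inferred as d_model // H = 2, so the last axis is cut into H chunks
q_heads = q_flat.view(B, L, H, -1)

print(q_heads.shape)            # torch.Size([1, 2, 2, 2])
print(q_heads[0, 0].tolist())   # token1 -> [[1.0, 0.0], [1.0, 0.0]]
print(q_heads[0, 1].tolist())   # token2 -> [[0.0, 1.0], [0.0, 1.0]]
```

Note that view does not move any data: the first head_dim entries of each token become head0 and the next head_dim entries become head1.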

4.2 key_projection

Because keys = x and Wk = I4:

text

keys after view
shape = [1, 2, 2, 2]

token1:
  head0 = [1, 0]
  head1 = [1, 0]

token2:
  head0 = [0, 1]
  head1 = [0, 1]

4.3 value_projection

Because values = x and Wv = I4:

text

values after view
shape = [1, 2, 2, 2]

token1:
  head0 = [1, 0]
  head1 = [1, 0]

token2:
  head0 = [0, 1]
  head1 = [0, 1]

At this point AttentionLayer has finished its first core job:

text

Split each token's d_model=4 into n_heads=2 heads, each with head_dim=2.

5. Step 3: enter FullAttention

Corresponding code:

python

out, attn = self.inner_attention(
    queries, keys, values, attn_mask, tau=tau, delta=delta
)

In PatchTST:

text

self.inner_attention = FullAttention(...)

So execution actually enters:

python

FullAttention.forward(queries, keys, values, attn_mask, tau=None, delta=None)

Key code of FullAttention:

python

def forward(self, queries, keys, values, attn_mask, tau=None, delta=None):
    B, L, H, E = queries.shape
    _, S, _, D = values.shape
    scale = self.scale or 1.0 / sqrt(E)

    scores = torch.einsum("blhe,bshe->bhls", queries, keys)

    if self.mask_flag:
        if attn_mask is None:
            attn_mask = TriangularCausalMask(B, L, device=queries.device)

        scores.masked_fill_(attn_mask.mask, -np.inf)

    A = self.dropout(torch.softmax(scale * scores, dim=-1))
    V = torch.einsum("bhls,bshd->blhd", A, values)

    if self.output_attention:
        return V.contiguous(), A
    else:
        return V.contiguous(), None

Commented version:

python

def forward(self, queries, keys, values, attn_mask, tau=None, delta=None):
    # queries: [B, L, H, E]
    # keys:    [B, S, H, E]
    # values:  [B, S, H, D]
    B, L, H, E = queries.shape
    _, S, _, D = values.shape

    # E is the per-head dimension; E=2 in the toy
    scale = self.scale or 1.0 / sqrt(E)

    # For every batch and every head, take the dot product of each query token with each key token
    # Output scores: [B, H, L, S]
    scores = torch.einsum("blhe,bshe->bhls", queries, keys)

    # In the PatchTST Encoder mask_flag=False, so the toy applies no causal mask
    if self.mask_flag:
        if attn_mask is None:
            attn_mask = TriangularCausalMask(B, L, device=queries.device)

        scores.masked_fill_(attn_mask.mask, -np.inf)

    # For each query token, softmax along the key-token axis
    A = self.dropout(torch.softmax(scale * scores, dim=-1))

    # Weighted sum of values using the attention weights A
    # Output V: [B, L, H, D]
    V = torch.einsum("bhls,bshd->blhd", A, values)

    if self.output_attention:
        return V.contiguous(), A
    else:
        return V.contiguous(), None

6. Step 4: compute scores = QK^T

Corresponding code:

python

scores = torch.einsum("blhe,bshe->bhls", queries, keys)

Dimensions:

text

queries: [B, L, H, E]
keys:    [B, S, H, E]
scores:  [B, H, L, S]

In the toy:

text

B = 1
L = 2
S = 2
H = 2
E = 2

So:

text

scores.shape = [1, 2, 2, 2]

6.1 Q and K of head0

text

Q_head0 =
[
  [1, 0],   # token1
  [0, 1]    # token2
]

K_head0 =
[
  [1, 0],   # token1
  [0, 1]    # token2
]

Dot-product matrix:

text

scores_head0 = Q_head0 @ K_head0.T

             key1  key2
query1        1     0
query2        0     1

Worked out entry by entry:

text

query1 vs key1: [1,0] · [1,0] = 1
query1 vs key2: [1,0] · [0,1] = 0
query2 vs key1: [0,1] · [1,0] = 0
query2 vs key2: [0,1] · [0,1] = 1

6.2 Q and K of head1

Because head1 holds the same values as head0 in the toy:

text

scores_head1 =

             key1  key2
query1        1     0
query2        0     1

Put together:

text

scores =
[
  head0:
    [
      [1, 0],
      [0, 1]
    ],

  head1:
    [
      [1, 0],
      [0, 1]
    ]
]
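The einsum call can be cross-checked against an explicit per-head matrix product. The sketch below assumes the toy tensors (Q equal to K because all projections are the identity) and compares the einsum result with a permute-then-matmul formulation; scores_matmul is a name introduced here for illustration.

```python
import torch

q = torch.tensor([[[1., 0., 1., 0.],
                   [0., 1., 0., 1.]]]).view(1, 2, 2, 2)  # [B, L, H, E]
k = q.clone()  # toy self-attention with identity projections: K equals Q

# Same subscripts as in FullAttention: contract over e, keep b, h, l, s
scores = torch.einsum("blhe,bshe->bhls", q, k)           # [B, H, L, S]

# Equivalent form: bring H in front of the token axis, then Q @ K^T per head
scores_matmul = q.permute(0, 2, 1, 3) @ k.permute(0, 2, 3, 1)

print(scores.shape)                          # torch.Size([1, 2, 2, 2])
print(scores[0, 0].tolist())                 # head0 -> [[1.0, 0.0], [0.0, 1.0]]
print(torch.allclose(scores, scores_matmul)) # True
```

The einsum form avoids the explicit permutes, which is presumably why the library code uses it.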

7. Step 5: scale and softmax give the attention weights A

Corresponding code:

python

scale = self.scale or 1.0 / sqrt(E)
A = self.dropout(torch.softmax(scale * scores, dim=-1))

In the toy:

text

E = head_dim = 2
scale = 1 / sqrt(2) ≈ 0.707

For head0:

text

scale * scores_head0 =

             key1   key2
query1      0.707  0
query2      0      0.707

Apply softmax to each row:

text

softmax([0.707, 0]) ≈ [0.670, 0.330]
softmax([0, 0.707]) ≈ [0.330, 0.670]

So:

text

A_head0 =

             key1   key2
query1      0.670  0.330
query2      0.330  0.670

head1 likewise:

text

A_head1 =

             key1   key2
query1      0.670  0.330
query2      0.330  0.670

Meaning of the attention weights:

text

A[head, query, key]
= within the current head, how much information a given query token should take from each key/value token.
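The row-wise softmax can be reproduced with the standard library alone. The sketch below is a hand-rolled softmax for checking the arithmetic, not the torch.softmax implementation; configured_scale is a hypothetical stand-in for self.scale when no scale was configured.

```python
from math import exp, sqrt

def softmax(row):
    """Numerically stable softmax over a plain Python list."""
    m = max(row)
    exps = [exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

E = 2
configured_scale = None                     # stand-in for an unset self.scale
scale = configured_scale or 1.0 / sqrt(E)   # `or` falls back to 1/sqrt(E) ≈ 0.707

scores_head0 = [[1.0, 0.0],
                [0.0, 1.0]]
A_head0 = [softmax([scale * s for s in row]) for row in scores_head0]

for row in A_head0:
    print([round(a, 3) for a in row])
# [0.67, 0.33]
# [0.33, 0.67]
```

Each row sums to 1, and the larger weight sits on the key whose score was 1.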

8. Step 6: V = A @ values

Corresponding code:

python

V = torch.einsum("bhls,bshd->blhd", A, values)

Dimensions:

text

A:      [B, H, L, S]
values: [B, S, H, D]
V:      [B, L, H, D]

8.1 values of head0

text

V_input_head0 =
[
  [1, 0],   # value token1
  [0, 1]    # value token2
]

Attention weights of head0:

text

A_head0 =
[
  [0.670, 0.330],  # weights of query1 on value1/value2
  [0.330, 0.670]   # weights of query2 on value1/value2
]

Weighted sum:

text

output query1 head0
= 0.670 * [1, 0] + 0.330 * [0, 1]
= [0.670, 0.330]

output query2 head0
= 0.330 * [1, 0] + 0.670 * [0, 1]
= [0.330, 0.670]

So:

text

V_output_head0 =
[
  [0.670, 0.330],
  [0.330, 0.670]
]
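The weighted sum for one head can be written out in plain Python. This sketch recomputes the two attention weights from the scaled scores and then mixes the value rows; the helper name attend is hypothetical and only mirrors what the einsum does for a single head.

```python
from math import exp, sqrt

s = 1.0 / sqrt(2)                 # scaled score of a token with itself
hi = exp(s) / (exp(s) + 1.0)      # softmax weight on the matching token
lo = 1.0 - hi                     # softmax weight on the other token

A_head0 = [[hi, lo],
           [lo, hi]]
V_input_head0 = [[1.0, 0.0],      # value token1
                 [0.0, 1.0]]      # value token2

def attend(weights, values):
    """For each query row, return the weighted sum of the value rows."""
    dim = len(values[0])
    return [
        [sum(w * v[d] for w, v in zip(w_row, values)) for d in range(dim)]
        for w_row in weights
    ]

V_output_head0 = attend(A_head0, V_input_head0)
print([[round(v, 3) for v in row] for row in V_output_head0])
# [[0.67, 0.33], [0.33, 0.67]]
```

Because the toy values are one-hot rows, the output of each query is simply its own row of attention weights.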

8.2 values of head1

values of head1:

text

V_input_head1 =
[
  [1, 0],
  [0, 1]
]

This gives the same result:

text

V_output_head1 =
[
  [0.670, 0.330],
  [0.330, 0.670]
]

8.3 The out that FullAttention returns to AttentionLayer

FullAttention.forward(...) returns:

text

out = V.contiguous()
shape = [B, L, H, D] = [1, 2, 2, 2]

Values:

text

out =
[
  token1:
    head0 [0.670, 0.330]
    head1 [0.670, 0.330]

  token2:
    head0 [0.330, 0.670]
    head1 [0.330, 0.670]
]

If output_attention=False:

text

attn = None

If output_attention=True:

text

attn = A
shape = [B, H, L, S]

9. Step 7: merge the heads

Back in AttentionLayer.forward(...).

Corresponding code:

python

out = out.view(B, L, -1)

Input:

text

out before view
shape = [1, 2, 2, 2]

token1:
  head0 [0.670, 0.330]
  head1 [0.670, 0.330]

token2:
  head0 [0.330, 0.670]
  head1 [0.330, 0.670]

Merge the heads:

text

out after view
shape = [1, 2, 4]

token1 = [0.670, 0.330, 0.670, 0.330]
token2 = [0.330, 0.670, 0.330, 0.670]

Diagram:

text

token1:
  head0 [0.670, 0.330] + head1 [0.670, 0.330]
  -> [0.670, 0.330, 0.670, 0.330]

token2:
  head0 [0.330, 0.670] + head1 [0.330, 0.670]
  -> [0.330, 0.670, 0.330, 0.670]
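The merge is a pure reshape, which the sketch below illustrates with distinct placeholder values (0 to 7) so the ordering is visible. It also shows, as an aside, that view only works when the requested shape is compatible with the tensor's memory layout; this is a likely reason the quoted FullAttention code returns V.contiguous() before the view, though the source does not state the motivation.

```python
import torch

# Placeholder values in [B, L, H, D] = [1, 2, 2, 2] layout
out = torch.arange(8.0).view(1, 2, 2, 2)
merged = out.view(1, 2, -1)              # [B, L, H * D] = [1, 2, 4]
print(merged[0, 0].tolist())             # [0.0, 1.0, 2.0, 3.0]: head0 then head1

# A permuted tensor is a strided view and is no longer contiguous
permuted = out.permute(0, 2, 1, 3)
print(permuted.is_contiguous())          # False
try:
    permuted.view(1, 2, -1)
    view_failed = False
except RuntimeError:
    view_failed = True
print(view_failed)                       # True

# Copying into contiguous memory makes the view legal again
fixed = permuted.contiguous().view(1, 2, -1)
print(fixed.shape)                       # torch.Size([1, 2, 4])
```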

10. Step 8: out_projection, then return

Corresponding code:

python

return self.out_projection(out), attn

Definition of out_projection:

python

self.out_projection = nn.Linear(d_values * n_heads, d_model)

In the toy:

text

d_values = 2
n_heads = 2
d_values * n_heads = 4
d_model = 4

out_projection = Linear(4, 4)

For hand calculation, set:

text

Wo = I4
bo = 0

So:

text

output = out_projection(out)
shape = [1, 2, 4]

token1 = [0.670, 0.330, 0.670, 0.330]
token2 = [0.330, 0.670, 0.330, 0.670]

Final return:

text

return output, attn

Where:

text

output.shape = [B, L, d_model] = [1, 2, 4]
attn = None or [B, H, L, S]

PatchTST then uses this output as new_x in the EncoderLayer:

python

x = x + self.dropout(new_x)

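In other words, AttentionLayer does not implement the whole EncoderLayer. It is only responsible for:

text

input patch token representations
-> compute the attention-weighted mixing between tokens
-> output new token representations of the same shape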
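The whole walk-through can be run end to end. The sketch below is a stripped-down re-implementation written for this toy, not the library classes: ToyFullAttention and ToyAttentionLayer are hypothetical names, masking and dropout are omitted (mask_flag=False, dropout=0), head_dim is assumed to be d_model // n_heads, and all four Linear layers are forced to the identity.

```python
from math import sqrt

import torch
import torch.nn as nn


class ToyFullAttention(nn.Module):
    """Hypothetical stand-in for FullAttention without mask or dropout."""

    def __init__(self, output_attention=True):
        super().__init__()
        self.output_attention = output_attention

    def forward(self, queries, keys, values, attn_mask, tau=None, delta=None):
        E = queries.shape[-1]
        scores = torch.einsum("blhe,bshe->bhls", queries, keys)
        A = torch.softmax(scores / sqrt(E), dim=-1)
        V = torch.einsum("bhls,bshd->blhd", A, values)
        return V.contiguous(), (A if self.output_attention else None)


class ToyAttentionLayer(nn.Module):
    """Hypothetical stand-in mirroring the forward shown in section 2."""

    def __init__(self, attention, d_model, n_heads):
        super().__init__()
        head_dim = d_model // n_heads  # assumption for the toy
        self.inner_attention = attention
        self.query_projection = nn.Linear(d_model, head_dim * n_heads)
        self.key_projection = nn.Linear(d_model, head_dim * n_heads)
        self.value_projection = nn.Linear(d_model, head_dim * n_heads)
        self.out_projection = nn.Linear(head_dim * n_heads, d_model)
        self.n_heads = n_heads

    def forward(self, queries, keys, values, attn_mask, tau=None, delta=None):
        B, L, _ = queries.shape
        _, S, _ = keys.shape
        H = self.n_heads
        queries = self.query_projection(queries).view(B, L, H, -1)
        keys = self.key_projection(keys).view(B, S, H, -1)
        values = self.value_projection(values).view(B, S, H, -1)
        out, attn = self.inner_attention(
            queries, keys, values, attn_mask, tau=tau, delta=delta
        )
        out = out.view(B, L, -1)
        return self.out_projection(out), attn


layer = ToyAttentionLayer(ToyFullAttention(), d_model=4, n_heads=2)
with torch.no_grad():
    # Toy assumption: every projection is the identity with zero bias
    for proj in (layer.query_projection, layer.key_projection,
                 layer.value_projection, layer.out_projection):
        proj.weight.copy_(torch.eye(4))
        proj.bias.zero_()

    x = torch.tensor([[[1., 0., 1., 0.],
                       [0., 1., 0., 1.]]])
    output, attn = layer(x, x, x, attn_mask=None)

print(output.shape, attn.shape)  # torch.Size([1, 2, 4]) torch.Size([1, 2, 2, 2])
print([round(v, 3) for v in output[0, 0].tolist()])  # [0.67, 0.33, 0.67, 0.33]
print([round(v, 3) for v in output[0, 1].tolist()])  # [0.33, 0.67, 0.33, 0.67]
```

The input and output shapes match, and the printed rows agree with the hand calculation up to rounding.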

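11. One-sentence summary

n_heads does not change the number of tokens. It slices each token's d_model-dimensional hidden vector into several smaller subspaces, and attention is computed separately in each one.

In the toy: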

text

[1, 2, 4]
-> Q/K/V Linear
-> [1, 2, 2, 2]
-> each head computes its own scores and A V
-> [1, 2, 2, 2]
-> merge heads
-> [1, 2, 4]
-> out_projection
-> [1, 2, 4]

So the input and output shapes of AttentionLayer.forward(...) are the same:

text

Input:  [B, L, d_model]
Output: [B, L, d_model]

But in between, the split:

text

d_model -> n_heads * head_dim

lets different heads learn token-to-token dependencies in different subspaces.
