A Step-by-Step Close Reading of the Encoder Instance: PatchTST Layer2B

Abstract

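This note supplements the Encoder instance covered in modelread/PatchTST/04-Layer2B-Encoder.md.
It is not a general tour of the Transformer. It answers three concrete questions: which objects does self.encoder in PatchTST actually instantiate, what does each step of forward do, and how do the concrete shapes and the numeric intuition for a single token change along the way?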

0. File index

Item	Content
Parent document	`zdocs/modelread/PatchTST/04-Layer2B-Encoder.md`
Source 1	`models/PatchTST.py -> PatchTST.__init__ / forecast`
Source 2	`layers/Transformer_EncDec.py -> Encoder / EncoderLayer`
Source 3	`layers/SelfAttention_Family.py -> AttentionLayer / FullAttention`
Input	`enc_out = (B*C, patch_num, d_model) = (8,6,16)`
Output	`enc_out = (8,6,16)`, `attns=[None]`

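1. Toy parameters

This note reuses the global toy configuration from the PatchTST modelread:

Symbol	Value	Meaning
`B`	2	batch size
`enc_in=C`	4	number of variables
`B*C`	8	number of independent sequences after the channel-independent fold
`patch_num`	6	number of patch tokens
`d_model`	16	token embedding dimension
`n_heads`	2	number of attention heads
`d_keys=d_values`	8	per-head Q/K/V dimension (`d_model / n_heads`)
`d_ff`	64	FFN hidden width
`e_layers`	1	number of EncoderLayer blocks

Before entering the Encoder: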

text

enc_out.shape = (8, 6, 16)

Semantics:

text

8  = B*C, i.e. batch size 2 × 4 variables
6  = number of patch tokens cut from each variable's series
16 = d_model-dimensional representation of each patch token
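
The fold of variables into the batch dimension can be checked with a short sketch. It assumes PyTorch is available and uses a random tensor as a stand-in for the embedded patches; the names `per_variable` and `folded` are illustrative and do not come from the PatchTST source.

```python
import torch

B, C, patch_num, d_model = 2, 4, 6, 16

# Stand-in for embedded patches, still separated by batch item and variable.
per_variable = torch.randn(B, C, patch_num, d_model)

# Channel independence: merge (B, C) into one leading axis of size B*C.
folded = per_variable.reshape(B * C, patch_num, d_model)

# Row b*C + c of the folded tensor is variable c of batch item b.
b, c = 1, 2
row = b * C + c
same_sequence = torch.equal(folded[row], per_variable[b, c])
print(folded.shape, row, same_sequence)
```

Because the fold is a plain reshape, each of the 8 rows is one univariate patch sequence, and nothing downstream can tell which rows came from the same batch item.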

2. Instantiation first: what exactly is self.encoder

Source location: PatchTST.py -> PatchTST.__init__

python

self.encoder = Encoder(
    [
        EncoderLayer(
            AttentionLayer(
                FullAttention(
                    False,
                    config.factor,
                    attention_dropout=config.dropout,
                    output_attention=config.output_attention,
                ),
                config.d_model,
                config.n_heads,
            ),
            config.d_model,
            config.d_ff,
            dropout=config.dropout,
            activation=config.activation,
        )
        for l in range(config.e_layers)
    ],
    norm_layer=torch.nn.LayerNorm(config.d_model),
)

Diagram:

![[zdocs/pytorch-basics/assets/patchtst_encoder_instance_init.svg]]

This code builds the objects from the inside out:

Level	Instance	Toy configuration	What it does in forward
1	`FullAttention(False, ...)`	`mask_flag=False`	computes `QK^T -> softmax -> A@V`
2	`AttentionLayer(FullAttention, 16, 2)`	`d_model=16, n_heads=2`	Linear layers produce Q/K/V, split them into heads, and call FullAttention
3	`EncoderLayer(AttentionLayer, 16, 64)`	`d_ff=64`	attention residual + FFN residual
4	`Encoder([EncoderLayer], norm_layer=LayerNorm(16))`	`e_layers=1`	loops over the single layer, then applies the final LayerNorm

In short:

text

Encoder is the scheduling shell;
EncoderLayer is the Transformer block;
AttentionLayer is the multi-head format-conversion shell;
FullAttention is the actual attention math.
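
As a rough sanity check on what these four nested objects contain, the learnable parameters can be counted by hand. This is a sketch under assumptions: it takes the PyTorch defaults (a bias term on every Linear and Conv1d, an affine LayerNorm), four 16 -> 16 projections in AttentionLayer, and two kernel-size-1 convolutions as the FFN. The helper names are hypothetical.

```python
d_model, n_heads, d_ff = 16, 2, 64

def linear_params(n_in, n_out):
    # weight matrix plus bias vector
    return n_in * n_out + n_out

def conv1d_k1_params(c_in, c_out):
    # a kernel_size=1 convolution has the same parameter count as a Linear
    return c_in * c_out * 1 + c_out

def layernorm_params(dim):
    # learnable scale and shift
    return 2 * dim

attention_layer = 4 * linear_params(d_model, d_model)   # Q, K, V and output projections
ffn = conv1d_k1_params(d_model, d_ff) + conv1d_k1_params(d_ff, d_model)
block_norms = 2 * layernorm_params(d_model)              # norm1 and norm2
final_norm = layernorm_params(d_model)                   # Encoder.norm

total = attention_layer + ffn + block_norms + final_norm
print(attention_layer, ffn, block_norms, final_norm, total)
# 1088 2128 64 32 3312
```

FullAttention contributes nothing to the count under these assumptions: it only holds a dropout module and the attention arithmetic.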

3. The forward overview: how one Encoder call proceeds

Call site: PatchTST.forecast

python

enc_out, attns = self.encoder(enc_out)

At this point:

text

enc_out: (8,6,16)

Diagram:

![[zdocs/pytorch-basics/assets/patchtst_encoder_forward_steps.svg]]

Overall order:

text

Encoder.forward
  -> EncoderLayer.forward
      -> AttentionLayer.forward
          -> FullAttention.forward
      -> attention residual + norm1
      -> FFN conv1/conv2
      -> FFN residual + norm2
  -> Encoder final LayerNorm
  -> return x, attns

Output:

text

x:     (8,6,16)
attns: [None]

attns is [None] because FullAttention is built with output_attention=config.output_attention, which is False here, so the single layer returns None in place of its attention matrix.

4. Level 1: Encoder.forward is the scheduling layer

Source:

python

def forward(self, x, attn_mask=None, tau=None, delta=None):
    attns = []
    if self.conv_layers is not None:
        ...
    else:
        for attn_layer in self.attn_layers:
            x, attn = attn_layer(x, attn_mask=attn_mask, tau=tau, delta=delta)
            attns.append(attn)

    if self.norm is not None:
        x = self.norm(x)

    return x, attns

PatchTST builds the Encoder without passing conv_layers:

text

self.conv_layers = None

So the else branch runs:

python

for attn_layer in self.attn_layers:
    x, attn = attn_layer(x, ...)

In the toy setup:

text

self.attn_layers = ModuleList([EncoderLayer_0])

So the loop runs exactly once:

text

input x: (8,6,16)
EncoderLayer_0(x) -> (8,6,16), attn=None
attns.append(None) -> [None]
self.norm(x) -> LayerNorm(16), shape unchanged
return (8,6,16), [None]
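
The scheduling role can be reproduced with a stripped-down stand-in. The sketch below is not the library's Encoder: `ToyEncoder` keeps only the branch without conv_layers, and `PassThroughLayer` is a hypothetical placeholder that returns its input and no attention map.

```python
import torch
from torch import nn

class PassThroughLayer(nn.Module):
    """Hypothetical stand-in for EncoderLayer: returns x unchanged and no attention map."""
    def forward(self, x, attn_mask=None, tau=None, delta=None):
        return x, None

class ToyEncoder(nn.Module):
    """Simplified Encoder: only the branch taken when conv_layers is None."""
    def __init__(self, attn_layers, norm_layer=None):
        super().__init__()
        self.attn_layers = nn.ModuleList(attn_layers)
        self.norm = norm_layer

    def forward(self, x, attn_mask=None, tau=None, delta=None):
        attns = []
        for attn_layer in self.attn_layers:
            x, attn = attn_layer(x, attn_mask=attn_mask, tau=tau, delta=delta)
            attns.append(attn)          # one entry per layer, None when no map is returned
        if self.norm is not None:
            x = self.norm(x)            # final LayerNorm, shape unchanged
        return x, attns

encoder = ToyEncoder([PassThroughLayer()], norm_layer=nn.LayerNorm(16))
out, attns = encoder(torch.randn(8, 6, 16))
print(out.shape, attns)
```

The list `attns` always has one entry per layer, which is why a single layer with no attention output yields `[None]` rather than `None`.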

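Two normalizations need to be told apart here:

Location	Name	Role
Inside `EncoderLayer`	`norm1` / `norm2`	the two post-residual normalizations inside each block
Outer `Encoder`	`self.norm`	the final normalization after all EncoderLayers have run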

5. Level 2: EncoderLayer.forward is the Transformer block

Source:

python

def forward(self, x, attn_mask=None, tau=None, delta=None):
    new_x, attn = self.attention(x, x, x, attn_mask=attn_mask, tau=tau, delta=delta)
    x = x + self.dropout(new_x)

    y = x = self.norm1(x)
    y = self.dropout(self.activation(self.conv1(y.transpose(-1, 1))))
    y = self.dropout(self.conv2(y).transpose(-1, 1))

    return self.norm2(x + y), attn

5.1 Step 1: self-attention

python

new_x, attn = self.attention(x, x, x, ...)

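All three arguments are x, so this is self-attention:

text

queries = x
keys    = x
values  = x

Shapes:

text

x:     (8,6,16)
new_x: (8,6,16)
attn:  None

Logic:

text

Within each univariate patch sequence, the 6 patch tokens read information from one another.
Different variables never attend to each other, because the variables have been folded into the batch dimension B*C=8.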

5.2 Step 2: residual 1 + norm1

python

x = x + self.dropout(new_x)
y = x = self.norm1(x)

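Residual 1:

text

original token representation x
+ attention update new_x
= the original information is kept and cross-patch information is added

y = x = ... is a Python chained assignment:

text

self.norm1(x) is evaluated first
then both x and y are bound to that normalized result

The FFN lines that follow rebind y to new tensors, while x keeps the normalized value and serves as the main branch of residual 2.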
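
The binding behaviour of the chained assignment can be seen without any tensors. In this sketch `normalize` is a hypothetical stand-in for norm1 that only subtracts the mean, and plain lists replace tensors.

```python
def normalize(values):
    """Stand-in for norm1: subtract the mean (illustration only)."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

x = [1.0, 2.0, 3.0]

# Chained assignment: the right-hand side is evaluated once,
# then both names are bound to the same object.
y = x = normalize(x)
shared = y is x

# The FFN line rebinds y to a new object; x still refers to the normalized one.
y = [2.0 * v for v in y]
print(shared, x, y)
```

Since the FFN lines create new objects instead of modifying y in place, x is still the normalized value when the second residual `x + y` is computed.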

5.3 Step 3: conv1 of the FFN

python

y = self.dropout(self.activation(self.conv1(y.transpose(-1, 1))))

Current state:

text

y: (8,6,16) = (B*C, patch_num, d_model)

Conv1d expects input laid out as (B,C,L), so the last two axes are swapped first:

text

y.transpose(-1, 1)
(8,6,16) -> (8,16,6)

The semantics become:

text

B = 8
C = 16 = d_model
L = 6 = patch_num

conv1 = Conv1d(16 -> 64, kernel_size=1):

text

(8,16,6) -> (8,64,6)

What kernel_size=1 means:

text

each patch position independently gets a 16 -> 64 linear map;
adjacent patches are not mixed.
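
The claim that a kernel-size-1 convolution is a per-position linear map can be verified numerically. The sketch below assumes PyTorch and applies the convolution's own weights through `F.linear`; variable names are illustrative.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
conv1 = nn.Conv1d(in_channels=16, out_channels=64, kernel_size=1)
y = torch.randn(8, 6, 16)                       # (B*C, patch_num, d_model)

# Path used by EncoderLayer: transpose to (B, C, L), then convolve.
conv_out = conv1(y.transpose(-1, 1))            # (8, 64, 6)

# Same weights applied as a per-token linear map on the untransposed tensor.
weight = conv1.weight.squeeze(-1)               # (64, 16, 1) -> (64, 16)
linear_out = F.linear(y, weight, conv1.bias)    # (8, 6, 64)

matches = torch.allclose(conv_out.transpose(-1, 1), linear_out, atol=1e-5)
print(conv_out.shape, linear_out.shape, matches)
```

Both paths compute the same numbers, so the transposes exist only to satisfy the Conv1d layout and carry no modelling meaning.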

5.4 Step 4: conv2 of the FFN + transpose back

python

y = self.dropout(self.conv2(y).transpose(-1, 1))

conv2 = Conv1d(64 -> 16, kernel_size=1):

text

(8,64,6) -> (8,16,6)

Transpose back to the usual Transformer layout:

text

(8,16,6).transpose(-1,1) -> (8,6,16)

y now has the same shape as the residual main branch x, so the two can be added.
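
The whole FFN path can be traced shape by shape. This sketch assumes PyTorch, omits dropout, and uses ReLU as a placeholder activation, since the actual choice depends on config.activation.

```python
import torch
from torch import nn
import torch.nn.functional as F

conv1 = nn.Conv1d(16, 64, kernel_size=1)
conv2 = nn.Conv1d(64, 16, kernel_size=1)

x = torch.randn(8, 6, 16)            # residual main branch after norm1
shapes = [tuple(x.shape)]

y = x.transpose(-1, 1)               # (8, 16, 6)
shapes.append(tuple(y.shape))
y = F.relu(conv1(y))                 # (8, 64, 6); ReLU is an assumed activation
shapes.append(tuple(y.shape))
y = conv2(y)                         # (8, 16, 6)
shapes.append(tuple(y.shape))
y = y.transpose(-1, 1)               # (8, 6, 16)
shapes.append(tuple(y.shape))

residual = x + y                     # shapes agree, so the addition is valid
print(shapes, tuple(residual.shape))
```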

5.5 Step 5: residual 2 + norm2

python

return self.norm2(x + y), attn

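Shapes:

text

x:     (8,6,16)
y:     (8,6,16)
x + y: (8,6,16)
norm2: (8,6,16)

Logic:

text

attention lets the patches exchange information;
the FFN lets each patch token transform itself nonlinearly;
the two residuals preserve the original signal, so information is not lost as layers stack up.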

6. Level 3: AttentionLayer.forward is the multi-head format bridge

Source:

python

def forward(self, queries, keys, values, attn_mask, tau=None, delta=None):
    B, L, _ = queries.shape
    _, S, _ = keys.shape
    H = self.n_heads

    queries = self.query_projection(queries).view(B, L, H, -1)
    keys = self.key_projection(keys).view(B, S, H, -1)
    values = self.value_projection(values).view(B, S, H, -1)

    out, attn = self.inner_attention(
        queries, keys, values, attn_mask, tau=tau, delta=delta
    )
    out = out.view(B, L, -1)

    return self.out_projection(out), attn

The input comes from EncoderLayer:

text

queries = keys = values = x = (8,6,16)

Extracted variables:

text

B = 8
L = 6
S = 6
H = 2

Q/K/V projections:

text

query_projection: Linear(16, 16)
key_projection:   Linear(16, 16)
value_projection: Linear(16, 16)

Split into heads:

text

(8,6,16) -> view(8,6,2,-1) -> (8,6,2,8)

Here -1 is inferred automatically as:

\frac{16}{2} = 8

So:

text

queries: (8,6,2,8)
keys:    (8,6,2,8)
values:  (8,6,2,8)

After calling FullAttention:

text

out: (8,6,2,8)

Merge the heads:

text

out.view(8,6,-1)
(8,6,2,8) -> (8,6,16)

Output projection:

text

out_projection: Linear(16,16)
(8,6,16) -> (8,6,16)
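
Splitting and merging heads is pure reinterpretation of the last axis, which a short sketch can confirm. It assumes PyTorch; `projected` is a random stand-in for the output of query_projection.

```python
import torch

B, L, d_model, H = 8, 6, 16, 2
projected = torch.randn(B, L, d_model)       # stand-in for query_projection(x)

heads = projected.view(B, L, H, -1)          # (8, 6, 2, 8)
head_dim = heads.shape[-1]

# view only reinterprets the last axis: head h is channels [h*8, (h+1)*8).
head1_is_slice = torch.equal(
    heads[:, :, 1, :], projected[:, :, head_dim:2 * head_dim]
)

merged = heads.view(B, L, -1)                # back to (8, 6, 16)
round_trip = torch.equal(merged, projected)
print(heads.shape, head_dim, head1_is_slice, round_trip)
```

No data is moved by either view, so the only learned mixing across heads happens in the Linear projections before the split and in out_projection after the merge.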

7. Level 4: FullAttention.forward is the attention math

Source (core lines):

python

B, L, H, E = queries.shape
_, S, _, D = values.shape
scale = self.scale or 1.0 / sqrt(E)

scores = torch.einsum("blhe,bshe->bhls", queries, keys)
A = self.dropout(torch.softmax(scale * scores, dim=-1))
V = torch.einsum("bhls,bshd->blhd", A, values)

Input:

text

queries: (8,6,2,8)
keys:    (8,6,2,8)
values:  (8,6,2,8)

7.1 Computing scores

python

scores = torch.einsum("blhe,bshe->bhls", queries, keys)

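Index legend:

Letter	Meaning	Toy value
`b`	independent sequence, `B*C`	8
`l`	query patch position	6
`s`	key patch position	6
`h`	head index	2
`e`	per-head vector dimension	8

e does not appear in the output, which means the dot product is taken over the e dimension:

\text{scores}_{b, h, l, s} = \sum_{e} Q_{b, l, h, e} K_{b, s, h, e}

Output:

text

scores: (8,2,6,6)

For a fixed b,h this is simply: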

text

Q: (6,8)
K: (6,8)
Q @ K.T -> (6,6)
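
The equivalence between the einsum and a per-head matrix product can be checked directly. The sketch assumes PyTorch and random inputs; the chosen indices b=3, h=1 are arbitrary.

```python
import torch

torch.manual_seed(0)
queries = torch.randn(8, 6, 2, 8)    # (B, L, H, E)
keys = torch.randn(8, 6, 2, 8)       # (B, S, H, E)

scores = torch.einsum("blhe,bshe->bhls", queries, keys)   # (8, 2, 6, 6)

# Pick one sequence and one head, then do the plain matrix product.
b, h = 3, 1
Q = queries[b, :, h, :]              # (6, 8)
K = keys[b, :, h, :]                 # (6, 8)
manual = Q @ K.T                     # (6, 6)

matches = torch.allclose(scores[b, h], manual, atol=1e-5)
print(scores.shape, manual.shape, matches)
```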

7.2 Scale + softmax

python

scale = 1.0 / sqrt(E)

In the toy setup:

\text{scale} = \frac{1}{\sqrt{8}} \approx 0.354

python

A = torch.softmax(scale * scores, dim=-1)

dim=-1 is the key dimension S:

text

For each query patch, its weights over the 6 key patches sum to 1.

That is:

\sum_{s} A_{b, h, l, s} = 1

Output:

text

A: (8,2,6,6)
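
The scaling and normalization for a single query row can be worked through with the standard library alone. The raw scores below are made up for illustration and do not come from any model.

```python
import math

E = 8
scale = 1.0 / math.sqrt(E)

# Hypothetical raw scores of one query patch against 6 key patches.
raw_scores = [2.0, -1.0, 0.5, 0.0, 3.0, 1.0]

scaled = [scale * s for s in raw_scores]
shift = max(scaled)                              # subtract the max for numerical stability
exps = [math.exp(s - shift) for s in scaled]
weights = [e / sum(exps) for e in exps]

strongest_key = weights.index(max(weights))      # key position with the largest raw score
print(round(scale, 3), round(sum(weights), 6), strongest_key)
```

Scaling shrinks the gaps between scores before the exponential, which keeps the softmax from collapsing onto a single key when the per-head dimension grows.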

7.3 Weighted sum over values

python

V = torch.einsum("bhls,bshd->blhd", A, values)

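s does not appear in the output, which means a weighted sum is taken along the key patch dimension:

V'_{b, l, h, d} = \sum_{s} A_{b, h, l, s} V_{b, s, h, d}

Output:

text

V': (8,6,2,8)

This is the multi-head output handed back to AttentionLayer.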
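
One way to build trust in the second einsum is an extreme case: if all scores are equal, the softmax weights are uniform and every query position should receive the plain mean of the values. The sketch assumes PyTorch; the all-zero scores are a constructed case, not something the model produces.

```python
import torch

torch.manual_seed(0)
values = torch.randn(8, 6, 2, 8)                     # (B, S, H, D)

# Hypothetical extreme case: all scores equal, so softmax gives uniform weights 1/6.
scores = torch.zeros(8, 2, 6, 6)                     # (B, H, L, S)
A = torch.softmax(scores, dim=-1)

out = torch.einsum("bhls,bshd->blhd", A, values)     # (8, 6, 2, 8)

# With uniform weights every query position receives the mean over key positions.
mean_over_keys = values.mean(dim=1, keepdim=True).expand(-1, 6, -1, -1)
matches = torch.allclose(out, mean_over_keys, atol=1e-5)
print(out.shape, matches)
```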

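8. Numeric intuition for a single token

The walk-through below looks only at the first 4 dimensions of x[0,0,:]; the real dimension is 16. The values are made up purely to illustrate the flow.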

![[zdocs/pytorch-basics/assets/patchtst_encoder_numeric_token.svg]]

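Assume:

text

x[0,0,:4] = [1.0, 0.5, -0.5, 2.0]

Attention output:

text

new_x[0,0,:4] = [0.2, -0.1, 0.4, 0.0]

Residual 1:

text

x + new_x = [1.2, 0.4, -0.1, 2.0]

norm1 normalizes each token across its 16 dimensions, giving roughly zero mean and unit variance before LayerNorm's learnable scale and shift are applied. The FFN then produces:

text

y[0,0,:4] = [0.1, 0.3, -0.2, 0.2]

Residual 2:

text

norm2(x + y)

The final output is still a 16-dimensional token, but it now combines:

text

1. the original patch token information
2. attention information from the other patches
3. the FFN's nonlinear processing of the current token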
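
The first residual and the effect of a LayerNorm can be replayed with the made-up numbers above. This is an illustration under a loud assumption: the real norm1 normalizes over all 16 dimensions and has learnable scale and shift, whereas this sketch normalizes only the 4 visible values and omits the affine parameters.

```python
import math

x = [1.0, 0.5, -0.5, 2.0]
new_x = [0.2, -0.1, 0.4, 0.0]

# Residual 1: element-wise sum (rounded to hide floating-point noise).
residual = [round(a + b, 6) for a, b in zip(x, new_x)]

def layer_norm(values, eps=1e-5):
    """LayerNorm without the learnable scale and shift, over the given values."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)   # biased variance
    return [(v - mean) / math.sqrt(var + eps) for v in values]

normed = layer_norm(residual)
mean_after = sum(normed) / len(normed)
var_after = sum((v - mean_after) ** 2 for v in normed) / len(normed)
print(residual)
print(round(mean_after, 6), round(var_after, 3))
```

The ordering of the values survives normalization; only their offset and spread change, which is why the token still carries the residual information into the FFN.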

9. What happens after the Encoder

Back in PatchTST.forecast:

python

enc_out, attns = self.encoder(enc_out)

enc_out = torch.reshape(
    enc_out, (-1, n_vars, enc_out.shape[-2], enc_out.shape[-1])
)
enc_out = enc_out.permute(0, 1, 3, 2)

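Encoder output:

text

(8,6,16)

Restore B and C:

text

reshape(-1, n_vars=4, 6, 16)
(8,6,16) -> (2,4,6,16)

Prepare the layout for FlattenHead:

text

permute(0,1,3,2)
(2,4,6,16) -> (2,4,16,6)

In other words:

text

The Encoder only runs Transformer encoding over each variable's own 6 patch tokens;
it is not responsible for producing the pred_len forecast.
The forecast is produced by the FlattenHead that follows.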
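
The unfold back to separate batch and variable axes can be checked the same way as the fold. The sketch assumes PyTorch and a random tensor in place of the real Encoder output; `unfolded` and `head_input` are illustrative names.

```python
import torch

n_vars = 4
enc_out = torch.randn(8, 6, 16)                       # (B*C, patch_num, d_model)

unfolded = torch.reshape(
    enc_out, (-1, n_vars, enc_out.shape[-2], enc_out.shape[-1])
)                                                     # (2, 4, 6, 16)
head_input = unfolded.permute(0, 1, 3, 2)             # (2, 4, 16, 6)

# Token p of variable c in batch item b is still the same 16-dim vector.
b, c, p = 1, 3, 5
same_token = torch.equal(head_input[b, c, :, p], enc_out[b * n_vars + c, p, :])
print(unfolded.shape, head_input.shape, same_token)
```

The reshape and permute only relabel axes; every token vector produced by the Encoder reaches the head unchanged.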

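10. Summary in one sentence

The PatchTST Encoder instance can be remembered like this:

text

Encoder schedules 1 EncoderLayer;
EncoderLayer does attention residual + FFN residual;
AttentionLayer converts between d_model and the multi-head layout;
FullAttention does QK^T -> softmax -> A@V.

The shape stays the same throughout:

text

(B*C, patch_num, d_model) = (8,6,16)

What changes is the numeric meaning of each token: it goes from "the embedding of a single patch" to "a contextual representation that incorporates the other patches of the same variable".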
