Layer 3: EncoderLayer Close Reading

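In the parent layer (Layer 2B), the loop body of Encoder.forward calls attn_layer(x, ...).
This document covers only EncoderLayer.forward, i.e. the Transformer block structure.
For the child AttentionLayer and everything below it, see 04-Layer4-AttentionLayer.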

1. Position in the Parent Layer

Encoder.forward
  └─ for i, (attn_layer, conv_layer) in enumerate(zip(...)):
         x, attn = attn_layer(x, ...)   ← this document (EncoderLayer 0)
         x = conv_layer(x)
     x, attn = attn_layers[-1](x, ...)  ← this document (EncoderLayer 1)
          └─ self.attention(x, x, x, ...)   → see Layer4 AttentionLayer

2. I/O Interface Definition

python

def forward(self, x, attn_mask=None, tau=None, delta=None):

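| | Shape (toy, EncoderLayer 0) | Meaning |
|---|---|---|
| Input `x` | `(3, 10, 8)` = `(B, seq_len, d_model)` | Token sequence entering this layer |
| Output `x` | `(3, 10, 8)` | Result of the attention + FFN transform; shape unchanged |
| Output `attn` | `None` | `None` when `output_attention=False` |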

EncoderLayer 1 receives (3, 6, 8), the sequence after ConvLayer compression; the shape logic is otherwise identical.

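3. Sequence Diagram (Concrete Level)

4. Semantic Grouping Diagram (Index Level)

The two sub-blocks are fully symmetric: operation → residual → LayerNorm (the Post-norm form).
The structure matches PatchTST's EncoderLayer; the differences are that Informer uses ProbAttention and defaults to relu as the activation.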
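To make the "operation → residual → LayerNorm" ordering concrete, here is a minimal sketch that contrasts it with the pre-norm ordering. The helper names `post_norm_block` and `pre_norm_block` are hypothetical, and `nn.Linear` is only a placeholder for the attention or FFN sub-layer; none of this is taken from the Informer source.

```python
import torch
import torch.nn as nn

def post_norm_block(x, sublayer, norm, dropout):
    """Post-norm: operation -> residual add -> LayerNorm (the ordering used here)."""
    return norm(x + dropout(sublayer(x)))

def pre_norm_block(x, sublayer, norm, dropout):
    """Pre-norm, shown for contrast only: LayerNorm -> operation -> residual add."""
    return x + dropout(sublayer(norm(x)))

torch.manual_seed(0)
x = torch.randn(3, 10, 8)            # toy (B, seq_len, d_model)
norm = nn.LayerNorm(8)
dropout = nn.Dropout(0.1).eval()     # eval mode, so dropout is a no-op
sublayer = nn.Linear(8, 8)           # hypothetical placeholder sub-layer

post = post_norm_block(x, sublayer, norm, dropout)
pre = pre_norm_block(x, sublayer, norm, dropout)
print(post.shape, pre.shape)
```

In the post-norm form the block output is itself normalized, so with a freshly initialized LayerNorm each position of `post` has approximately zero mean over the feature dimension; the pre-norm output carries no such guarantee.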

5. Step-by-Step Walkthrough

5.0 Full Original Code

python

import torch.nn as nn
import torch.nn.functional as F


class EncoderLayer(nn.Module):
    def __init__(self, attention, d_model, d_ff=None, dropout=0.1, activation="relu"):
        super(EncoderLayer, self).__init__()
        d_ff = d_ff or 4 * d_model
        self.attention = attention
        self.conv1 = nn.Conv1d(in_channels=d_model, out_channels=d_ff, kernel_size=1)   # expand: d_model -> d_ff
        self.conv2 = nn.Conv1d(in_channels=d_ff, out_channels=d_model, kernel_size=1)   # project back: d_ff -> d_model
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        self.activation = F.relu if activation == "relu" else F.gelu

    def forward(self, x, attn_mask=None, tau=None, delta=None):
        # Sub-block 1: self-attention -> residual -> norm1
        new_x, attn = self.attention(x, x, x, attn_mask=attn_mask, tau=tau, delta=delta)
        x = x + self.dropout(new_x)

        # Sub-block 2: position-wise FFN -> residual -> norm2
        y = x = self.norm1(x)
        y = self.dropout(self.activation(self.conv1(y.transpose(-1, 1))))
        y = self.dropout(self.conv2(y).transpose(-1, 1))

        return self.norm2(x + y), attn

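5.1 Attention Residual Block (Steps 1 + 2)

Purpose of this section

ProbSparse self-attention + dropout residual + LayerNorm form the first sub-block; the double assignment y = x = norm1(x) sets up the residual for the FFN that follows.

Step 1: call AttentionLayer (self-attention)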

python

new_x, attn = self.attention(x, x, x, attn_mask=attn_mask, tau=tau, delta=delta)

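All three positional arguments receive the same x, filling queries, keys, values, so this is self-attention (Q=K=V=x). attn_mask=None is passed straight through to ProbAttention (which handles any causal mask internally).

Returns: new_x=(3,10,8), the new representation weighted by ProbSparse attention; attn=None.
→ For the internals of AttentionLayer, see 04-Layer4-AttentionLayer.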
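The call contract of step 1 can be exercised without the real AttentionLayer. The sketch below uses a hypothetical `StandInAttention` module that accepts the same arguments and returns `(out, None)`; it wraps PyTorch's `nn.MultiheadAttention`, which computes full attention rather than ProbSparse attention, so it only illustrates the interface and the shapes, not Informer's actual computation.

```python
import torch
import torch.nn as nn

class StandInAttention(nn.Module):
    """Hypothetical stand-in for AttentionLayer: full attention, NOT ProbSparse."""

    def __init__(self, d_model, n_heads):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, queries, keys, values, attn_mask=None, tau=None, delta=None):
        # tau and delta are accepted only to match the call signature; unused here.
        out, _ = self.mha(queries, keys, values, attn_mask=attn_mask, need_weights=False)
        return out, None   # mirrors output_attention=False

torch.manual_seed(0)
x = torch.randn(3, 10, 8)                      # toy (B, seq_len, d_model)
attention = StandInAttention(d_model=8, n_heads=2)

# Same x in all three slots: self-attention (Q=K=V=x)
new_x, attn = attention(x, x, x, attn_mask=None, tau=None, delta=None)
print(new_x.shape, attn)
```

Because the output keeps the `(B, seq_len, d_model)` shape of the input, it can be added to `x` directly in the residual of step 2.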

Step 2: residual ① + norm1

python

x = x + self.dropout(new_x)

y = x = self.norm1(x)

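x + dropout(new_x): the residual skip connection; dropout is applied to new_x before the addition, as regularization.

y = x = self.norm1(x) assigns twice in one line: after LayerNorm, x and y both refer to the same result. The FFN then rebinds y to new tensors while x keeps the norm1 output, and the final x + y forms residual ②.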

x_old: (3,10,8)
new_x: (3,10,8)  ← ProbAttention output
x = x_old + dropout(new_x): (3,10,8)
x = y = norm1(x): (3,10,8)
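The aliasing behind `y = x = self.norm1(x)` can be checked directly. This is a small sketch on a random toy tensor; `torch.relu` stands in for the first FFN operation, and the variable names `aliased_before`, `aliased_after`, `snapshot` and `x_intact` are introduced here for illustration only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
norm1 = nn.LayerNorm(8)
x = torch.randn(3, 10, 8)

y = x = norm1(x)                 # both names now refer to the same tensor
aliased_before = y is x

snapshot = x.clone()             # copy of the norm1 output, for comparison
y = torch.relu(y)                # out-of-place op: y is rebound to a new tensor
aliased_after = y is x
x_intact = torch.equal(x, snapshot)

print(aliased_before, aliased_after, x_intact)
```

The FFN lines in `forward` are all out-of-place, so `x` still holds the norm1 output when `x + y` is computed. An in-place operation on `y` immediately after the double assignment would change `x` as well, which is why the rebinding matters.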

5.2 FFN Residual Block (Steps 3 + 4 + 5)

Purpose of this section

Position-wise FFN: every position in the sequence independently goes through Linear(8→24) → relu → Linear(24→8), implemented with Conv1d(k=1). Together with the dropout residual and LayerNorm, this forms the second sub-block.

Step 3: FFN first layer (expansion + activation)

python

y = self.dropout(self.activation(self.conv1(y.transpose(-1, 1))))

y.transpose(-1, 1): (3, 10, 8) → (3, 8, 10)   Conv1d expects (B, C, L) layout
conv1: Conv1d(8, 24, k=1): (3, 8, 10) → (3, 24, 10)
relu(...): element-wise activation (Informer defaults to relu; PatchTST uses gelu)
dropout: (3, 24, 10)
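The claim that Conv1d(k=1) acts as a per-position Linear can be verified numerically. The sketch below copies the weights of a `Conv1d(8, 24, kernel_size=1)` into an `nn.Linear(8, 24)` and compares the two paths on a random toy input; the names `via_conv`, `via_linear` and `max_diff` are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv1 = nn.Conv1d(in_channels=8, out_channels=24, kernel_size=1)
linear = nn.Linear(8, 24)

# Share parameters: conv weight is (24, 8, 1), linear weight is (24, 8)
with torch.no_grad():
    linear.weight.copy_(conv1.weight.squeeze(-1))
    linear.bias.copy_(conv1.bias)

y = torch.randn(3, 10, 8)                                  # (B, seq_len, d_model)
via_conv = conv1(y.transpose(-1, 1)).transpose(-1, 1)      # (B, C, L) round trip
via_linear = linear(y)                                     # acts on the last dim directly

max_diff = (via_conv - via_linear).abs().max().item()
print(via_conv.shape, max_diff)
```

The two transposes exist only because Conv1d expects channels in dimension 1; the arithmetic applied at each position is the same affine map.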

Step 4: FFN second layer (projection back)

python

y = self.dropout(self.conv2(y).transpose(-1, 1))

conv2: Conv1d(24, 8, k=1): (3, 24, 10) → (3, 8, 10)
transpose(-1, 1): (3, 8, 10) → (3, 10, 8)
dropout: (3, 10, 8)

Step 5: residual ② + norm2

python

return self.norm2(x + y), attn

x: (3,10,8)  ← norm1 output (assigned in step 2, untouched by the FFN)
y: (3,10,8)  ← FFN output
norm2(x + y): (3,10,8)
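Steps 3 to 5 can be replayed end to end to confirm the shape trace. This is a minimal sketch with freshly initialized modules and the toy sizes d_model=8, d_ff=24; the input is a random stand-in for the norm1 output, and the `shapes` dictionary is a bookkeeping helper added for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
d_model, d_ff = 8, 24                       # toy sizes used in the traces
conv1 = nn.Conv1d(d_model, d_ff, kernel_size=1)
conv2 = nn.Conv1d(d_ff, d_model, kernel_size=1)
norm2 = nn.LayerNorm(d_model)
dropout = nn.Dropout(0.1).eval()            # eval mode, so dropout is a no-op

x = torch.randn(3, 10, d_model)             # random stand-in for the norm1 output
shapes = {}

y = x.transpose(-1, 1)                      # step 3: to (B, C, L)
shapes["transpose_in"] = tuple(y.shape)
y = dropout(F.relu(conv1(y)))               # step 3: expand + activation
shapes["conv1"] = tuple(y.shape)
y = conv2(y)                                # step 4: project back
shapes["conv2"] = tuple(y.shape)
y = dropout(y.transpose(-1, 1))             # step 4: back to (B, L, C)
shapes["transpose_out"] = tuple(y.shape)
out = norm2(x + y)                          # step 5: residual + norm2
shapes["norm2"] = tuple(out.shape)

for name, shape in shapes.items():
    print(f"{name:14s} {shape}")
```

Note that `x` is never reassigned in this block, which matches the comment in the trace that the norm1 output reaches the residual untouched.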

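6. Drill-Down Subcomponents

| Subcomponent | Responsibility | Lower-level document |
|---|---|---|
| `AttentionLayer` (`self.attention`) | Bridge between the d_model layout and the multi-head layout; Q/K/V projections + delegation to ProbAttention | 04-Layer4-AttentionLayer |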
