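4C Encoder Main Chain

Abstract

This document is:
the drill-down for the 4C Encoder sub-block of 04-Level4-short_forecast五段总览 (the five-stage overview).
It covers one thing only:
how Informer's encoder turns the embedded history representation into a context representation that the decoder can read.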

1. Context

Upper level:

Lower level:

The entry-point code for this layer is:

python

enc_out, attns = self.encoder(enc_out, attn_mask=None)

The output of this layer is:

python

# L' == L without distilling; L' can be shorter than L when distil=True
# enc_out.shape == (B, L', d_model)

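2. First principle of this layer

The reason this layer exists is:

to turn "a separate hidden representation for each time step" into "a context representation in which every time step has seen the whole history".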

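3. Entry parameters and output semantics of this layer

3.1 Inputs

x
- hidden representation fed to the encoder, shape (B, L, d_model)
d_model
- length of each time step's vector
e_layers
- number of encoder layers
n_heads
- number of attention heads
d_ff
- hidden width of the FFN
distil
- whether to insert a ConvLayer that compresses the sequence length

3.2 Outputs

enc_out
- final context representation, shape (B, L', d_model)
attns
- list of attention weights, one entry per EncoderLayer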

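4. Sequence diagram

5. Abstraction tree

6. Current real example and toy example

6.1 Real run example

In the current real example, the adapter's default parameters fill in: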

e_layers = 2
n_heads = 8
d_ff = 128
distil = True
task_name = "short_term_forecast"

So on the current real path:

text

EncoderLayer 1 -> ConvLayer -> EncoderLayer 2 -> LayerNorm
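
The path above follows from how `Encoder.forward` pairs each ConvLayer with an EncoderLayer. As a minimal sketch, the execution order can be listed for any `e_layers` / `distil` combination. The helper name `encoder_layer_order` is hypothetical, and the sketch assumes that distilling supplies `e_layers - 1` ConvLayers and that a final LayerNorm is configured.

```python
def encoder_layer_order(e_layers, distil):
    """Return sub-layer names in execution order (hypothetical helper)."""
    order = []
    if distil:
        # Assumption: e_layers - 1 ConvLayers, so every EncoderLayer except
        # the last one is followed by a ConvLayer.
        for i in range(1, e_layers):
            order.append(f"EncoderLayer {i}")
            order.append("ConvLayer")
        order.append(f"EncoderLayer {e_layers}")
    else:
        order.extend(f"EncoderLayer {i}" for i in range(1, e_layers + 1))
    order.append("LayerNorm")  # final norm, applied once after all layers
    return order


print(" -> ".join(encoder_layer_order(e_layers=2, distil=True)))
# EncoderLayer 1 -> ConvLayer -> EncoderLayer 2 -> LayerNorm
```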

6.2 Fixed toy example

B = 1
L = 4
d_model = 4
d_ff = 8
e_layers = 2
distil = True

python

x0 = [
    [e11, e12, e13, e14],
    [e21, e22, e23, e24],
    [e31, e32, e33, e34],
    [e41, e42, e43, e44],
]  # (1, 4, 4)

7. Code block 1: `EncoderLayer.forward(...)`

Location:

Transformer_EncDec.py

Full code:

python

import torch.nn as nn
import torch.nn.functional as F


class EncoderLayer(nn.Module):
    def __init__(self, attention, d_model, d_ff=None, dropout=0.1, activation="relu"):
        super(EncoderLayer, self).__init__()
        d_ff = d_ff or 4 * d_model  # default FFN width is 4 * d_model
        self.attention = attention
        # kernel_size=1 convolutions act as a position-wise feed-forward network
        self.conv1 = nn.Conv1d(in_channels=d_model, out_channels=d_ff, kernel_size=1)
        self.conv2 = nn.Conv1d(in_channels=d_ff, out_channels=d_model, kernel_size=1)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        self.activation = F.relu if activation == "relu" else F.gelu

    def forward(self, x, attn_mask=None, tau=None, delta=None):
        # Sub-block A: self-attention (queries, keys, values all come from x) + residual
        new_x, attn = self.attention(x, x, x, attn_mask=attn_mask, tau=tau, delta=delta)
        x = x + self.dropout(new_x)

        # Sub-block B: norm1, FFN in (B, d_model, L) layout, then residual + norm2
        y = x = self.norm1(x)
        y = self.dropout(self.activation(self.conv1(y.transpose(-1, 1))))
        y = self.dropout(self.conv2(y).transpose(-1, 1))

        return self.norm2(x + y), attn
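
Two details of `__init__` are easy to check by hand: the `d_ff` default and the size of the FFN. The sketch below is only an illustration; `resolve_d_ff` and `ffn_param_count` are hypothetical helper names, and the count assumes the default `bias=True` of `nn.Conv1d`.

```python
def resolve_d_ff(d_model, d_ff=None):
    """Mirror `d_ff = d_ff or 4 * d_model` from EncoderLayer.__init__."""
    return d_ff or 4 * d_model


def ffn_param_count(d_model, d_ff=None):
    """Count weights and biases of conv1 and conv2 (kernel_size=1, bias on)."""
    d_ff = resolve_d_ff(d_model, d_ff)
    conv1 = d_model * d_ff * 1 + d_ff     # weight (d_ff, d_model, 1) + bias
    conv2 = d_ff * d_model * 1 + d_model  # weight (d_model, d_ff, 1) + bias
    return conv1 + conv2


print(resolve_d_ff(4))        # 16
print(ffn_param_count(4, 8))  # 76
```

With the toy values from section 6.2 (d_model = 4, d_ff = 8) the FFN holds 76 parameters; the two LayerNorms and the attention module are not counted here.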

7.1 Sub-block A: self-attention

Corresponding code:

python

new_x, attn = self.attention(x, x, x, ...)
x = x + self.dropout(new_x)

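Toy tensor evolution

text

Input x0 = (1, 4, 4)

Step 1: self-attention reads the whole history
  x0 -> new_x
  new_x =
  [
    [a11, a12, a13, a14],
    [a21, a22, a23, a24],
    [a31, a32, a33, a34],
    [a41, a42, a43, a44],
  ]  (1, 4, 4)

  To see inside the attention, fix toy attention weights for time step 1:
    alpha_1 = [0.4, 0.3, 0.2, 0.1]

  If the four value vectors are (toy simplification: the value projection inside the attention is ignored, so each value equals an input row):
    v1 = [e11, e12, e13, e14]
    v2 = [e21, e22, e23, e24]
    v3 = [e31, e32, e33, e34]
    v4 = [e41, e42, e43, e44]

  Then the new representation of time step 1 is:
    new_x[1] = 0.4*v1 + 0.3*v2 + 0.2*v3 + 0.1*v4

  In other words, attention does not change the shape;
  its essence is a weighted sum of the value vectors of the whole history.

Step 2: residual
  x1 = x0 + new_x = (1, 4, 4)

Input / output semantics of this step

Input x0
- current hidden representation of the history window
Output new_x
- incremental representation of each time step after it has seen the whole history
Output x1
- representation that keeps the original information and adds the context on top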
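
The weighted sum and the residual can be reproduced with plain numbers. This is a minimal sketch under the same toy simplification: the value vectors are made-up one-hot rows, there is a single head, and projections and dropout are left out. The helper `attend` is hypothetical.

```python
def attend(weights, values):
    """Weighted sum of value vectors for one query position."""
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


alpha_1 = [0.4, 0.3, 0.2, 0.1]  # toy attention weights of time step 1
values = [                       # made-up value vectors v1..v4
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

new_x_1 = attend(alpha_1, values)                     # context for step 1
x1_row = [a + b for a, b in zip(values[0], new_x_1)]  # residual: x0 row + new_x row

print(new_x_1)      # same length as the input row: shape is unchanged
print(len(x1_row))  # 4
```

Because the value vectors are one-hot here, `new_x_1` simply reads back the weights, which makes it easy to see that every history position contributes in proportion to its attention weight.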

7.2 Sub-block B: norm + FFN

Corresponding code:

python

y = x = self.norm1(x)
y = self.dropout(self.activation(self.conv1(y.transpose(-1, 1))))
y = self.dropout(self.conv2(y).transpose(-1, 1))
return self.norm2(x + y), attn

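Toy tensor evolution

text

Input x1 = (1, 4, 4)    layout (B, L, d_model)

Step 1: LayerNorm
  x2 = norm1(x1) = (1, 4, 4)

Step 2: transpose for Conv1d
  x2_t = (1, 4, 4)    layout (B, d_model, L); the numbers match only because L = d_model = 4

Step 3: conv1: d_model=4 -> d_ff=8
  h = (1, 8, 4)    layout (B, d_ff, L)

Step 4: activation + dropout
  h' = (1, 8, 4)

Step 5: conv2: d_ff=8 -> d_model=4
  y_t = (1, 4, 4)    layout (B, d_model, L)

Step 6: transpose back
  y = (1, 4, 4)    layout (B, L, d_model)

Step 7: residual + norm2
  x3 = norm2(x2 + y) = (1, 4, 4)

Input / output semantics of this step

Input x1
- hidden representation that already contains context
Intermediate h
- per-position representation widened by the FFN
Output x3
- final output of a single encoder layer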
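
Since both convolutions use kernel_size=1, the FFN treats every time step independently. The sketch below mirrors steps 1 to 7 in pure Python for a single batch element. It is an illustration only: weights are arbitrary made-up numbers, biases, dropout and the learnable LayerNorm scale/shift are omitted, and L = 5 is chosen so that it cannot be confused with d_model = 4. All function names are hypothetical.

```python
import math


def layer_norm(row, eps=1e-5):
    """Normalize one time step over its d_model features (no affine part)."""
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)
    return [(v - mean) / math.sqrt(var + eps) for v in row]


def pointwise_ffn(row, w1, w2):
    """conv1 (d_model -> d_ff), ReLU, conv2 (d_ff -> d_model) for one time step."""
    h = [max(0.0, sum(w * v for w, v in zip(w_row, row))) for w_row in w1]
    return [sum(w * v for w, v in zip(w_row, h)) for w_row in w2]


def norm_ffn_block(x1, w1, w2):
    """x2 = norm1(x1); y = FFN(x2); x3 = norm2(x2 + y)."""
    x2 = [layer_norm(row) for row in x1]
    y = [pointwise_ffn(row, w1, w2) for row in x2]
    return [layer_norm([a + b for a, b in zip(r, yr)]) for r, yr in zip(x2, y)]


d_model, d_ff, L = 4, 8, 5
w1 = [[0.1 * ((i + j) % 3 - 1) for j in range(d_model)] for i in range(d_ff)]
w2 = [[0.1 * ((i * j) % 3 - 1) for j in range(d_ff)] for i in range(d_model)]
x1 = [[float((t + 1) * (c + 1) % 7) for c in range(d_model)] for t in range(L)]

x3 = norm_ffn_block(x1, w1, w2)
print(len(x3), len(x3[0]))  # 5 4
```

Changing one row of `x1` changes only the matching row of `x3`, which is the sense in which this sub-block is position-wise; mixing across time steps happens only in the attention sub-block.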

8. Code block 2: `ConvLayer.forward(...)`

Full code:

python

import torch.nn as nn


class ConvLayer(nn.Module):
    def __init__(self, c_in):
        super(ConvLayer, self).__init__()
        # kernel_size=3 with padding=2 lengthens the sequence: L -> L + 2
        self.downConv = nn.Conv1d(
            in_channels=c_in,
            out_channels=c_in,
            kernel_size=3,
            padding=2,
            padding_mode="circular",
        )
        self.norm = nn.BatchNorm1d(c_in)
        self.activation = nn.ELU()
        # stride=2 roughly halves the sequence length
        self.maxPool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)

    def forward(self, x):
        x = self.downConv(x.permute(0, 2, 1))  # (B, L, C) -> (B, C, L + 2)
        x = self.norm(x)
        x = self.activation(x)
        x = self.maxPool(x)  # (B, C, L') with L' = (L + 1) // 2 + 1
        x = x.transpose(1, 2)  # back to (B, L', C)
        return x

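8.1 Toy tensor evolution

text

Input x3 = (1, 4, 4)

Step 1: permute -> (1, 4, 4)    layout (B, d_model, L)
Step 2: downConv -> channel count stays 4; with kernel_size=3 and padding=2 the length grows 4 -> 6
  To see what a kernel does, fix one set of toy weights for output channel 1 (the kernel applied to a single input channel):
    [1, 0, -1]
  If the local window seen at some position is [x_a, x_b, x_c],
  that kernel contributes x_a - x_c to the output.

Step 3: BatchNorm + ELU
  standardize the convolution result, then apply the ELU nonlinearity

Step 4: MaxPool(kernel=3, stride=2, padding=1)
  values are no longer kept point by point; the maximum of each local window is taken
  the time length is compressed
  6 -> 3, so 4 -> 3 overall (a variant with conv padding=1 would give 4 -> 2)

Output x4 = (1, L_small, 4), with L_small = 3 here

Input / output semantics of this step

Input x3
- output of a single encoder layer
Output x4
- encoder representation after the time length has been compressed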
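
The length change 4 -> 6 -> 3 can be checked on a single channel with the toy kernel [1, 0, -1]. This is a rough sketch, not the layer itself: BatchNorm and ELU are skipped, only one channel is shown, and the input numbers are made up. The helpers `circular_conv1d` and `max_pool1d` are hypothetical pure-Python stand-ins for the PyTorch modules.

```python
def circular_conv1d(seq, kernel, padding):
    """1D cross-correlation with circular padding on both sides, stride 1."""
    n = len(seq)
    padded = [seq[(i - padding) % n] for i in range(n + 2 * padding)]
    k = len(kernel)
    return [
        sum(kernel[j] * padded[i + j] for j in range(k))
        for i in range(len(padded) - k + 1)
    ]


def max_pool1d(seq, kernel_size, stride, padding):
    """Max pooling; padded positions never win because they are -inf."""
    neg_inf = float("-inf")
    padded = [neg_inf] * padding + list(seq) + [neg_inf] * padding
    n_out = (len(padded) - kernel_size) // stride + 1
    return [max(padded[i * stride:i * stride + kernel_size]) for i in range(n_out)]


channel = [0, 1, 3, 2]  # one channel over L = 4 time steps (made-up values)

conv = circular_conv1d(channel, kernel=[1, 0, -1], padding=2)
pooled = max_pool1d(conv, kernel_size=3, stride=2, padding=1)

print(conv)    # [3, 1, -3, -1, 3, 1]  -> length 6
print(pooled)  # [3, 1, 3]             -> length 3
```

Each conv output is "left neighbour minus right neighbour" of a wrapped-around window, which is the x_a - x_c pattern from step 2.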

9. Code block 3: `Encoder.forward(...)`

Full code:

python

import torch.nn as nn


class Encoder(nn.Module):
    def __init__(self, attn_layers, conv_layers=None, norm_layer=None):
        super(Encoder, self).__init__()
        self.attn_layers = nn.ModuleList(attn_layers)
        self.conv_layers = (
            nn.ModuleList(conv_layers) if conv_layers is not None else None
        )
        self.norm = norm_layer

    def forward(self, x, attn_mask=None, tau=None, delta=None):
        attns = []  # one attention output per EncoderLayer call
        if self.conv_layers is not None:
            # Distilling path: zip stops at the shorter list, so each ConvLayer
            # runs right after its paired EncoderLayer
            for i, (attn_layer, conv_layer) in enumerate(
                zip(self.attn_layers, self.conv_layers)
            ):
                delta = delta if i == 0 else None  # delta only reaches the first layer
                x, attn = attn_layer(x, attn_mask=attn_mask, tau=tau, delta=delta)
                x = conv_layer(x)
                attns.append(attn)
            # The final EncoderLayer has no ConvLayer after it
            x, attn = self.attn_layers[-1](x, tau=tau, delta=None)
            attns.append(attn)
        else:
            # Plain path: EncoderLayers only, sequence length is unchanged
            for attn_layer in self.attn_layers:
                x, attn = attn_layer(x, attn_mask=attn_mask, tau=tau, delta=delta)
                attns.append(attn)

        if self.norm is not None:
            x = self.norm(x)

        return x, attns

9.1 Toy tensor evolution

text

Input x0 = (1, 4, 4)

Step 1: first EncoderLayer
  x0 -> x3 = (1, 4, 4)

Step 2: ConvLayer compresses the length
  x3 -> x4 = (1, L_small, 4)

Step 3: last EncoderLayer
  x4 -> x5 = (1, L_small, 4)

Step 4: LayerNorm
  x6 = (1, L_small, 4)

Output:
  enc_out = x6
  attns = [attn_layer1, attn_layer2]

9.2 Input / output semantics of this section

Input x0
- history representation after embedding
Output enc_out
- context representation read by the decoder's cross-attention
Output attns
- list of attention weights from each layer
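
How long `enc_out` ends up (the L' in its shape) depends only on the ConvLayers, because EncoderLayer and LayerNorm keep the length. A small sketch of that arithmetic follows, using the standard output-length formulas of Conv1d and MaxPool1d with the kernel, stride and padding values shown in section 8. The function names are hypothetical, and the sketch assumes `e_layers - 1` ConvLayers when distilling.

```python
def conv_layer_out_len(length, conv_padding=2):
    """Length after one ConvLayer: Conv1d(k=3, stride=1) then MaxPool1d(k=3, s=2, p=1)."""
    conv_len = length + 2 * conv_padding - 3 + 1  # Conv1d, kernel 3, stride 1
    return (conv_len + 2 * 1 - 3) // 2 + 1        # MaxPool1d, kernel 3, stride 2, padding 1


def encoder_out_len(length, e_layers, distil):
    """Length L' of enc_out for an input of length L."""
    if distil:
        # Assumption: one ConvLayer between each pair of consecutive EncoderLayers
        for _ in range(e_layers - 1):
            length = conv_layer_out_len(length)
    return length


print(encoder_out_len(4, e_layers=2, distil=True))    # 3
print(encoder_out_len(4, e_layers=2, distil=False))   # 4
print(encoder_out_len(96, e_layers=3, distil=True))   # 26  (96 -> 49 -> 26)
```

For the toy input this gives L_small = 3; the length 96 is only an illustrative value, not a setting taken from the real run.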

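10. What this layer really pins down

On the current real path the encoder is not a plain two-layer pass-through, but:
- EncoderLayer -> ConvLayer -> EncoderLayer
d_model never changes
With distil=True, the length dimension L may get shorter
The key parameters of this layer are:
- e_layers
- distil
- n_heads
- d_ff

11. Next step

Continue with: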

DLinear_v1_archive

Informer_v1_archive

PatchTST_v1_archive

12-SelfAttention_Family

01-DLinear

02-PatchTST

03-Informer
