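Level 4: A Close Reading of the Full encoder Chain

Abstract

This note corresponds to Level 4 of 00-DLinear总览与Level树 (the DLinear overview and Level tree).
It does exactly one thing:
it walks encoder(x_enc) from entry to exit along a single chain, covering the code at each step, how the shape changes, and why each step is needed.
Nothing is split off into 04A/04B, and there is no jumping back and forth.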

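1. First principles

The core of DLinear is not attention and not stacked encoders. It first splits the series into a "slowly varying" part and a "fast varying" part, applies the simplest possible linear extrapolation along time to each part separately, and then adds the two results.

After reading this note you should be able to answer:

What are seasonal and trend, and how are they computed?
Why is there a permute before the Linear?
What shape is the weight of Linear_Seasonal? What value is it initialized to, and why?
What is the difference between individual=False and individual=True, and how does the code differ?
How does the shape change at every step of the chain?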

2. Context

Previous level: 03-Level3-forward主链 (the forward main chain)

Entry code:

python

# DLinear.forecast → DLinear.encoder
return self.encoder(x_enc)

Toy entry shape: x_enc: (2, 6, 3), i.e. (B=2, seq_len=6, enc_in=3)

Toy exit shape: (2, 2, 3), i.e. (B=2, pred_len=2, enc_in=3)

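Principle → code mapping

Of the three models, DLinear is the one whose code stays closest to the figure in the paper; the correspondence is essentially one to one:

Paper step	Code	Location	Notes
Series decomposition x = seasonal + trend	`self.decompsition(x)`	`DLinear.py:encoder()`	series_decomp → (seasonal_init, trend_init); ⚠️ the attribute name is misspelled (missing an "o")
Move the time axis to the last dim	`.permute(0,2,1)`	Step 2	Linear acts on the last dim, so seq_len must be moved to dim=-1
Seasonal linear extrapolation	`self.Linear_Seasonal(seasonal_init)`	Step 3	Linear(seq_len → pred_len), maps the history to the future
Trend linear extrapolation	`self.Linear_Trend(trend_init)`	Step 3	Runs fully in parallel with the seasonal branch; the two do not interact
Add the two branches	`seasonal_output + trend_output`	Step 4	The "+" in the paper figure, element-wise addition
Move the time axis back	`.permute(0,2,1)`	Step 4	Back to (B, pred_len, enc_in) for the caller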

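Why is there a permute both before and after the Linear?

Rule for nn.Linear: it acts only on the last dimension,
                    turning (..., in_features) into (..., out_features)

Original x: (B=2, seq_len=6, enc_in=3)
                               ↑
                last dim = enc_in (number of variables), not the time axis!

Linear(seq_len→pred_len) has to act on the time axis
→ first permute(0,2,1) to move seq_len to the last dim: (2,3,6)
→ only then can Linear map 6 history steps to 2 prediction steps: (2,3,2)
→ permute(0,2,1) back: (2,2,3) = (B, pred_len, enc_in)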
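
The three arrows above can be checked directly in PyTorch. The following is a minimal sketch using the toy sizes from this note; the tensor is random and the variable names are chosen here for illustration, they are not taken from DLinear.py.

```python
import torch
import torch.nn as nn

B, seq_len, enc_in, pred_len = 2, 6, 3, 2    # toy sizes used in this note
x = torch.randn(B, seq_len, enc_in)          # (2, 6, 3): time axis in the middle
linear = nn.Linear(seq_len, pred_len)        # maps the LAST dim: 6 -> 2

x_t = x.permute(0, 2, 1)                     # (2, 3, 6): time axis moved to the end
y_t = linear(x_t)                            # (2, 3, 2): 6 history steps -> 2 future steps
y = y_t.permute(0, 2, 1)                     # (2, 2, 3): back to (B, pred_len, enc_in)

shapes = [tuple(t.shape) for t in (x, x_t, y_t, y)]
print(shapes)
```

The printed list follows the same order as the arrows: input, after the first permute, after Linear, after the second permute.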

3. Full-chain sequence diagram

4. Full code (with comments)

Location: models/DLinear.py

python

def encoder(self, x):
    # ── Step 1: series decomposition ────────────────────────────
    # input  x: (B, seq_len, C)
    # output seasonal_init, trend_init: each (B, seq_len, C)
    # ⚠️  The attribute name is misspelled in the source: self.decompsition (missing an "o").
    #     It is the original typo in DLinear.py, so the call here keeps the same spelling.
    seasonal_init, trend_init = self.decompsition(x)

    # ── Step 2: permute, move the time axis to the last dim ─────
    # nn.Linear acts on the last dim, so seq_len must come last
    # (B, seq_len, C) → (B, C, seq_len)
    # ⚠️  Line 66 of the original source had a bug: seasonal_ init (with a space in the
    #     middle), which raises a SyntaxError; it has been fixed in the source to seasonal_init
    seasonal_init = seasonal_init.permute(0, 2, 1)
    trend_init    = trend_init.permute(0, 2, 1)

    # ── Step 3: linear extrapolation ────────────────────────────
    if self.individual:
        # individual=True: every variable channel has its own set of Linear parameters
        seasonal_output = torch.zeros(
            [seasonal_init.size(0), seasonal_init.size(1), self.pred_len],
            dtype=seasonal_init.dtype,
        ).to(seasonal_init.device)
        trend_output = torch.zeros(
            [trend_init.size(0), trend_init.size(1), self.pred_len],
            dtype=trend_init.dtype,
        ).to(trend_init.device)
        for i in range(self.channels):
            seasonal_output[:, i, :] = self.Linear_Seasonal[i](seasonal_init[:, i, :])
            trend_output[:, i, :]    = self.Linear_Trend[i](trend_init[:, i, :])
    else:
        # individual=False: all variables share one set of Linear parameters (default)
        # (B, C, seq_len) → (B, C, pred_len)
        seasonal_output = self.Linear_Seasonal(seasonal_init)
        trend_output    = self.Linear_Trend(trend_init)

    # ── Step 4: add the two branches, then permute back to (B, pred_len, C) ──
    x = seasonal_output + trend_output         # (B, C, pred_len)
    return x.permute(0, 2, 1)                  # (B, pred_len, C)

5. Step 1: `series_decomp` (series decomposition)

5.1 Full code

Location: layers/Autoformer_EncDec.py

python

import torch
import torch.nn as nn


class moving_avg(nn.Module):
    def __init__(self, kernel_size, stride):
        super(moving_avg, self).__init__()
        self.kernel_size = kernel_size
        self.avg = nn.AvgPool1d(kernel_size=kernel_size, stride=stride, padding=0)

    def forward(self, x):
        # x: (B, seq_len, C)

        # Step 1: pad both ends by repeating the edge values, so the length is preserved
        front = x[:, 0:1, :].repeat(1, (self.kernel_size - 1) // 2, 1)
        end   = x[:, -1:, :].repeat(1, (self.kernel_size - 1) // 2, 1)
        x = torch.cat([front, x, end], dim=1)
        # length after padding: seq_len + 2 * ((kernel_size-1)//2)

        # Step 2: AvgPool1d expects (B, C, L), so permute first
        x = self.avg(x.permute(0, 2, 1))
        # AvgPool1d output length (stride=1): padded length - kernel_size + 1,
        # which equals seq_len when kernel_size is odd

        # Step 3: permute back to (B, seq_len, C)
        x = x.permute(0, 2, 1)
        return x


class series_decomp(nn.Module):
    def __init__(self, kernel_size):
        super(series_decomp, self).__init__()
        self.moving_avg = moving_avg(kernel_size, stride=1)

    def forward(self, x):
        moving_mean = self.moving_avg(x)   # slow variation (trend)
        res = x - moving_mean              # fast variation (residual / seasonal)
        return res, moving_mean
        # Note the return order: first seasonal (res), then trend (moving_mean)

5.2 Step-by-step worked calculation

Step 1: pad both ends

Formula:

pad_size = (kernel_size - 1) // 2

toy: kernel_size=3 → pad_size = (3-1)//2 = 1

front = x[:, 0:1, :].repeat(1, 1, 1)   ← copy the first time step once
end   = x[:, -1:, :].repeat(1, 1, 1)   ← copy the last time step once

Why (kernel_size-1)//2?

For a window of odd size kernel_size, aligning every output with the "center position" of the original series requires padding (kernel_size-1)//2 steps at each end.
For example, with kernel_size=3 the window [x_{t-1}, x_t, x_{t+1}] produces the output for x_t, so each end needs 1 padded step.

Toy shape after padding:

(B=2, seq_len=6, C=3)
→ cat([front(2,1,3), x(2,6,3), end(2,1,3)], dim=1)
→ (2, 8, 3)

Toy batch=0, first feature: the series [1,2,3,4,5,6] becomes

[1, 1, 2, 3, 4, 5, 6, 6]
 ↑pad   original      ↑pad
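
The padding rule can be reproduced on a plain Python list. This is a small sketch, assuming a single 1-D series instead of a (B, seq_len, C) tensor; `replicate_pad` is a helper name introduced here for illustration and does not exist in the source.

```python
def replicate_pad(series, kernel_size):
    """Repeat the first and last value (kernel_size - 1) // 2 times at each end."""
    pad = (kernel_size - 1) // 2
    front = [series[0]] * pad      # copies of the first time step
    end = [series[-1]] * pad       # copies of the last time step
    return front + list(series) + end


padded = replicate_pad([1, 2, 3, 4, 5, 6], kernel_size=3)
print(padded)  # [1, 1, 2, 3, 4, 5, 6, 6]
```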

Step 2: permute(0,2,1) + AvgPool1d

AvgPool1d expects its input in (B, C, L) layout (channels first, time last), so permute first:

(2, 8, 3)  →  permute(0,2,1)  →  (2, 3, 8)

Output-length formula for AvgPool1d(kernel_size=3, stride=1, padding=0):

L_out = floor((L_in - kernel_size) / stride) + 1
      = (8 - 3) / 1 + 1
      = 6

In the toy case this equals the original seq_len=6, which is exactly the purpose of the padding.

Real example: seq_len=96, moving_avg=25

pad_size = (25-1)//2 = 12
After padding: 96 + 12 + 12 = 120
AvgPool1d: (120 - 25)/1 + 1 = 96 ✓
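
The two length calculations above follow one formula, which the sketch below wraps in a helper (the name `pooled_length` is made up for this note). It also shows a case the text does not cover: with an even kernel_size the same padding rule leaves the output one step short, so the "length preserved" property relies on an odd kernel.

```python
def pooled_length(seq_len, kernel_size, stride=1):
    """Length after replicate padding followed by AvgPool1d(padding=0)."""
    pad = (kernel_size - 1) // 2
    padded_len = seq_len + 2 * pad
    return (padded_len - kernel_size) // stride + 1


toy = pooled_length(6, 3)       # padded 8   -> 6
real = pooled_length(96, 25)    # padded 120 -> 96
even = pooled_length(96, 24)    # padded 118 -> 95, one step shorter than seq_len
print(toy, real, even)
```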

What AvgPool1d actually computes (toy):

Take batch=0, feature 0, padded series [1, 1, 2, 3, 4, 5, 6, 6], and average each window of kernel=3:

position 0: (1+1+2)/3 = 4/3  ≈ 1.33
position 1: (1+2+3)/3 = 6/3  = 2.00
position 2: (2+3+4)/3 = 9/3  = 3.00
position 3: (3+4+5)/3 = 12/3 = 4.00
position 4: (4+5+6)/3 = 15/3 = 5.00
position 5: (5+6+6)/3 = 17/3 ≈ 5.67

AvgPool1d output: (2, 3, 6).
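
The six window averages can be reproduced exactly with the standard library. This sketch works on one padded list rather than a tensor and uses `Fraction` so that 4/3 and 17/3 are not rounded; `moving_average` is an illustrative helper, not the `moving_avg` module from the source.

```python
from fractions import Fraction


def moving_average(padded, kernel_size):
    """Average every window of length kernel_size with stride 1 and no extra padding."""
    n_out = len(padded) - kernel_size + 1
    return [
        Fraction(sum(padded[i:i + kernel_size]), kernel_size)
        for i in range(n_out)
    ]


trend = moving_average([1, 1, 2, 3, 4, 5, 6, 6], kernel_size=3)
print([str(v) for v in trend])  # ['4/3', '2', '3', '4', '5', '17/3']
```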

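Step 3: permute(0,2,1) back to (B, seq_len, C)

(2, 3, 6)  →  permute(0,2,1)  →  (2, 6, 3)

This is moving_mean, i.e. trend_init.

Step 4: res = x - moving_mean

seasonal_init = x_enc       - moving_mean
              = (2, 6, 3)   - (2, 6, 3)
              = (2, 6, 3)

Toy batch=0, feature 0 seasonal (original series [1,2,3,4,5,6] minus trend):

t=0: 1 - 1.33 = -0.33
t=1: 2 - 2.00 =  0.00
t=2: 3 - 3.00 =  0.00
t=3: 4 - 4.00 =  0.00
t=4: 5 - 5.00 =  0.00
t=5: 6 - 5.67 = +0.33

Intuition: trend is the smooth background, and seasonal is the residual (the fluctuation) left after removing that background from the original series. For a perfectly linear increasing series, the trend in the interior is the series itself and seasonal is 0 there; only the first and last steps deviate slightly, because of the edge padding.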
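
The whole decomposition can be run on a toy tensor to confirm both the shapes and the numbers. The sketch below inlines the padding and pooling steps instead of importing `series_decomp`, and it assumes every channel of every batch item holds the same series [1, 2, 3, 4, 5, 6], which is a simplification made here for readability.

```python
import torch
import torch.nn as nn

kernel_size = 3
pad = (kernel_size - 1) // 2

# Toy input (2, 6, 3): every (batch, channel) pair carries the series 1..6.
x = torch.arange(1.0, 7.0).view(1, 6, 1).repeat(2, 1, 3)

front = x[:, :1, :].repeat(1, pad, 1)             # repeat first time step
end = x[:, -1:, :].repeat(1, pad, 1)              # repeat last time step
padded = torch.cat([front, x, end], dim=1)        # (2, 8, 3)

pool = nn.AvgPool1d(kernel_size=kernel_size, stride=1, padding=0)
trend = pool(padded.permute(0, 2, 1)).permute(0, 2, 1)   # (2, 6, 3)
seasonal = x - trend                                      # (2, 6, 3)

print(trend[0, :, 0])      # approx. 1.33, 2, 3, 4, 5, 5.67
print(seasonal[0, :, 0])   # approx. -0.33, 0, 0, 0, 0, 0.33
```

By construction seasonal + trend reproduces x, which is the identity the decomposition is built on.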

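5.3 Return order of `series_decomp`

python

return res, moving_mean

Call site:

python

seasonal_init, trend_init = self.decompsition(x)

Note: the first return value is seasonal (the residual) and the second is trend (the moving mean). The order is easy to get backwards.

5.4 Shape summary for the two branches

After series_decomp:
  seasonal_init: (2, 6, 3)   ← (B, seq_len, enc_in), fast variation
  trend_init:    (2, 6, 3)   ← (B, seq_len, enc_in), slow variation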

6. Step 2: `permute(0, 2, 1)` (why the axes are swapped)

python

seasonal_init = seasonal_init.permute(0, 2, 1)
trend_init    = trend_init.permute(0, 2, 1)

permute(0, 2, 1) swaps dim 1 (seq_len) and dim 2 (enc_in):

(2, 6, 3)  →  permute(0, 2, 1)  →  (2, 3, 6)
 B  L  C                             B  C  L

In the toy case seq_len=6 ≠ enc_in=3, so the shape visibly changes from (2,6,3) to (2,3,6).

Why swap?

nn.Linear(in_features, out_features) acts on the last dimension of its input tensor.

What we want is seq_len → pred_len (extrapolation along the time dimension).

So seq_len has to be the last dimension:

Before: (B, seq_len, enc_in), the last dim is enc_in → Linear would map enc_in (wrong)
After:  (B, enc_in, seq_len), the last dim is seq_len → Linear maps seq_len (correct)
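
What "wrong" looks like in practice can be seen by skipping the permute. In this sketch (toy sizes, random data) the un-permuted call fails because the last dimension is 3 while the layer expects 6; PyTorch reports this as a RuntimeError. If enc_in happened to equal seq_len, the call would run without error and silently mix variables instead of time steps, which is why the permute should not be treated as optional.

```python
import torch
import torch.nn as nn

linear = nn.Linear(6, 2)          # expects the last dim to be seq_len = 6
x = torch.randn(2, 6, 3)          # (B, seq_len, enc_in): last dim is enc_in = 3

try:
    linear(x)                     # last dim 3 != in_features 6
    failed = False
except RuntimeError:
    failed = True

ok_shape = tuple(linear(x.permute(0, 2, 1)).shape)   # (2, 3, 2)
print(failed, ok_shape)
```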

7. Step 3: the Linear heads

7.1 Shape and initialization of the Linear

In __init__:

python

self.Linear_Seasonal = nn.Linear(self.seq_len, self.pred_len)
self.Linear_Trend    = nn.Linear(self.seq_len, self.pred_len)

self.Linear_Seasonal.weight = nn.Parameter(
    (1 / self.seq_len) * torch.ones([self.pred_len, self.seq_len])
)
self.Linear_Trend.weight = nn.Parameter(
    (1 / self.seq_len) * torch.ones([self.pred_len, self.seq_len])
)

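Weight shape: (pred_len, seq_len), i.e. (2, 6) in the toy case, or (24, 96) in a real setting with seq_len=96 and pred_len=24.

Initial value: (1/seq_len) * ones, i.e. every weight equals 1/seq_len.

What the initialization means:

Linear(seq_len → pred_len) computes: output = input @ W.T + bias

Initializing W = (1/seq_len) * ones(pred_len, seq_len) means:

output[t] = sum(input * W[t]) + bias[t] = sum(input * (1/seq_len)) + bias[t]
           = mean(input) + bias[t]

So at initialization every prediction step equals the mean of the history, offset by the bias. Only the weight is overwritten in the code above; the bias keeps the random value drawn by nn.Linear's default initialization. This is a reasonable starting point: it favors no particular time step and weights all of the history equally. During training the weights move away from this uniform start.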
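
The effect of this initialization can be verified numerically. The sketch below rebuilds one head with the toy sizes and compares its output against the per-series mean; it keeps the default bias of nn.Linear, so the comparison is against mean + bias rather than the mean alone. The input is random and only serves as a check.

```python
import torch
import torch.nn as nn

seq_len, pred_len = 6, 2
linear = nn.Linear(seq_len, pred_len)                        # bias=True by default
linear.weight = nn.Parameter(torch.ones(pred_len, seq_len) / seq_len)

x = torch.randn(2, 3, seq_len)                               # (B, C, seq_len)
with torch.no_grad():
    out = linear(x)                                          # (2, 3, 2)
    # The mean over time is broadcast to every prediction step, then the bias is added.
    expected = x.mean(dim=-1, keepdim=True) + linear.bias    # (2, 3, 1) + (2,) -> (2, 3, 2)

print(torch.allclose(out, expected, atol=1e-6))
```

Zeroing the bias (for example with `nn.init.zeros_(linear.bias)`) would make the initial output exactly the historical mean; the code shown in this section does not do that.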

7.2 `individual=False` (shared mode, the current default)

python

seasonal_output = self.Linear_Seasonal(seasonal_init)   # (B, C, seq_len) → (B, C, pred_len)
trend_output    = self.Linear_Trend(trend_init)

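Linear acts on the last dimension, and it is applied independently at every position of the leading (B, C) dimensions:

Input seasonal_init: (2, 3, 6)   ← 3 variables, each with a series of length 6
Linear_Seasonal (W: 2×6)         ← one W shared by all variables
Output: (2, 3, 2)                ← 3 variables, each predicting 2 steps

What "shared" means: the history series of features 0, 1 and 2 (each of length 6) are all mapped with the same weight matrix W (shape (2,6)). Linear's batched evaluation applies it independently to every combination in (B, C).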

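That a single shared Linear treats every channel independently can be checked by comparing the batched call with a channel-by-channel loop over the same layer. This is a sketch with toy sizes and random data; the seed is fixed only to make the run repeatable.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
linear = nn.Linear(6, 2)                    # one shared head: seq_len=6 -> pred_len=2
x = torch.randn(2, 3, 6)                    # (B, C, seq_len)

with torch.no_grad():
    batched = linear(x)                     # (2, 3, 2), all channels in one call
    per_channel = torch.stack(
        [linear(x[:, c, :]) for c in range(3)],   # same layer, one channel at a time
        dim=1,
    )                                       # (2, 3, 2)

print(torch.allclose(batched, per_channel, atol=1e-6))
```
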
7.3 `individual=True` (independent mode)

python

for i in range(self.channels):
    seasonal_output[:, i, :] = self.Linear_Seasonal[i](seasonal_init[:, i, :])
    trend_output[:, i, :]    = self.Linear_Trend[i](trend_init[:, i, :])

Every variable i has its own Linear_Seasonal[i] (an independent W_i).

seasonal_init[:, i, :] selects the series of variable i across the whole batch, shape (B, seq_len) → fed into the Linear → (B, pred_len).
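
The per-channel loop can be sketched in isolation. How Linear_Seasonal is constructed in independent mode is not shown in this note, so the `nn.ModuleList` of per-channel heads below is an assumption made for illustration, as is the name `heads`.

```python
import torch
import torch.nn as nn

B, C, seq_len, pred_len = 2, 3, 6, 2

# Assumed construction: one independent Linear per variable channel.
heads = nn.ModuleList([nn.Linear(seq_len, pred_len) for _ in range(C)])

x = torch.randn(B, C, seq_len)                                   # already permuted to (B, C, L)
out = torch.zeros(B, C, pred_len, dtype=x.dtype, device=x.device)

with torch.no_grad():
    for i, head in enumerate(heads):
        # x[:, i, :] is (B, seq_len); the head maps it to (B, pred_len)
        out[:, i, :] = head(x[:, i, :])

print(tuple(out.shape))  # (2, 3, 2)
```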

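Comparison of the two modes:

Item	`individual=False`	`individual=True`
Number of Linear modules	2 (one each for seasonal and trend)	`2 × C` (two per variable)
Parameters across variables	shared	independent
Toy weight count (C=3, bias excluded)	`2×(2×6)=24`	`3×2×(2×6)=72`
Suitable when	variables follow similar patterns	variables differ strongly
Code path	direct `Linear(seasonal_init)`	`for i: Linear[i](...)`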
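
The counts in the table cover weight matrices only. The helper below (a name introduced here, not part of the source) reproduces them and also reports the totals once the pred_len bias values of each nn.Linear are included.

```python
def linear_param_count(seq_len, pred_len, channels, individual, include_bias=False):
    """Parameters in the seasonal + trend heads of the encoder."""
    per_linear = pred_len * seq_len + (pred_len if include_bias else 0)
    n_linears = 2 * (channels if individual else 1)   # seasonal + trend, per channel if individual
    return n_linears * per_linear


shared_weights = linear_param_count(6, 2, 3, individual=False)                   # 24
indiv_weights = linear_param_count(6, 2, 3, individual=True)                     # 72
shared_total = linear_param_count(6, 2, 3, individual=False, include_bias=True)  # 28
indiv_total = linear_param_count(6, 2, 3, individual=True, include_bias=True)    # 84
print(shared_weights, indiv_weights, shared_total, indiv_total)
```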

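7.4 Toy worked calculation (individual=False)

Input (seasonal, batch=0, feature 0):

seasonal_init[0, 0, :] = [-0.33, 0, 0, 0, 0, +0.33]  ← length 6

Linear_Seasonal weights (toy): the initialization value is 1/6 everywhere; the values below are an illustrative post-training example.

To keep the arithmetic easy to follow by hand, assume the trained weights are as follows (the bias is ignored throughout this subsection):

W_S = [ [1/6, 1/6, 1/6, 1/6, 1/6, 1/6],   ← weights for pred step 0 (uniform average)
       [0,   0,   0,   0,   0,   1  ] ]    ← pred step 1 looks only at the last time step

Calculation:

seasonal_output[0, 0, 0] = W_S[0] · seasonal_init[0,0,:]
  = (1/6)*(-0.33) + (1/6)*0 + ... + (1/6)*(0.33)
  = (1/6)*(-0.33 + 0.33) = 0.00

seasonal_output[0, 0, 1] = W_S[1] · seasonal_init[0,0,:]
  = 0*(-0.33) + ... + 1*(0.33) = 0.33

With the initialization weights (all 1/6):

seasonal_output[0, 0, 0] = mean([-0.33,0,0,0,0,0.33]) = 0.00
seasonal_output[0, 0, 1] = 0.00  (same as above)

For trend, the toy trend series of batch=0, feature 0 is:

trend_init[0, 0, :] = [1.33, 2.00, 3.00, 4.00, 5.00, 5.67]

With the initialization weights (all 1/6):

trend_output[0, 0, pred] = mean([1.33, 2.00, 3.00, 4.00, 5.00, 5.67])
  = (1.33+2+3+4+5+5.67)/6 = 21/6 = 3.50

So with the initial weights and the bias ignored, the trend prediction = mean of the historical trend, and the seasonal prediction = mean of the historical seasonal (close to 0).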
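
The hand calculation above can be replayed with exact fractions, which avoids the rounding in 0.33 and 5.67. The sketch ignores the bias, as the subsection does, and uses the hypothetical post-training W_S from the text; `dot` is a helper defined here.

```python
from fractions import Fraction as F

seasonal = [F(-1, 3), 0, 0, 0, 0, F(1, 3)]     # exact form of [-0.33, 0, 0, 0, 0, +0.33]
trend = [F(4, 3), 2, 3, 4, 5, F(17, 3)]        # exact form of [1.33, 2, 3, 4, 5, 5.67]

W_S = [
    [F(1, 6)] * 6,         # pred step 0: uniform average
    [0, 0, 0, 0, 0, 1],    # pred step 1: last time step only
]


def dot(weights, values):
    """Plain dot product of two equal-length sequences."""
    return sum(w * v for w, v in zip(weights, values))


seasonal_out = [dot(row, seasonal) for row in W_S]   # [0, 1/3]
trend_init_out = dot([F(1, 6)] * 6, trend)           # 7/2 = 3.5
print(seasonal_out, trend_init_out)
```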

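8. Step 4: addition + final permute

python

x = seasonal_output + trend_output   # (B, C, pred_len), element-wise addition
return x.permute(0, 2, 1)            # (B, pred_len, C)

Final permute:

(2, 3, 2)  →  permute(0, 2, 1)  →  (2, 2, 3)
 B  C  L                             B  L  C

In the toy case pred_len=2 ≠ enc_in=3, so the shape visibly changes from (2,3,2) to (2,2,3).

Why permute back at the end?

The TFB framework expects the output in (B, time, variables) layout, i.e. with the time axis in the middle. Step 3 temporarily switched to (B, C, L) so that seq_len would be the last dim for the Linear; the final step switches back to (B, L, C).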

9. Full toy tensor evolution

Input: x_enc (2, 6, 3)   ← (B=2, seq_len=6, enc_in=3)

── series_decomp ──────────────────────────────────────────────
pad (1 step per end): (2, 8, 3)
permute(0,2,1):       (2, 3, 8)
AvgPool1d(k=3):       (2, 3, 6)     ← (8-3)/1+1 = 6
permute(0,2,1):       (2, 6, 3)     ← moving_mean = trend_init

seasonal_init = x_enc - trend_init:
  (2,6,3) - (2,6,3) = (2, 6, 3)

── permute (swap axes) ────────────────────────────────────────
seasonal_init.permute(0,2,1): (2, 6, 3) → (2, 3, 6)
trend_init.permute(0,2,1):    (2, 6, 3) → (2, 3, 6)

── Linear extrapolation ───────────────────────────────────────
Linear_Seasonal(2,3,6): (2, 3, 6) → (2, 3, 2)  ← seq_len=6 → pred_len=2
Linear_Trend   (2,3,6): (2, 3, 6) → (2, 3, 2)

── add + permute ──────────────────────────────────────────────
seasonal_output + trend_output: (2, 3, 2)
permute(0,2,1): (2, 3, 2) → (2, 2, 3)

Output: (2, 2, 3)   ← (B=2, pred_len=2, enc_in=3)
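
The diagram can be traced end to end with a compact stand-in module. `ToyDLinearEncoder` below is a hypothetical class written for this note: it covers only the shared (individual=False) path, skips the custom weight initialization, and records the shape after each stage so the trace can be compared with the diagram.

```python
import torch
import torch.nn as nn


class ToyDLinearEncoder(nn.Module):
    """Illustrative re-implementation of the shared-mode encoder chain."""

    def __init__(self, seq_len=6, pred_len=2, kernel_size=3):
        super().__init__()
        self.pad = (kernel_size - 1) // 2
        self.avg = nn.AvgPool1d(kernel_size=kernel_size, stride=1, padding=0)
        self.linear_seasonal = nn.Linear(seq_len, pred_len)
        self.linear_trend = nn.Linear(seq_len, pred_len)

    def forward(self, x):
        shapes = [tuple(x.shape)]                               # input (B, L, C)
        front = x[:, :1, :].repeat(1, self.pad, 1)
        end = x[:, -1:, :].repeat(1, self.pad, 1)
        padded = torch.cat([front, x, end], dim=1)
        shapes.append(tuple(padded.shape))                      # after padding
        trend = self.avg(padded.permute(0, 2, 1)).permute(0, 2, 1)
        seasonal = x - trend
        shapes.append(tuple(trend.shape))                       # trend / seasonal (B, L, C)
        seasonal, trend = seasonal.permute(0, 2, 1), trend.permute(0, 2, 1)
        shapes.append(tuple(trend.shape))                       # (B, C, L)
        out = self.linear_seasonal(seasonal) + self.linear_trend(trend)
        shapes.append(tuple(out.shape))                         # (B, C, pred_len)
        out = out.permute(0, 2, 1)
        shapes.append(tuple(out.shape))                         # (B, pred_len, C)
        return out, shapes


out, shapes = ToyDLinearEncoder()(torch.randn(2, 6, 3))
print(shapes)
```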

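10. Comparison with Informer

Item	DLinear	Informer
Inputs other than `x_enc`	Not used (`x_mark_enc/x_dec/x_mark_dec` are all ignored)	All four inputs are used
Embedding	None	DataEmbedding (value + time + position)
Backbone mechanism	Moving-average decomposition + linear mapping	self-attention + cross-attention
Parameter count (toy)	`2 × (pred_len × seq_len)` = 2×12 = 24 (weights only)	Many Linear projections plus attention
Time complexity	O(seq_len × pred_len) per channel, linear in seq_len	O(L log L) (ProbAttention)
Sequence compression	None	distilling (ConvLayer halves the length)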

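11. What to lock in at this level

series_decomp returns (seasonal, trend): the residual comes first, the mean second.
The padding formula is (kernel_size-1)//2; in the toy case kernel=3, so 1 step is padded at each end.
AvgPool1d formula: L_out = (L_in - kernel) / stride + 1; after padding, L_out == seq_len for an odd kernel (length preserved).
Two permutes: to (B, C, L) before the Linear and back to (B, L, C) after it, because Linear acts on the last dim.
The initial weights (1/seq_len)*ones mean: initial prediction = historical mean (plus the default bias).
individual=False shares one Linear across all variables; True gives every variable its own Linear.

12. Next step

See how the whole chain wraps up:

09-DLinear全览流程图收束 (DLinear full-flow diagram wrap-up)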
