Layer 1: forecast() main chain

1. Position in the parent layer

After Nonstationary_Transformer.forward() checks that task_name == "short_term_forecast", it calls self.forecast(), which is the backbone of the entire prediction path.

2. I/O interface

Parameter	Shape	Meaning
`x_enc`	(2, 12, 5)	encoder input (before normalization)
`x_mark_enc`	(2, 12, 4)	encoder time marks
`x_dec`	(2, 10, 5)	decoder input (only its shape is used, to build a zeros tensor)
`x_mark_dec`	(2, 10, 4)	decoder time marks

Output: dec_out with shape (2, 10, 5); forward() then slices [:, -4:, :] → (2, 4, 5).

3. Sequence diagram

4. Semantic grouping diagram

5. Step-by-step close reading

§5.0 Full original code

python

class Nonstationary_Transformer(nn.Module):
    def __init__(self, configs):
        super(Nonstationary_Transformer, self).__init__()
        self.task_name = configs.task_name
        self.pred_len = configs.pred_len
        self.seq_len = configs.seq_len
        self.label_len = configs.label_len
        self.output_attention = configs.output_attention

        self.enc_embedding = DataEmbedding(
            configs.enc_in, configs.d_model, configs.embed, configs.freq, configs.dropout,
        )
        self.encoder = Encoder(
            [
                EncoderLayer(
                    AttentionLayer(
                        DSAttention(False, configs.factor, attention_dropout=configs.dropout,
                                    output_attention=configs.output_attention),
                        configs.d_model, configs.n_heads,
                    ),
                    configs.d_model, configs.d_ff,
                    dropout=configs.dropout, activation=configs.activation,
                )
                for l in range(configs.e_layers)
            ],
            norm_layer=torch.nn.LayerNorm(configs.d_model),
        )
        self.dec_embedding = DataEmbedding(
            configs.dec_in, configs.d_model, configs.embed, configs.freq, configs.dropout,
        )
        self.decoder = Decoder(
            [
                DecoderLayer(
                    AttentionLayer(
                        DSAttention(True, configs.factor, attention_dropout=configs.dropout,
                                    output_attention=False),
                        configs.d_model, configs.n_heads,
                    ),
                    AttentionLayer(
                        DSAttention(False, configs.factor, attention_dropout=configs.dropout,
                                    output_attention=False),
                        configs.d_model, configs.n_heads,
                    ),
                    configs.d_model, configs.d_ff,
                    dropout=configs.dropout, activation=configs.activation,
                )
                for l in range(configs.d_layers)
            ],
            norm_layer=torch.nn.LayerNorm(configs.d_model),
            projection=nn.Linear(configs.d_model, configs.c_out, bias=True),
        )
        self.tau_learner = Projector(
            enc_in=configs.enc_in, seq_len=configs.seq_len,
            hidden_dims=configs.p_hidden_dims, hidden_layers=configs.p_hidden_layers,
            output_dim=1,
        )
        self.delta_learner = Projector(
            enc_in=configs.enc_in, seq_len=configs.seq_len,
            hidden_dims=configs.p_hidden_dims, hidden_layers=configs.p_hidden_layers,
            output_dim=configs.seq_len,
        )

    def forecast(self, x_enc, x_mark_enc, x_dec, x_mark_dec):
        x_raw = x_enc.clone().detach()

        # Normalization
        mean_enc = x_enc.mean(1, keepdim=True).detach()  # B x 1 x E
        x_enc = x_enc - mean_enc
        std_enc = torch.sqrt(
            torch.var(x_enc, dim=1, keepdim=True, unbiased=False) + 1e-5
        ).detach()  # B x 1 x E
        x_enc = x_enc / std_enc

        tau = self.tau_learner(x_raw, std_enc).exp()
        delta = self.delta_learner(x_raw, mean_enc)

        x_dec_new = (
            torch.cat(
                [
                    x_enc[:, -self.label_len :, :],
                    torch.zeros_like(x_dec[:, -self.pred_len :, :]),
                ],
                dim=1,
            )
            .to(x_enc.device)
            .clone()
        )

        enc_out = self.enc_embedding(x_enc, x_mark_enc)
        enc_out, attns = self.encoder(enc_out, attn_mask=None, tau=tau, delta=delta)

        dec_out = self.dec_embedding(x_dec_new, x_mark_dec)
        dec_out = self.decoder(
            dec_out, enc_out, x_mask=None, cross_mask=None, tau=tau, delta=delta
        )
        dec_out = dec_out * std_enc + mean_enc
        return dec_out

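§5.1 Big-picture logic

Core design intent: normalization is required for training stability, but the non-stationary statistics must not be thrown away in the process. The solution is "archive first, inject later": clone the raw signal before normalizing, then use the Projector to turn the mean and standard deviation into modulation signals for the attention computation, so that attention computed on the normalized series can still "sense" the original distribution.

A small example (B=1, seq_len=4, enc_in=2) for intuition:

Raw: x = [[1, 10], [2, 20], [3, 30], [4, 40]]
mean = [[2.5, 25]]    std = [[1.12, 11.2]]
x_norm = [[-1.34, -1.34], [-0.45, -0.45], [0.45, 0.45], [1.34, 1.34]]

From the raw mean=[2.5, 25], the Projector might learn a δ along these lines:
  "at this mean level, key=2 (a large-value position) should get more attention"
  → δ adds a positive bias at position S=2

Intuition: when the mean is large, the model should pay more attention to the high-valued time steps near the end (an upward-trend signal).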
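The normalization numbers in this toy example can be reproduced in a few lines. This is a plain-Python sketch, not the model's tensor code: `instance_normalize` is a hypothetical helper that mirrors the mean, biased-variance and 1e-5 recipe of forecast() for one variable at a time.

```python
import math


def instance_normalize(series, eps=1e-5):
    """Normalize one variable's series: subtract the mean, divide by the biased std."""
    n = len(series)
    mean = sum(series) / n
    centered = [v - mean for v in series]
    var = sum(c * c for c in centered) / n  # divide by N, like unbiased=False
    std = math.sqrt(var + eps)
    return [c / std for c in centered], mean, std


# Toy input from the text: 4 time steps, 2 variables.
x = [[1, 10], [2, 20], [3, 30], [4, 40]]

# zip(*x) regroups the values per variable (one column each).
results = [instance_normalize(list(col)) for col in zip(*x)]

for normed, mean, std in results:
    print(round(mean, 2), round(std, 2), [round(v, 2) for v in normed])
```

Both variables end up with the same normalized values because the second column is the first scaled by 10; the scale difference survives only in mean and std, which is the information the Projector is later given.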

Full chain of shape changes:

(2,12,5) → [clone] x_raw (2,12,5) → [mean] (2,1,5) → [std] (2,1,5) → [norm] (2,12,5) → [tau_learner] (2,1) → [delta_learner] (2,12) → [Encoder] (2,12,8) → [Decoder] (2,10,5) → [denorm] (2,10,5)
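As a bookkeeping aid, the same chain can be written as a small function of the configuration. `forecast_shapes` and its argument names are hypothetical; the values label_len=6, pred_len=4 and d_model=8 are the ones implied by the shapes used throughout this walkthrough.

```python
def forecast_shapes(B, seq_len, label_len, pred_len, enc_in, c_out, d_model):
    """Return the shape of each intermediate tensor in forecast() as a dict."""
    dec_len = label_len + pred_len
    return {
        "x_raw": (B, seq_len, enc_in),
        "mean_enc": (B, 1, enc_in),
        "std_enc": (B, 1, enc_in),
        "x_enc_norm": (B, seq_len, enc_in),
        "tau": (B, 1),
        "delta": (B, seq_len),
        "x_dec_new": (B, dec_len, enc_in),
        "enc_out": (B, seq_len, d_model),
        "dec_out": (B, dec_len, c_out),
        # forward() keeps only the last pred_len steps of dec_out.
        "forward_output": (B, pred_len, c_out),
    }


shapes = forecast_shapes(
    B=2, seq_len=12, label_len=6, pred_len=4, enc_in=5, c_out=5, d_model=8
)
for name, shape in shapes.items():
    print(f"{name:15s} {shape}")
```

Note that the de-normalization step multiplies dec_out by std_enc, so the sketch only makes sense when c_out equals enc_in, as it does here.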

§5.2 Step ①: x_raw backup

python

x_raw = x_enc.clone().detach()

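x_enc has shape (2, 12, 5).

.clone() creates an independent copy of the data (separate memory); .detach() cuts the tensor off from the autograd graph.

Both calls matter. .clone() guarantees that any later in-place modification of x_enc cannot affect x_raw; in this function x_enc is rebound (x_enc = x_enc - mean_enc) rather than modified in place, so the copy is a defensive measure. .detach() guarantees that the gradients used to learn tau/delta in the Projector do not flow back into the raw values of x_enc, which prevents the model from using the Projector path to "optimize away" the effect of normalization.

Toy value: x_raw[0, :, 0] holds the 12 pre-normalization time-step values for batch=0, variable=0, and stays unchanged.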

§5.3 Step ②: Instance Normalization

python

mean_enc = x_enc.mean(1, keepdim=True).detach()  # B x 1 x E
x_enc = x_enc - mean_enc
std_enc = torch.sqrt(
    torch.var(x_enc, dim=1, keepdim=True, unbiased=False) + 1e-5
).detach()  # B x 1 x E
x_enc = x_enc / std_enc

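x_enc.mean(1, keepdim=True) averages over the time axis (dim=1), taking the shape from (2, 12, 5) to (2, 1, 5).

.detach() cuts the gradient through the mean, so the normalization statistics act as constants during backpropagation and are never adjusted by gradient updates. This is loosely analogous to BatchNorm's running statistics, which are stored as buffers rather than trained by gradients; here the idea is applied per instance.

x_enc - mean_enc: (2,12,5) - (2,1,5) broadcasts to (2,12,5); each time step has the series mean of its batch item and variable subtracted.

torch.var(x_enc, dim=1, keepdim=True, unbiased=False) computes the biased variance (dividing by N rather than N-1) on the already-centered x_enc, with shape (2, 1, 5). Adding 1e-5 guards against division by zero.

Toy values (assuming the 12 time steps of batch=0, variable=0 have mean ≈ 5.0 and standard deviation ≈ 3.0): mean_enc[0,0,0] ≈ 5.0, std_enc[0,0,0] ≈ 3.0, and the normalized x_enc[0,:,0] ranges over roughly [-1.5, 1.5].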
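Two details of this step, the biased variance and the 1e-5 term, are easy to check with the standard library. The sketch below uses `statistics.pvariance` as a stand-in for the divide-by-N variance; the sample series are made up for illustration.

```python
import math
import statistics

series = [1.0, 2.0, 3.0, 4.0]
biased = statistics.pvariance(series)    # divides by N, like unbiased=False
unbiased = statistics.variance(series)   # divides by N - 1

# A constant series has zero variance; without eps the division would fail.
flat = [7.0, 7.0, 7.0, 7.0]
flat_mean = statistics.mean(flat)
std_flat = math.sqrt(statistics.pvariance(flat) + 1e-5)
normed_flat = [(v - flat_mean) / std_flat for v in flat]

print(biased, unbiased)
print(std_flat, normed_flat)
```

With the epsilon in place a flat input simply normalizes to all zeros instead of raising a division error or producing NaN.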

§5.4 Steps ③④: Projector calls (τ and δ)

python

tau = self.tau_learner(x_raw, std_enc).exp()
delta = self.delta_learner(x_raw, mean_enc)

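tau_learner takes x_raw (2,12,5) and std_enc (2,1,5) and outputs (2, 1); .exp() makes it strictly positive, giving τ with shape (2, 1).

delta_learner takes x_raw (2,12,5) and mean_enc (2,1,5) and outputs δ with shape (2, 12).

tau is driven by std_enc (the fluctuation amplitude): depending on how strongly the series fluctuates, it learns to adjust how concentrated the attention is.
delta is driven by mean_enc (the trend center): when the level is high, delta adjusts how much attention each key position receives.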
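To make the roles of τ and δ concrete, here is a minimal plain-Python sketch of the "score × τ + δ" form for a single query over four keys. It is a simplified view, not the DSAttention implementation: details such as scaling, masking and dropout are left out, and the score values are invented.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def destationary_weights(scores, tau, delta):
    """Attention weights for one query: softmax(scores * tau + delta)."""
    return softmax([s * tau + d for s, d in zip(scores, delta)])


scores = [0.2, 0.1, 0.4, 0.3]  # hypothetical raw query-key scores

plain = destationary_weights(scores, tau=1.0, delta=[0.0] * 4)
sharp = destationary_weights(scores, tau=5.0, delta=[0.0] * 4)
shifted = destationary_weights(scores, tau=1.0, delta=[0.0, 0.0, 0.0, 1.0])

# tau comes out of .exp(), so it is positive even for a negative raw output.
tau_from_raw = math.exp(-3.0)

print([round(w, 3) for w in plain])
print([round(w, 3) for w in sharp])
print([round(w, 3) for w in shifted])
```

A larger τ concentrates the weights on the highest-scoring key, while δ moves weight toward specific key positions regardless of the scores.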

For the Projector's internals, see [[03A-Layer2A-Projector]].

§5.5 Step ⑤: building x_dec_new

python

x_dec_new = (
    torch.cat([
        x_enc[:, -self.label_len:, :],
        torch.zeros_like(x_dec[:, -self.pred_len:, :]),
    ], dim=1)
    .to(x_enc.device)
    .clone()
)

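Note: the x_enc used here is the normalized one (from step ②).

x_enc[:, -6:, :] takes the last 6 steps (label_len=6) of the normalized x_enc, shape (2, 6, 5).

torch.zeros_like(x_dec[:, -4:, :]) borrows the shape of the last 4 steps of x_dec (pred_len=4) to create a zero tensor, shape (2, 4, 5).

torch.cat([(2,6,5), (2,4,5)], dim=1) → x_dec_new with shape (2, 10, 5).

Why use the tail of the normalized x_enc?

The decoder's history prefix must live in the same normalized space as the encoder output. The encoder sees the normalized series, so the decoder should also start from a history segment in that space; otherwise the Q and K of cross-attention would sit in different numeric ranges.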
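The construction can be sketched without tensors for a single batch item. `build_decoder_input` is a hypothetical helper, and the input uses 2 variables instead of 5 to keep the printout short.

```python
def build_decoder_input(x_enc_norm, label_len, pred_len):
    """Concatenate the last label_len normalized steps with pred_len zero steps.

    x_enc_norm is a list of time steps, each a list of per-variable values.
    """
    n_vars = len(x_enc_norm[0])
    history = [list(step) for step in x_enc_norm[-label_len:]]  # copy, like .clone()
    placeholder = [[0.0] * n_vars for _ in range(pred_len)]     # like zeros_like
    return history + placeholder


# Hypothetical normalized encoder input: 12 time steps, 2 variables.
x_enc_norm = [[float(t), float(-t)] for t in range(12)]
x_dec_new = build_decoder_input(x_enc_norm, label_len=6, pred_len=4)

for step in x_dec_new:
    print(step)
```

The first 6 rows repeat the encoder's final steps and the last 4 rows are zero placeholders that the decoder fills in with predictions.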

§5.6 Step ⑥: Encoder

python

enc_out = self.enc_embedding(x_enc, x_mark_enc)
enc_out, attns = self.encoder(enc_out, attn_mask=None, tau=tau, delta=delta)

enc_embedding: DataEmbedding maps (2,12,5) plus the time marks (2,12,4) to (2,12,8). It consists of TokenEmbedding (Conv1d) + TemporalEmbedding + PositionalEmbedding, summed together.

self.encoder: Encoder.forward() has no conv_layers here, so it takes the else branch and passes tau/delta to every EncoderLayer:

python

for attn_layer in self.attn_layers:
    x, attn = attn_layer(x, attn_mask=attn_mask, tau=tau, delta=delta)

Inside each EncoderLayer, DSAttention receives tau/delta and uses them when computing the scores.

Input (2,12,8) → e_layers=2 rounds of DSAttention + FFN → output (2,12,8) (shape unchanged, no distilling).
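The control flow of that loop can be exercised with stand-in layers. In the sketch below, `StubEncoderLayer` and `run_encoder` are hypothetical names; the stub does no attention at all and only records which tau/delta it was handed, so the example shows the plumbing rather than the real Encoder.

```python
class StubEncoderLayer:
    """Stand-in for EncoderLayer: passes x through and records tau/delta."""

    def __init__(self):
        self.seen = None

    def __call__(self, x, attn_mask=None, tau=None, delta=None):
        self.seen = (tau, delta)
        return x, None  # shape-preserving, no attention map


def run_encoder(layers, x, attn_mask=None, tau=None, delta=None):
    """Simplified else branch: the same tau/delta go to every layer."""
    attns = []
    for attn_layer in layers:
        x, attn = attn_layer(x, attn_mask=attn_mask, tau=tau, delta=delta)
        attns.append(attn)
    return x, attns


layers = [StubEncoderLayer(), StubEncoderLayer()]  # e_layers=2
x = [[0.0] * 8 for _ in range(12)]                 # one batch item, (12, 8)
tau, delta = 1.3, [0.0] * 12                       # made-up values

out, attns = run_encoder(layers, x, tau=tau, delta=delta)
print(len(out), len(out[0]), [layer.seen == (tau, delta) for layer in layers])
```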

§5.7 Step ⑦: Decoder

python

dec_out = self.dec_embedding(x_dec_new, x_mark_dec)
dec_out = self.decoder(dec_out, enc_out, x_mask=None, cross_mask=None, tau=tau, delta=delta)

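dec_embedding: (2,10,5) + (2,10,4) → (2,10,8), the same DataEmbedding structure as on the encoder side.

self.decoder: d_layers=1 DecoderLayer, each containing:

self-attention: DSAttention(mask_flag=True), called with delta=None: uses tau only, not delta
cross-attention: DSAttention(mask_flag=False), called with delta=delta: uses both tau and delta

Why doesn't self-attention use delta?

delta has shape (2, 12), where 12 = seq_len (the encoder input length). In self-attention, Q and K both come from the decoder, whose sequence length is dec_len=10 ≠ 12. Broadcasting delta (B,1,1,12) against the last dimension of scores (B,4,10,10) would fail because the sizes do not match (12 ≠ 10). So delta can only be used in cross-attention, where K/V come from the encoder and seq_len=12 matches the S dimension.

Verification in the code (Transformer_EncDec.py):

python
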
x = x + self.dropout(
    self.self_attention(x, x, x, attn_mask=x_mask, tau=tau, delta=None)[0]
)
x = x + self.dropout(
    self.cross_attention(x, cross, cross, attn_mask=cross_mask, tau=tau, delta=delta)[0]
)
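The shape argument can be checked mechanically. The sketch below implements the usual right-aligned broadcasting rule in plain Python; `broadcast_shape` is a hypothetical helper, H=4 is the head count implied by the (B,4,10,10) scores above, and the (B,1,1,1) shape for tau is an assumption made by analogy with the delta shape.

```python
def broadcast_shape(a, b):
    """Right-aligned broadcasting of two shapes; raises ValueError on mismatch."""
    result = []
    for i in range(1, max(len(a), len(b)) + 1):
        da = a[-i] if i <= len(a) else 1
        db = b[-i] if i <= len(b) else 1
        if da != db and da != 1 and db != 1:
            raise ValueError(f"dim -{i}: {da} vs {db}")
        result.append(max(da, db))
    return tuple(reversed(result))


B, H, L_dec, S_enc = 2, 4, 10, 12
delta_shape = (B, 1, 1, S_enc)
tau_shape = (B, 1, 1, 1)              # assumed reshaping of tau (B, 1)
cross_scores = (B, H, L_dec, S_enc)   # queries from decoder, keys from encoder
self_scores = (B, H, L_dec, L_dec)    # queries and keys both from decoder

cross_ok = broadcast_shape(cross_scores, delta_shape)
tau_ok = broadcast_shape(self_scores, tau_shape)

try:
    broadcast_shape(self_scores, delta_shape)
    self_ok = True
except ValueError as err:
    self_ok = False
    print("self-attention + delta fails:", err)
```

tau broadcasts against either score tensor because all of its trailing dimensions are 1, which is why it can be passed to both attention calls while delta cannot.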

The Decoder ends with LayerNorm + the projection Linear(8→5), producing output of shape (2, 10, 5).

§5.8 Step ⑧: De-normalization

python

dec_out = dec_out * std_enc + mean_enc

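dec_out has shape (2, 10, 5); std_enc and mean_enc both have shape (2, 1, 5).

Broadcasting restores the model's normalized-space predictions to the original value range.

Toy value: if std_enc[0,0,0] ≈ 3.0 and mean_enc[0,0,0] ≈ 5.0, a predicted value of 0.8 in dec_out is restored to 0.8 × 3.0 + 5.0 = 7.4.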
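The same arithmetic, including the reuse of one mean and one std per variable across all time steps, can be written as a short plain-Python sketch. `denormalize` is a hypothetical helper working on one batch item; the second variable's statistics (25.0 and 11.2) are borrowed from the toy example in §5.1, and the prediction values are made up.

```python
def denormalize(dec_out, mean_enc, std_enc):
    """Map normalized predictions back to the original scale.

    dec_out: list of time steps, each a list of per-variable values.
    mean_enc, std_enc: one value per variable, reused for every time step,
    which is what broadcasting (B, 1, E) against (B, L, E) does.
    """
    return [
        [v * s + m for v, m, s in zip(step, mean_enc, std_enc)]
        for step in dec_out
    ]


mean_enc = [5.0, 25.0]
std_enc = [3.0, 11.2]
dec_out = [[0.8, -1.0], [0.0, 0.5]]   # hypothetical normalized predictions

restored = denormalize(dec_out, mean_enc, std_enc)
print(restored)
```

A normalized prediction of 0.0 maps back to the series mean, so the model only has to predict deviations from the input window's level.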

forecast() returns dec_out with shape (2, 10, 5); forward() slices [:, -4:, :] to obtain the final output (2, 4, 5).

6. Drill-down subcomponents

Subcomponent	Responsibility	Document
`Projector`	Conv1d aggregation + MLP → tau/delta	[[03A-Layer2A-Projector]]
`DSAttention`	Full implementation of score × τ + δ	[[03B-Layer2B-DSAttention]]
