Informer Overview

1. Problem and Motivation

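A standard Transformer (attention + Encoder-Decoder) hits two bottlenecks in long-sequence forecasting:

① Full attention costs O(L²): for a sequence of length 512 the attention matrix has 512×512 = 262,144 entries, so both memory and compute grow quadratically with L.

② The decoder is autoregressive: it predicts one time step per pass, so predicting 96 steps takes 96 forward passes and inference is slow.

Informer's three core innovations: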
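The two costs above can be made concrete with a few lines of arithmetic. This is only an illustrative sketch: the helper names are hypothetical, and the 4 bytes per entry assumes float32 scores for a single head and a single sample.

```python
def attention_entries(seq_len: int) -> int:
    """Number of scores in one L x L full-attention matrix."""
    return seq_len * seq_len


def attention_bytes(seq_len: int, bytes_per_entry: int = 4) -> int:
    """Memory for one attention matrix (assumption: float32, one head, one sample)."""
    return attention_entries(seq_len) * bytes_per_entry


def autoregressive_passes(pred_len: int) -> int:
    """A step-by-step decoder runs one forward pass per predicted step."""
    return pred_len


for L in (512, 1024):
    print(L, attention_entries(L), attention_bytes(L))
# 512 262144 1048576
# 1024 1048576 4194304
print(autoregressive_passes(96))  # 96
```

Doubling the length from 512 to 1024 quadruples the number of entries, which is the quadratic growth described in ①.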

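Innovation 1: ProbSparse Self-Attention (sparse attention)

Standard attention: every query takes a dot product with every key, O(L²).

The paper's observation: most queries have a very "flat" attention distribution (close to uniform). These queries do not really attend to any particular key and contribute little to the output. Only a few queries are "active" (sharply peaked distribution).

What ProbSparse does: use the sparsity measurement M to select the Top-u active queries and compute full attention only for them; every remaining "lazy" query simply outputs the mean of V.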

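Full attention (8 queries):
Q₀ ──→ K₀ K₁ K₂ K₃ K₄ K₅ K₆ K₇   ← every row is computed
Q₁ ──→ K₀ K₁ K₂ K₃ K₄ K₅ K₆ K₇
...
Q₇ ──→ K₀ K₁ K₂ K₃ K₄ K₅ K₆ K₇   (8×8=64 dot products)

ProbSparse (factor=1, seq=8, u=3):
   First score each query against 3 randomly sampled keys to get its sparsity score M
   The 3 queries with the largest M → full attention over all K
   The remaining 5 queries → output is simply mean(V)
   (3×8=24 full dot products, 62.5% fewer than 64; the 8×3=24 sampled dot products used to score M are not counted)

Complexity: O(L log L)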
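The selection procedure sketched in the diagram can be written out for a single head in plain Python. This is a minimal sketch under stated assumptions, not the `ProbAttention` implementation: the function name `prob_sparse_attention` is hypothetical, keys are sampled without replacement, M is taken as max minus mean of the sampled scores, and there is no masking or batching.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def prob_sparse_attention(Q, K, V, factor=1, seed=0):
    """Single-head sketch: Q, K, V are lists of equal-length vectors."""
    rng = random.Random(seed)
    L_q, L_k, d = len(Q), len(K), len(Q[0])
    n_sample = max(1, min(L_k, factor * math.ceil(math.log(L_k))))
    u = max(1, min(L_q, factor * math.ceil(math.log(L_q))))
    scale = 1.0 / math.sqrt(d)

    # Step 1: score each query on a few sampled keys only.
    # M = max - mean of the sampled scores; a large M means a peaked query.
    sparsity = []
    for q in Q:
        sampled = rng.sample(range(L_k), n_sample)
        scores = [dot(q, K[j]) * scale for j in sampled]
        sparsity.append(max(scores) - sum(scores) / len(scores))

    # Step 2: keep the u queries with the largest M.
    active = sorted(range(L_q), key=lambda i: sparsity[i], reverse=True)[:u]

    # Step 3: lazy queries output mean(V); active queries get full attention.
    mean_v = [sum(col) / L_k for col in zip(*V)]
    out = [list(mean_v) for _ in range(L_q)]
    for i in active:
        weights = softmax([dot(Q[i], k) * scale for k in K])
        out[i] = [sum(w * v[c] for w, v in zip(weights, V))
                  for c in range(len(V[0]))]
    return out, sorted(active)


data_rng = random.Random(42)
L, d = 8, 2
Q = [[data_rng.gauss(0, 1) for _ in range(d)] for _ in range(L)]
K = [[data_rng.gauss(0, 1) for _ in range(d)] for _ in range(L)]
V = [[data_rng.gauss(0, 1) for _ in range(d)] for _ in range(L)]

out, active = prob_sparse_attention(Q, K, V)
print(len(active), len(out), len(out[0]))  # 3 8 2
```

With L=8 and factor=1, ceil(ln 8) = 3, so three queries get full attention and the other five rows of `out` are identical copies of mean(V).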

Innovation 2: Distilling (Encoder distilling)

With a multi-layer Encoder, a ConvLayer (Conv1d + MaxPool) is inserted between consecutive layers and roughly halves the sequence length:

EncoderLayer 0:  seq_len=10 → (3, 10, d_model)
     ↓ ConvLayer: Conv1d(k=3,p=2,circular) → 12; MaxPool(k=3,s=2,p=1) → 6
     ↓ L_conv  = 10 + 2×2 - 3 + 1 = 12
     ↓ L_pool  = floor((12 + 2×1 - 3)/2 + 1) = floor(11/2 + 1) = 6
EncoderLayer 1:  seq_len=6  → (3,  6, d_model)
     ↓ last EncoderLayer, no ConvLayer

Benefit: the attention matrix shrinks layer by layer (10×10 → 6×6).
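The length arithmetic above generalizes to any input length and layer count. The sketch below just applies the standard Conv1d / MaxPool1d output-length formula with the kernel, padding and stride values from the diagram; the helper names are hypothetical.

```python
def conv1d_out_len(length: int, kernel: int = 3, padding: int = 2, stride: int = 1) -> int:
    """Output length of the Conv1d in the diagram (k=3, p=2, s=1)."""
    return (length + 2 * padding - kernel) // stride + 1


def maxpool1d_out_len(length: int, kernel: int = 3, padding: int = 1, stride: int = 2) -> int:
    """Output length of the MaxPool in the diagram (k=3, s=2, p=1)."""
    return (length + 2 * padding - kernel) // stride + 1


def distilled_lengths(seq_len: int, e_layers: int) -> list:
    """Sequence length seen by each EncoderLayer; no ConvLayer after the last one."""
    lengths = [seq_len]
    for _ in range(e_layers - 1):
        lengths.append(maxpool1d_out_len(conv1d_out_len(lengths[-1])))
    return lengths


print(conv1d_out_len(10))         # 12
print(maxpool1d_out_len(12))      # 6
print(distilled_lengths(10, 2))   # [10, 6]
print(distilled_lengths(96, 3))   # [96, 49, 26]
```

Because the convolution pads by 2 on each side, each ConvLayer maps L to floor((L + 1) / 2) + 1, slightly more than half.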

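Innovation 3: Generative Decoder (parallel decoding)

Instead of predicting step by step (autoregressively), the decoder receives "history tail + zero placeholders" concatenated into a single input:

Decoder input = [label_len=5 history tokens] + [pred_len=7 zero tokens]
              = (3, 12, enc_in=6)

The decoder uses masked self-attention (each token can only see earlier tokens)
The final output is dec_out[:, -pred_len:, :] = (3, 7, 6)

A single forward pass yields all 7 predicted steps, so the decoder runs once instead of pred_len times.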
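The input construction and the final slice can be checked on nested lists with the toy sizes used here. This is a shape-only sketch: `build_decoder_input` and `shape` are hypothetical helpers, and the decoder itself is replaced by an identity stand-in.

```python
def build_decoder_input(x_enc, label_len, pred_len):
    """x_enc has shape (B, seq_len, C); result has shape (B, label_len + pred_len, C)."""
    x_dec = []
    for series in x_enc:
        n_vars = len(series[0])
        history_tail = [list(step) for step in series[-label_len:]]
        placeholders = [[0.0] * n_vars for _ in range(pred_len)]
        x_dec.append(history_tail + placeholders)
    return x_dec


def shape(x):
    return (len(x), len(x[0]), len(x[0][0]))


B, seq_len, label_len, pred_len, enc_in = 3, 10, 5, 7, 6
# Dummy encoder input: every variable at time step t holds the value t.
x_enc = [[[float(t)] * enc_in for t in range(seq_len)] for _ in range(B)]

x_dec = build_decoder_input(x_enc, label_len, pred_len)
print(shape(x_dec))                        # (3, 12, 6)
print([step[0] for step in x_dec[0]])
# [5.0, 6.0, 7.0, 8.0, 9.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

dec_out = x_dec  # stand-in: a real decoder returns the same shape
pred = [series[-pred_len:] for series in dec_out]
print(shape(pred))                         # (3, 7, 6)
```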

2. Paper Architecture Diagram (conceptual level)

3. TFB Call Chain

4. Document BFS Index Tree

5. Paper Component → Code Mapping

Paper component	Code class/function	Close-reading doc
ProbSparse Self-Attention	`ProbAttention` (SelfAttention_Family.py:95)	04A-Layer5-ProbAttention
Encoder Distilling	`ConvLayer` (Transformer_EncDec.py:6) + distil branch of `Encoder.forward`	03B-Layer2B-Encoder
Generative Decoder	`Decoder` + `DecoderLayer` (Transformer_EncDec.py:85,126)	03C-Layer2C-Decoder
Data Embedding	`DataEmbedding` (Embed.py:118)	03A-Layer2A-DataEmbedding
Encoder-Decoder Attention	`AttentionLayer(ProbAttention(...))` in DecoderLayer	03C-Layer2C-Decoder §5.3
TFB integration	`TransformerAdapter` + `Informer.__init__`	01-Layer0-接入界面

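6. Global Toy Parameters

Parameter	Value	Description
B	3	batch size
seq_len	10	length of the encoder's history input
label_len	5	history part of the decoder input (taken from the tail of x_enc)
pred_len	7	number of steps to predict
dec_input_len	12	= label_len + pred_len, total decoder input length
enc_in / dec_in / c_out	6	number of variables
d_model	8	embedding dimension
n_heads	4	number of attention heads
d_keys = d_values	2	= d_model // n_heads = 8 // 4
d_ff	24	FFN hidden dimension
e_layers	2	number of encoder layers (used together with distil)
d_layers	1	number of decoder layers
factor	1	ProbSparse sparsity factor
distil	True	enables Encoder distilling
embed	"timeF"	time embedding type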

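Derived dimensions (ideally distinct from the main toy values, so shapes can be told apart at a glance):

Distilled seq: Conv1d(k=3,p=2)→12; MaxPool(k=3,s=2,p=1)→floor((12+2-3)/2+1) = 6 ← note: equals enc_in / dec_in / c_out = 6, and the intermediate 12 equals dec_input_len; tell them apart by axis position
ProbSparse u (L=10): factor × ceil(ln(10)) = 1 × ceil(2.30) = 3 ← note: equals B = 3
ProbSparse u (L=6): factor × ceil(ln(6)) = 1 × ceil(1.79) = 2 = d_keys ← overlaps with d_keys=2; keep the two meanings apart when reading the code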
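The derived values, and which toy sizes they happen to coincide with, can be recomputed mechanically. This is an illustrative sketch: the `toy` dictionary lists only the shape-related sizes from the parameter table, and `prob_sparse_u` and `collisions` are hypothetical helper names.

```python
import math

# Shape-related toy sizes copied from the parameter table.
toy = {
    "B": 3, "seq_len": 10, "label_len": 5, "pred_len": 7,
    "dec_input_len": 12, "enc_in": 6, "d_model": 8,
    "n_heads": 4, "d_keys": 2, "d_ff": 24,
}


def prob_sparse_u(length: int, factor: int = 1) -> int:
    """u = factor * ceil(ln(L)), the number of active queries."""
    return factor * math.ceil(math.log(length))


def collisions(derived: dict, base: dict) -> dict:
    """For each derived value, list the base parameters with the same number."""
    return {
        name: sorted(key for key, v in base.items() if v == value)
        for name, value in derived.items()
    }


derived = {
    "distilled_seq": 6,              # from the Conv1d + MaxPool arithmetic
    "u_L10": prob_sparse_u(10),      # 3
    "u_L6": prob_sparse_u(6),        # 2
}

print(derived)
# {'distilled_seq': 6, 'u_L10': 3, 'u_L6': 2}
print(collisions(derived, toy))
# {'distilled_seq': ['enc_in'], 'u_L10': ['B'], 'u_L6': ['d_keys']}
```

A check like this is a quick way to spot which tensor axes share a size in the toy setup and therefore need to be identified by position rather than by value.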

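7. Recommended Reading Paths

Quick intuition (15 minutes):

This document §1 (paper motivation + 3 ASCII diagrams)
03B-Layer2B-Encoder §5.1 Big-picture logic (what distilling is)
04A-Layer5-ProbAttention §5.1 Big-picture logic (what sparse selection is)

Full code close-reading:

This document → 01 → 02 → 03A → 03B → 03B1 → 04 → 04A → 03C → 05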
