DUET · Layer 2C: Channel_transformer (masked variable-wise Transformer)
§1 Position in the parent layer
This is the second step of the channel path in DUETModel.forward(). The first step is [[03B-Layer2B-MahalanobisMask]], which generates the mask:

```python
channel_group_feature, attention = self.Channel_transformer(
    x=temporal_feature,      # (3, 7, 8)
    attn_mask=channel_mask   # (3, 1, 7, 7)
)
```

self.Channel_transformer is an Encoder instance that contains e_layers=2 EncoderLayer modules, followed by a final LayerNorm(d_model=8).
§2 I/O interface definition

```python
Encoder.forward(x, attn_mask) -> (Tensor, list)
```

| Parameter | shape | Meaning |
|---|---|---|
| x | (B, N, d_model) = (3, 7, 8) | Temporal feature vector of each variable (MoE output) |
| attn_mask | (B, 1, N, N) = (3, 1, 7, 7) | Mahalanobis 0/1 channel mask |
| returned x | (3, 7, 8) | Features after cross-variable attention |
| returned attns | list[2] | Per-layer attention weights (each entry is None when output_attention=0) |
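Token semantics: variables, not time steps
This Encoder has exactly the same code structure as the iTransformer Encoder. The key difference lies in what each token means:
- iTransformer: each token = the full history of one variable (a length-L series embedded into d_model dimensions)
- DUET Channel_transformer: each token = the per-variable feature produced by the MoE path (a d_model-dimensional vector)
In both cases the attention matrix is $N \times N$, but DUET additionally applies the Mahalanobis mask, which effectively prevents weakly correlated variable pairs from exchanging information.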
§3 Sequence diagram (concrete layer)
§4 Semantic grouping diagram (index layer)
§5 Step-by-step close reading
§5.0 Full original code
```python
# Imports the listing relies on (not shown in the original excerpt)
import math
from math import sqrt

import torch
import torch.nn as nn
import torch.nn.functional as F


class Encoder(nn.Module):
    def __init__(self, attn_layers, conv_layers=None, norm_layer=None):
        super(Encoder, self).__init__()
        self.attn_layers = nn.ModuleList(attn_layers)
        self.conv_layers = (
            nn.ModuleList(conv_layers) if conv_layers is not None else None
        )
        self.norm = norm_layer

    def forward(self, x, attn_mask=None, tau=None, delta=None):
        attns = []
        if self.conv_layers is not None:
            # Distilling path (Informer style): attention layer, then conv layer
            for i, (attn_layer, conv_layer) in enumerate(
                zip(self.attn_layers, self.conv_layers)
            ):
                delta = delta if i == 0 else None
                x, attn = attn_layer(x, attn_mask=attn_mask, tau=tau, delta=delta)
                x = conv_layer(x)
                attns.append(attn)
            x, attn = self.attn_layers[-1](x, tau=tau, delta=None)
            attns.append(attn)
        else:
            # Plain path (used by DUET): loop over the attention layers
            for attn_layer in self.attn_layers:
                x, attn = attn_layer(x, attn_mask=attn_mask, tau=tau, delta=delta)
                attns.append(attn)
        if self.norm is not None:
            x = self.norm(x)
        return x, attns


class EncoderLayer(nn.Module):
    def __init__(self, attention, d_model, d_ff=None, dropout=0.1, activation="relu"):
        super(EncoderLayer, self).__init__()
        d_ff = d_ff or 4 * d_model
        self.attention = attention
        self.conv1 = nn.Conv1d(in_channels=d_model, out_channels=d_ff, kernel_size=1)
        self.conv2 = nn.Conv1d(in_channels=d_ff, out_channels=d_model, kernel_size=1)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        self.activation = F.relu if activation == "relu" else F.gelu

    def forward(self, x, attn_mask=None, tau=None, delta=None):
        new_x, attn = self.attention(x, x, x, attn_mask=attn_mask, tau=tau, delta=delta)
        x = x + self.dropout(new_x)
        y = x = self.norm1(x)
        y = self.dropout(self.activation(self.conv1(y.transpose(-1, 1))))
        y = self.dropout(self.conv2(y).transpose(-1, 1))
        return self.norm2(x + y), attn


class AttentionLayer(nn.Module):
    def __init__(self, attention, d_model, n_heads, d_keys=None, d_values=None):
        super(AttentionLayer, self).__init__()
        d_keys = d_keys or (d_model // n_heads)
        d_values = d_values or (d_model // n_heads)
        self.inner_attention = attention
        self.query_projection = nn.Linear(d_model, d_keys * n_heads)
        self.key_projection = nn.Linear(d_model, d_keys * n_heads)
        self.value_projection = nn.Linear(d_model, d_values * n_heads)
        self.out_projection = nn.Linear(d_values * n_heads, d_model)
        self.n_heads = n_heads

    def forward(self, queries, keys, values, attn_mask, tau=None, delta=None):
        B, L, _ = queries.shape
        _, S, _ = keys.shape
        H = self.n_heads
        queries = self.query_projection(queries).view(B, L, H, -1)
        keys = self.key_projection(keys).view(B, S, H, -1)
        values = self.value_projection(values).view(B, S, H, -1)
        out, attn = self.inner_attention(
            queries, keys, values, attn_mask, tau=tau, delta=delta
        )
        out = out.view(B, L, -1)
        return self.out_projection(out), attn


class FullAttention(nn.Module):
    def __init__(self, mask_flag=True, factor=5, scale=None,
                 attention_dropout=0.1, output_attention=False):
        super(FullAttention, self).__init__()
        self.scale = scale
        self.mask_flag = mask_flag
        self.output_attention = output_attention
        self.dropout = nn.Dropout(attention_dropout)

    def forward(self, queries, keys, values, attn_mask, tau=None, delta=None):
        B, L, H, E = queries.shape
        _, S, _, D = values.shape
        scale = self.scale or 1.0 / sqrt(E)
        scores = torch.einsum("blhe,bshe->bhls", queries, keys)
        if self.mask_flag:
            large_negative = -math.log(1e10)
            attention_mask = torch.where(attn_mask == 0, large_negative, 0)
            scores = scores * attn_mask + attention_mask
        A = self.dropout(torch.softmax(scale * scores, dim=-1))
        V = torch.einsum("bhls,bshd->blhd", A, values)
        if self.output_attention:
            return V.contiguous(), A
        else:
            return V.contiguous(), None
```

⚠️ Encoder has two code paths; DUET takes the one without distilling
Encoder.forward has an `if self.conv_layers is not None` branch (used for Informer-style distilling). DUET does not pass `conv_layers` at instantiation, so `self.conv_layers = None` and the `else` branch runs: a plain loop over attn_layers that leaves the sequence length unchanged.
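§5.1 High-level logic
Goal in one sentence: treat the N=7 variables as tokens and exchange information among them with multi-head attention under the Mahalanobis mask, so that strongly correlated channels reinforce each other's representations while weakly correlated channels are isolated.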
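The isolation effect can be sketched in plain Python for a single query row. This is a minimal illustration, not DUET code: the raw scores are made-up numbers, and only the mask arithmetic (multiply by the mask, add -log(1e10) at masked positions, scale, softmax) mirrors FullAttention.forward from §5.0.

```python
import math

def masked_softmax_row(scores, mask, scale):
    """Apply the FullAttention mask recipe to one row, then softmax."""
    large_negative = -math.log(1e10)  # about -23.03
    # score * mask + (large_negative where mask == 0)
    masked = [s * m + (large_negative if m == 0 else 0.0)
              for s, m in zip(scores, mask)]
    # scale is applied after masking, as in the original code
    logits = [scale * v for v in masked]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# ASSUMPTION: hypothetical raw scores of variable 0 against all 7 variables
raw_scores = [0.8, -0.3, 1.5, 0.2, -1.1, 0.4, 0.0]
mask_row = [1, 1, 0, 1, 0, 1, 1]   # row 0 of the example mask in §5.3
weights = masked_softmax_row(raw_scores, mask_row, scale=0.5)

for j, w in enumerate(weights):
    print(f"variable {j}: weight = {w:.6f}")
```

With these numbers the masked positions (variables 2 and 4) end up with weights on the order of 1e-6 rather than exactly zero, because the masked logit after scaling is about -11.5.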
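Attention complexity:
In a standard Transformer the attention matrix has as many elements as the square of the sequence length, $L^2$. In Channel_transformer the tokens are variables ($N = 7$), not time steps ($L$), so each head's attention matrix has $N \times N = 7 \times 7 = 49$ elements. By comparison, attention over time steps would need $L \times L$ elements. Using variables as tokens therefore cuts the computation substantially when $N \ll L$. The Mahalanobis mask further sparsifies the attention pattern (the dense $N \times N$ scores are still computed): the number of channel pairs that are actually attended to is sum(mask[b]), which is usually far smaller than $N^2$.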
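The counts can be made concrete with a few lines of arithmetic. N = 7 comes from this note; the look-back length L = 96 and the full mask are hypothetical values chosen only for illustration (rows 0, 1 and 6 of the mask follow the §5.3 example, the remaining rows are made up).

```python
# Attention-matrix sizes per head and per sample.
N = 7    # number of variables (tokens in Channel_transformer)
L = 96   # ASSUMPTION: hypothetical look-back length, not taken from the note

variable_token_elems = N * N   # attention over variables
time_token_elems = L * L       # attention over time steps

# ASSUMPTION: hypothetical 0/1 mask for one sample
mask = [
    [1, 1, 0, 1, 0, 1, 1],
    [1, 1, 1, 0, 0, 1, 0],
    [0, 1, 1, 0, 1, 0, 0],
    [1, 0, 0, 1, 0, 0, 1],
    [0, 0, 1, 0, 1, 0, 1],
    [1, 1, 0, 0, 0, 1, 0],
    [1, 0, 0, 1, 1, 0, 1],
]
# Pairs that keep a non-negligible attention weight
active_pairs = sum(sum(row) for row in mask)

print("variable tokens:", variable_token_elems)
print("time-step tokens:", time_token_elems)
print("reduction factor:", round(time_token_elems / variable_token_elems, 1))
print("active pairs:", active_pairs, "of", N * N)
```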
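Role of e_layers=2: the first EncoderLayer lets each variable aggregate features from its strongly correlated neighbors, and the second performs second-order aggregation on top of that (neighbors of neighbors). Two layers are enough to capture two-hop relations in the local variable graph without introducing over-smoothing.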
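The two-hop argument can be illustrated on the mask alone. The sketch below treats a hypothetical 0/1 mask over four variables as an adjacency matrix and computes which variables can influence each other after one and two masked layers. It ignores the learned weights entirely, so it shows only which information paths are open, not how strongly they are used.

```python
def reachable_after(mask, hops):
    """Entry [i][j] is 1 if information from variable j can reach variable i
    within `hops` masked attention layers."""
    n = len(mask)
    # Zero layers: every variable only carries its own information
    reach = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for _ in range(hops):
        # The residual connection keeps whatever i already had
        nxt = [row[:] for row in reach]
        for i in range(n):
            for k in range(n):
                if mask[i][k]:              # i attends to k in this layer
                    for j in range(n):
                        if reach[k][j]:     # k already carries information from j
                            nxt[i][j] = 1
        reach = nxt
    return reach

# ASSUMPTION: hypothetical chain-shaped mask 0-1-2-3 (each variable sees itself
# and its direct neighbors only)
chain_mask = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
]

one_layer = reachable_after(chain_mask, hops=1)
two_layers = reachable_after(chain_mask, hops=2)

print("after 1 layer, variable 0 sees:", one_layer[0])
print("after 2 layers, variable 0 sees:", two_layers[0])
```

In this toy chain, variable 0 gains access to variable 2 only at the second layer, and variable 3 stays out of reach, which is the "neighbors of neighbors" behavior described above.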
§5.2 AttentionLayer: Q/K/V projection and multi-head split
The input x (3, 7, 8) serves as queries, keys, and values at once (self-attention):

```python
queries = self.query_projection(queries).view(B, L, H, -1)
keys = self.key_projection(keys).view(B, S, H, -1)
values = self.value_projection(values).view(B, S, H, -1)
```

Each projection layer is Linear(d_model=8, d_keys*n_heads=4×2=8), with 8×8+8=72 parameters.
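The parameter count and the layout produced by .view can be checked directly in PyTorch. This is a small sketch using the toy sizes from this note (B=3, N=7, d_model=8, n_heads=2); the random input stands in for the MoE features and is not real data.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, N, d_model, n_heads = 3, 7, 8, 2
d_keys = d_model // n_heads  # 4

query_projection = nn.Linear(d_model, d_keys * n_heads)
# Weight (8 x 8) plus bias (8)
n_params = sum(p.numel() for p in query_projection.parameters())

x = torch.randn(B, N, d_model)          # stand-in for the MoE features
flat = query_projection(x)              # (3, 7, 8)
queries = flat.view(B, N, n_heads, -1)  # (3, 7, 2, 4)

# Head 0 takes the first d_keys outputs, head 1 the next d_keys
head0_matches = torch.equal(queries[0, 0, 0], flat[0, 0, :d_keys])
head1_matches = torch.equal(queries[0, 0, 1], flat[0, 0, d_keys:])

print(n_params, tuple(queries.shape), head0_matches, head1_matches)
```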
Shape trace (d_keys = d_model // n_heads = 8 // 2 = 4):

```
(3, 7, 8) → Linear → (3, 7, 8) → .view(3, 7, 2, 4)
                                       B  L  H  E=d_keys
```

ASCII diagram of the multi-head split:

```
Linear output (3, 7, 8):
For batch 0, token 0 (variable 0):
[ q0, q1, q2, q3, | q4, q5, q6, q7 ]   ← 8 values
  ←   Head 0   →  | ←   Head 1   →
     d_keys=4          d_keys=4
.view(3, 7, 2, 4):
queries[0, 0, 0, :] = [q0, q1, q2, q3]   ← Head 0, variable 0
queries[0, 0, 1, :] = [q4, q5, q6, q7]   ← Head 1, variable 0
```

§5.3 FullAttention: masked attention computation
Step 1: Attention scores

```python
scores = torch.einsum("blhe,bshe->bhls", queries, keys)
```

(3, 7, 2, 4) × (3, 7, 2, 4) → (3, 2, 7, 7)   [B, H, L=N, S=N]
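To see what the einsum string computes, the sketch below compares it with an explicit permute-and-matmul on random tensors of the toy shapes. The tensors are random placeholders, and the matmul form is only an equivalent rewriting for illustration, not how the source computes the scores.

```python
import torch

torch.manual_seed(0)
B, L, S, H, E = 3, 7, 7, 2, 4
queries = torch.randn(B, L, H, E)
keys = torch.randn(B, S, H, E)

# Form used in FullAttention.forward
scores = torch.einsum("blhe,bshe->bhls", queries, keys)

# Equivalent: move heads to dim 1, then batched Q @ K^T
q = queries.permute(0, 2, 1, 3)                        # (B, H, L, E)
k = keys.permute(0, 2, 1, 3)                           # (B, H, S, E)
scores_matmul = torch.matmul(q, k.transpose(-1, -2))   # (B, H, L, S)

# One entry is the dot product of one query and one key within the same head:
# batch 0, head 0, query variable 2, key variable 5
single = torch.dot(queries[0, 2, 0], keys[0, 5, 0])

shapes_ok = tuple(scores.shape) == (B, H, L, S)
forms_agree = torch.allclose(scores, scores_matmul, atol=1e-5)
entry_agrees = torch.allclose(scores[0, 0, 2, 5], single, atol=1e-5)
print(shapes_ok, forms_agree, entry_agrees)
```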
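Toy values (only the 7×7 submatrix for batch 0, head 0; scale = 1/√4 = 0.5):

```
scores[0, 0, i, j] = q[0,i,0,:] · k[0,j,0,:]   (dot product within Head 0)
In principle the range is (-∞, +∞); before training the values are roughly N(0, 1/4), depending on initialization
```

Step 2: Apply the Mahalanobis mask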
```python
large_negative = -math.log(1e10)  # ≈ -23.03
attention_mask = torch.where(attn_mask == 0, large_negative, 0)
scores = scores * attn_mask + attention_mask
```

attn_mask (3, 1, 7, 7) is broadcast to (3, 2, 7, 7):

```
Example: attn_mask[0, 0, :, :] =
[[1, 1, 0, 1, 0, 1, 1],   ← variable 0 attends to {0,1,3,5,6}
 [1, 1, 1, 0, 0, 1, 0],   ← variable 1 attends to {0,1,2,5}
 ...
 [1, 0, 0, 1, 1, 0, 1]]   ← variable 6 attends to {0,3,4,6}
After applying the mask:
mask=1: scores[b,h,i,j] × 1 + 0 = original score
mask=0: scores[b,h,i,j] × 0 + (-23.03) = -23.03
```

Step 3: Softmax + value weighting
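```python
A = self.dropout(torch.softmax(scale * scores, dim=-1))  # (3, 2, 7, 7)
V = torch.einsum("bhls,bshd->blhd", A, values)           # (3, 7, 2, 4)
```

softmax runs along the last dimension (S=7), so masked positions receive a weight close to 0. Because scale is applied after the mask, the masked logit that actually enters softmax is 0.5 × (-23.03) ≈ -11.51, which makes the weight tiny rather than exactly zero.
Toy values (softmax after masking; batch 0, head 0, variable 0):

```
masked_scores[0,0,0,:] = [s0, s1, -23.03, s3, -23.03, s5, s6]
softmax → a ≈ [a0, a1, ≈0, a3, ≈0, a5, a6]; effective weight sits on only 5 positions
```

Step 4: Output projection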
```python
out = out.view(B, L, -1)                 # (3, 7, 2, 4) → (3, 7, 8)
return self.out_projection(out), attn    # Linear(8, 8) → (3, 7, 8)
```

§5.4 EncoderLayer: residual + FFN
```python
def forward(self, x, attn_mask=None, tau=None, delta=None):
    new_x, attn = self.attention(x, x, x, attn_mask=attn_mask, tau=tau, delta=delta)
    x = x + self.dropout(new_x)     # first residual
    y = x = self.norm1(x)           # norm1, bound to both x and y
    y = self.dropout(self.activation(self.conv1(y.transpose(-1, 1))))
    y = self.dropout(self.conv2(y).transpose(-1, 1))
    return self.norm2(x + y), attn  # second residual + norm2
```

First residual:

```
x_input (3,7,8)
↓ attention (masked self-attention)
new_x (3,7,8)
↓ dropout
x = x_input + dropout(new_x) → (3,7,8)
```

FFN: Conv1d and transpose
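nn.Conv1d expects input in the layout (B, C_in, L). Here x has the layout (B, N, d_model) = (3, 7, 8), where N=7 plays the role of "sequence length" and d_model=8 the role of "channels":

```
y.transpose(-1, 1):
  (3, 7, 8) → (3, 8, 7)    ← C=8, L=7 (Conv1d layout)
conv1 = Conv1d(8→32, kernel=1):
  (3, 8, 7) → (3, 32, 7)   ← point-wise (kernel=1 is equivalent to Linear)
activation + dropout → (3, 32, 7)
conv2 = Conv1d(32→8, kernel=1):
  (3, 32, 7) → (3, 8, 7)
.transpose(-1, 1):
  (3, 8, 7) → (3, 7, 8)    ← layout restored
```

Conv1d(kernel=1) ≡ point-wise Linear
With kernel_size=1, Conv1d applies the same linear map independently at every position, which is equivalent to `nn.Linear` acting on the feature dimension (d_model → d_ff, then d_ff → d_model). The use of Conv1d is a historical carry-over: the original Transformer paper notes that its FFN can be described as two convolutions with kernel size 1, and this Encoder code follows that form. Semantically it is fully equivalent to `Linear`, at the cost of two extra transpose calls.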
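The equivalence can be verified numerically. The sketch below copies the weights of a kernel-size-1 Conv1d into an nn.Linear and compares the outputs on a random input with the toy shapes from this note; the layer names and the random data are illustrative only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, N, d_model, d_ff = 3, 7, 8, 32

conv = nn.Conv1d(in_channels=d_model, out_channels=d_ff, kernel_size=1)
linear = nn.Linear(d_model, d_ff)

# Copy the conv parameters into the linear layer:
# conv.weight is (d_ff, d_model, 1), linear.weight is (d_ff, d_model)
with torch.no_grad():
    linear.weight.copy_(conv.weight.squeeze(-1))
    linear.bias.copy_(conv.bias)

x = torch.randn(B, N, d_model)

# Conv1d path: transpose to (B, C, L), convolve, transpose back
y_conv = conv(x.transpose(-1, 1)).transpose(-1, 1)   # (3, 7, 32)
# Linear path: acts on the last dimension directly, no transpose needed
y_linear = linear(x)                                  # (3, 7, 32)

same = torch.allclose(y_conv, y_linear, atol=1e-5)
print(tuple(y_conv.shape), same)
```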
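Second residual:

```
x              (from norm1, via the assignment trick y = x = norm1(x))
+ dropout(y)   ← FFN output
→ norm2(x + y) → (3, 7, 8)
```

Reference semantics of y = x = self.norm1(x)
This line first computes `norm1(x)` and binds the result to `x` (replacing the earlier `x`) and also to `y`. At this point `x` and `y` refer to the same tensor. The following `y = ...` rebinds `y` to the FFN output, while `x` still refers to the norm1 output. The final `x + y` is therefore norm1 output + FFN output, which implements the second residual connection. This is ordinary Python name rebinding, not an in-place modification.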
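The binding behavior can be reproduced without tensors. In this sketch plain lists stand in for tensors, and fake_norm1 and fake_ffn are placeholder functions invented for the illustration; they are not part of DUET.

```python
def fake_norm1(values):
    """Placeholder for norm1: returns a new object (here, mean-centered values)."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def fake_ffn(values):
    """Placeholder for the FFN: also returns a new object."""
    return [2.0 * v for v in values]

x = [1.0, 2.0, 3.0]
y = x = fake_norm1(x)          # both names now refer to one new list
same_object_before = y is x

y = fake_ffn(y)                # rebinds y only; x is untouched
same_object_after = y is x

# Second residual: norm1 output + FFN output
residual = [a + b for a, b in zip(x, y)]

print(same_object_before, same_object_after)
print(x, y, residual)
```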
Toy trace (a single EncoderLayer; batch 0, token = variable 0):

```
x[0, 0, :] = [f0, f1, ..., f7]   ← MoE feature of variable 0 (8 dims)
After attention, new_x[0, 0, :] = a mixture of features weighted over the related variables (8 dims)
x → x + dropout(new_x) → norm1 → also bound to y
FFN: (8) → Conv1d(k=1) → (32) → GELU → (32) → Conv1d(k=1) → (8)
Final norm2(x + y): shape stays (3, 7, 8)
```

§5.5 Encoder loop: e_layers=2
In the DUET configuration, e_layers=2 and conv_layers=None:

```python
for attn_layer in self.attn_layers:   # runs 2 times
    x, attn = attn_layer(x, attn_mask=attn_mask, tau=tau, delta=delta)
    attns.append(attn)
if self.norm is not None:
    x = self.norm(x)                  # LayerNorm(8)
return x, attns
```

The shape stays (3, 7, 8) throughout, and both layers reuse the same attn_mask (3, 1, 7, 7).
attns is a list of length 2 whose entries are all None (because output_attention=0 → False). DUETModel.forward() receives this list under the name attention and then discards it without using it.
§6 Drill-down subcomponents
All four classes in this layer (Encoder / EncoderLayer / AttentionLayer / FullAttention) are covered in full in §5, so no separate documents are needed.
Created: 2026-04-24