DUET · Layer 2C: Channel_transformer (Masked Variable Transformer)

§1 Position in the Parent Layer

This is the second step of the channel path in DUETModel.forward(). The first step, [[03B-Layer2B-MahalanobisMask]], generates the mask:

python

channel_group_feature, attention = self.Channel_transformer(
    x=temporal_feature,     # (3, 7, 8)
    attn_mask=channel_mask  # (3, 1, 7, 7)
)

self.Channel_transformer is an Encoder instance: e_layers=2 EncoderLayer modules followed by a final LayerNorm(d_model=8).

§2 I/O Interface

python

Encoder.forward(x, attn_mask) -> (Tensor, list)

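| Parameter | Shape | Meaning |
| --- | --- | --- |
| `x` | `(B, N, d_model)` = `(3, 7, 8)` | Temporal feature vector of each variable (MoE output) |
| `attn_mask` | `(B, 1, N, N)` = `(3, 1, 7, 7)` | Mahalanobis 0/1 channel mask |
| returns `x` | `(3, 7, 8)` | Features after cross-variable attention |
| returns `attns` | `list` of length 2 | Per-layer attention weights (every entry is `None` when output_attention=0) |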

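Token semantics: variables, not time steps

This Encoder has the same code structure as the iTransformer Encoder. The key difference is what each token represents:
iTransformer: each token is the full history of one variable (a length-L vector embedded into d_model dimensions)
DUET Channel_transformer: each token is the per-variable feature produced by the MoE path (a d_model-dimensional vector)
Both use an $N \times N$ attention matrix, but DUET additionally applies the Mahalanobis mask so that weakly correlated variable pairs cannot exchange information.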

§3 Sequence Diagram (Concrete Layer)

§4 Semantic Grouping Diagram (Index Layer)

§5 Step-by-Step Walkthrough

§5.0 Complete Original Code

python

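# Imports required by the classes below
import math
from math import sqrt

import torch
import torch.nn as nn
import torch.nn.functional as F


class Encoder(nn.Module):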
    def __init__(self, attn_layers, conv_layers=None, norm_layer=None):
        super(Encoder, self).__init__()
        self.attn_layers = nn.ModuleList(attn_layers)
        self.conv_layers = (
            nn.ModuleList(conv_layers) if conv_layers is not None else None
        )
        self.norm = norm_layer

    def forward(self, x, attn_mask=None, tau=None, delta=None):
        attns = []
        if self.conv_layers is not None:
            for i, (attn_layer, conv_layer) in enumerate(
                zip(self.attn_layers, self.conv_layers)
            ):
                delta = delta if i == 0 else None
                x, attn = attn_layer(x, attn_mask=attn_mask, tau=tau, delta=delta)
                x = conv_layer(x)
                attns.append(attn)
            x, attn = self.attn_layers[-1](x, tau=tau, delta=None)
            attns.append(attn)
        else:
            for attn_layer in self.attn_layers:
                x, attn = attn_layer(x, attn_mask=attn_mask, tau=tau, delta=delta)
                attns.append(attn)
        if self.norm is not None:
            x = self.norm(x)
        return x, attns


class EncoderLayer(nn.Module):
    def __init__(self, attention, d_model, d_ff=None, dropout=0.1, activation="relu"):
        super(EncoderLayer, self).__init__()
        d_ff = d_ff or 4 * d_model
        self.attention = attention
        self.conv1 = nn.Conv1d(in_channels=d_model, out_channels=d_ff, kernel_size=1)
        self.conv2 = nn.Conv1d(in_channels=d_ff, out_channels=d_model, kernel_size=1)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        self.activation = F.relu if activation == "relu" else F.gelu

    def forward(self, x, attn_mask=None, tau=None, delta=None):
        new_x, attn = self.attention(x, x, x, attn_mask=attn_mask, tau=tau, delta=delta)
        x = x + self.dropout(new_x)

        y = x = self.norm1(x)
        y = self.dropout(self.activation(self.conv1(y.transpose(-1, 1))))
        y = self.dropout(self.conv2(y).transpose(-1, 1))

        return self.norm2(x + y), attn


class AttentionLayer(nn.Module):
    def __init__(self, attention, d_model, n_heads, d_keys=None, d_values=None):
        super(AttentionLayer, self).__init__()
        d_keys = d_keys or (d_model // n_heads)
        d_values = d_values or (d_model // n_heads)
        self.inner_attention = attention
        self.query_projection = nn.Linear(d_model, d_keys * n_heads)
        self.key_projection = nn.Linear(d_model, d_keys * n_heads)
        self.value_projection = nn.Linear(d_model, d_values * n_heads)
        self.out_projection = nn.Linear(d_values * n_heads, d_model)
        self.n_heads = n_heads

    def forward(self, queries, keys, values, attn_mask, tau=None, delta=None):
        B, L, _ = queries.shape
        _, S, _ = keys.shape
        H = self.n_heads
        queries = self.query_projection(queries).view(B, L, H, -1)
        keys = self.key_projection(keys).view(B, S, H, -1)
        values = self.value_projection(values).view(B, S, H, -1)
        out, attn = self.inner_attention(
            queries, keys, values, attn_mask, tau=tau, delta=delta
        )
        out = out.view(B, L, -1)
        return self.out_projection(out), attn


class FullAttention(nn.Module):
    def __init__(self, mask_flag=True, factor=5, scale=None,
                 attention_dropout=0.1, output_attention=False):
        super(FullAttention, self).__init__()
        self.scale = scale
        self.mask_flag = mask_flag
        self.output_attention = output_attention
        self.dropout = nn.Dropout(attention_dropout)

    def forward(self, queries, keys, values, attn_mask, tau=None, delta=None):
        B, L, H, E = queries.shape
        _, S, _, D = values.shape
        scale = self.scale or 1.0 / sqrt(E)
        scores = torch.einsum("blhe,bshe->bhls", queries, keys)
        if self.mask_flag:
            large_negative = -math.log(1e10)
            attention_mask = torch.where(attn_mask == 0, large_negative, 0)
            scores = scores * attn_mask + attention_mask
        A = self.dropout(torch.softmax(scale * scores, dim=-1))
        V = torch.einsum("bhls,bshd->blhd", A, values)
        if self.output_attention:
            return V.contiguous(), A
        else:
            return V.contiguous(), None

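⚠️ Encoder has two code paths; DUET takes the one without distilling

Encoder.forward contains an if self.conv_layers is not None branch, used for Informer-style distilling. DUET does not pass conv_layers when instantiating the Encoder, so self.conv_layers = None and the else branch runs: a plain loop over attn_layers that leaves the sequence length unchanged.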

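§5.1 High-Level Logic

Goal in one sentence: treat the N=7 variables as tokens and use multi-head attention with the Mahalanobis mask to exchange information across variables, so that strongly correlated channels reinforce each other's feature representations while weakly correlated channels are isolated.

Attention complexity analysis:

Attention matrix size

In a standard Transformer the number of attention entries is the square of the sequence length $L$. The tokens of Channel_transformer are variables ( $N = 7$ ), not time steps ( $L = 16$ ):
$A \in \mathbb{R}^{N \times N}, \quad N^{2} = 7 \times 7 = 49 \text{ entries}$
For comparison, attention over time steps would need $16 \times 16 = 256$ entries. Using variables as tokens reduces the cost whenever $N \ll L$. The Mahalanobis mask then sparsifies the attention pattern further: the number of channel pairs that effectively attend to each other is $\approx$ sum(mask[b]), which is at most $N^{2}$ and usually much smaller. The scores are still computed densely, so the mask changes which pairs carry weight, not the amount of arithmetic.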
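The counts above can be checked with a few lines of plain Python. This is only an arithmetic sketch: the 7×7 mask below is a made-up example (the name `mask` and its values are assumptions, not output of the real Mahalanobis mask generator).

```python
# Toy sizes used throughout this note.
N, L = 7, 16

variable_entries = N * N   # attention over variables
time_entries = L * L       # attention over time steps, for comparison

# Hypothetical 0/1 mask for one batch element (row i = keys that variable i may attend to).
mask = [
    [1, 1, 0, 1, 0, 1, 1],
    [1, 1, 1, 0, 0, 1, 0],
    [0, 1, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 1, 0, 0],
    [0, 0, 0, 1, 1, 0, 1],
    [1, 1, 0, 0, 0, 1, 0],
    [1, 0, 0, 1, 1, 0, 1],
]

# Number of (query, key) pairs that keep a non-negligible weight.
effective_pairs = sum(sum(row) for row in mask)

print(variable_entries, time_entries, effective_pairs)
```

With this particular mask, fewer than half of the 49 pairs remain active; the real ratio depends on the data and on how the mask is generated.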

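Role of e_layers=2: the first EncoderLayer lets each variable aggregate features from its strongly correlated neighbors, and the second performs a second-order aggregation on top of that (neighbors of neighbors). Two layers are therefore enough to reach two-hop relations in the local variable graph, and the shallow depth limits the risk of over-smoothing.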
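The "neighbors of neighbors" idea can be illustrated by composing the mask with itself. The sketch below is an assumption-laden illustration: it uses a hypothetical 4-variable chain mask and treats masked entries as exactly zero, whereas the real softmax leaves them a tiny but non-zero weight.

```python
def compose(a, b):
    """Boolean matrix product: out[i][k] = 1 if some j has a[i][j] and b[j][k]."""
    n = len(a)
    return [[int(any(a[i][j] and b[j][k] for j in range(n))) for k in range(n)]
            for i in range(n)]

# Hypothetical mask: variables form a chain 0-1-2-3, each also attends to itself.
one_hop = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
]

# Inputs that can influence each variable after two stacked layers.
two_hop = compose(one_hop, one_hop)

for row in two_hop:
    print(row)
```

After one layer, variable 0 only sees variables 0 and 1; after two layers it also receives information from variable 2, but still nothing from variable 3.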

§5.2 `AttentionLayer`: Q/K/V Projections and Multi-Head Split

The input x (3, 7, 8) serves as queries, keys, and values at once (self-attention):

python

queries = self.query_projection(queries).view(B, L, H, -1)
keys    = self.key_projection(keys).view(B, S, H, -1)
values  = self.value_projection(values).view(B, S, H, -1)

Each projection is Linear(d_model=8, d_keys*n_heads=4×2=8), with 8×8+8=72 parameters (weights plus bias).

Shape trace (d_keys = d_model // n_heads = 8 // 2 = 4):

(3, 7, 8) → Linear → (3, 7, 8) → .view(3, 7, 2, 4)
                                        B  L  H  E=d_keys

ASCII diagram of the multi-head split:

Linear output (3, 7, 8):
  For batch 0, token 0 (variable 0):
  [ q0, q1, q2, q3 | q4, q5, q6, q7 ]   ← 8 values
    ←   Head 0   → | ←   Head 1   →
       d_keys=4         d_keys=4

.view(3, 7, 2, 4):
  queries[0, 0, 0, :] = [q0, q1, q2, q3]   ← Head 0, variable 0
  queries[0, 0, 1, :] = [q4, q5, q6, q7]   ← Head 1, variable 0
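The split and the parameter counts can be reproduced without torch. In this sketch the integers 0..7 stand in for q0..q7, and slicing a flat list plays the role of .view on a contiguous tensor; both are illustrative assumptions.

```python
d_model, n_heads = 8, 2
d_keys = d_model // n_heads                      # 4

# Parameters of one projection: weight matrix plus bias.
params_per_projection = d_model * (d_keys * n_heads) + d_keys * n_heads

# AttentionLayer holds four such layers: query, key, value and output projection.
params_attention_layer = 4 * params_per_projection

# One token's projected query; integers stand in for q0..q7.
q = list(range(d_model))

# .view(..., H, -1) on a contiguous vector yields consecutive chunks of size d_keys.
heads = [q[h * d_keys:(h + 1) * d_keys] for h in range(n_heads)]

print(params_per_projection, params_attention_layer, heads)
```

Because the split is a pure reshape, no values are mixed between heads at this point; mixing only happens later in out_projection.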

§5.3 `FullAttention`: Masked Attention Computation

Step 1: Attention scores

python

scores = torch.einsum("blhe,bshe->bhls", queries, keys)

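(3, 7, 2, 4) × (3, 7, 2, 4) → (3, 2, 7, 7) [B, H, L=N, S=N]

Toy values (only the 7×7 submatrix for B=0, H=0; scale = 1/√4 = 0.5):

scores[0, 0, i, j] = q[0,i,0,:] · k[0,j,0,:]  (dot product in Head 0)
  Unbounded real values; their spread before training depends on the weight initialization and the input scale. The factor scale = 0.5 is applied later, inside the softmax.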
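For readers less used to einsum notation, the contraction "blhe,bshe->bhls" can be written as nested loops. This is a sketch on nested Python lists with tiny made-up sizes; the function name `attention_scores` is hypothetical, and k = q is used only for brevity (in the real layer queries and keys come from separate projections).

```python
def attention_scores(queries, keys):
    """Nested-list version of einsum("blhe,bshe->bhls").

    queries: [B][L][H][E], keys: [B][S][H][E]; returns scores [B][H][L][S].
    """
    B, L, H = len(queries), len(queries[0]), len(queries[0][0])
    E = len(queries[0][0][0])
    S = len(keys[0])
    return [[[[sum(queries[b][l][h][e] * keys[b][s][h][e] for e in range(E))
               for s in range(S)]
              for l in range(L)]
             for h in range(H)]
            for b in range(B)]

# Hypothetical input: B=1, L=S=2 tokens, H=2 heads, E=2 dims per head.
q = [[[[1.0, 0.0], [0.0, 2.0]],     # token 0: head 0, head 1
      [[1.0, 1.0], [3.0, 0.0]]]]    # token 1: head 0, head 1
k = q
scores = attention_scores(q, k)

print(scores[0][0])   # head 0: 2x2 score matrix
print(scores[0][1])   # head 1: 2x2 score matrix
```

Each head gets its own L×S matrix, and the head axis moves in front of the token axes, which is what allows the (B, 1, N, N) mask to broadcast over heads.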

Step 2: Apply the Mahalanobis mask

python

large_negative = -math.log(1e10)   # ≈ -23.03
attention_mask = torch.where(attn_mask == 0, large_negative, 0)
scores = scores * attn_mask + attention_mask

attn_mask (3, 1, 7, 7) broadcasts to (3, 2, 7, 7):

Example: attn_mask[0, 0, :, :] =
  [[1, 1, 0, 1, 0, 1, 1],    ← variable 0 attends to {0,1,3,5,6}
   [1, 1, 1, 0, 0, 1, 0],    ← variable 1 attends to {0,1,2,5}
   ...
   [1, 0, 0, 1, 1, 0, 1]]    ← variable 6 attends to {0,3,4,6}

After applying the mask:
  where mask=1: scores[b,h,i,j] × 1 + 0 = original score
  where mask=0: scores[b,h,i,j] × 0 + (-23.03) = -23.03
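The two-term masking expression can be traced on a single row in plain Python. The raw scores below are made-up numbers and `apply_mask` is a hypothetical helper; only the formula itself mirrors FullAttention.

```python
import math

LARGE_NEGATIVE = -math.log(1e10)   # about -23.03, as in FullAttention

def apply_mask(score_row, mask_row):
    """Mirror scores * attn_mask + where(attn_mask == 0, large_negative, 0) for one row."""
    return [s * m + (LARGE_NEGATIVE if m == 0 else 0.0)
            for s, m in zip(score_row, mask_row)]

scores_row = [0.8, -0.3, 1.5, 0.2, -1.1, 0.4, 0.0]   # hypothetical raw scores
mask_row = [1, 1, 0, 1, 0, 1, 1]                      # variable 0's row in the example mask
masked = apply_mask(scores_row, mask_row)

print([round(v, 2) for v in masked])
```

Multiplying by the mask first means every masked position ends up at the same constant, regardless of how large its original score was.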

Step 3: Softmax + value weighting

python

A = self.dropout(torch.softmax(scale * scores, dim=-1))   # (3, 2, 7, 7)
V = torch.einsum("bhls,bshd->blhd", A, values)            # (3, 7, 2, 4)

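The softmax runs over the last dimension (S=7). Because scale is applied after masking, a masked position enters the softmax as $0.5 \times (-23.03) \approx -11.51$, and $\exp(-11.51) \approx 10^{-5}$, which is negligible next to unmasked entries of ordinary magnitude. Each row sums to 1, and practically all of the weight falls on the unmasked channel pairs.

Toy values (softmax after masking; batch 0, head 0, variable 0):

masked_scores[0,0,0,:] = [s0, s1, -23.03, s3, -23.03, s5, s6]
softmax → a ≈ [a0, a1, ≈0, a3, ≈0, a5, a6], with effective weight on only 5 positions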
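A small numeric check makes the size of the masked weights concrete. Note that in FullAttention the softmax receives scale * scores, so the masked constant is also multiplied by scale. The unmasked scores below are made-up values and the `softmax` helper is a plain-Python stand-in for torch.softmax.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

scale = 1.0 / math.sqrt(4)            # E = d_keys = 4, so scale = 0.5
large_negative = -math.log(1e10)      # about -23.03

# Hypothetical masked row for variable 0: positions 2 and 4 are masked.
masked_scores = [0.8, -0.3, large_negative, 0.2, large_negative, 0.4, 0.0]
weights = softmax([scale * s for s in masked_scores])

# Exponent actually seen by the softmax at a masked position (about -11.51).
masked_exponent = scale * large_negative

print([round(w, 6) for w in weights])
print(masked_exponent, math.exp(masked_exponent))
```

The masked weights are small but not exactly zero, so the isolation between weakly correlated channels is soft in the strict numerical sense.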

Step 4: Output projection

python

out = out.view(B, L, -1)              # (3, 7, 2, 4) → (3, 7, 8)
return self.out_projection(out), attn  # Linear(8, 8) → (3, 7, 8)

§5.4 `EncoderLayer`: Residual + FFN

python

def forward(self, x, attn_mask=None, tau=None, delta=None):
    new_x, attn = self.attention(x, x, x, attn_mask=attn_mask, tau=tau, delta=delta)
    x = x + self.dropout(new_x)           # first residual connection

    y = x = self.norm1(x)                 # norm1 output bound to both x and y
    y = self.dropout(self.activation(self.conv1(y.transpose(-1, 1))))
    y = self.dropout(self.conv2(y).transpose(-1, 1))

    return self.norm2(x + y), attn        # second residual connection + norm2

First residual connection:

x_input (3,7,8)
   ↓ attention (masked self-attention)
new_x (3,7,8)
   ↓ dropout
x = x_input + dropout(new_x)   → (3,7,8)

The FFN's Conv1d and transpose:

nn.Conv1d expects input laid out as (B, C_in, L). Here x has layout (B, N, d_model) = (3, 7, 8), where N=7 plays the role of the sequence length and d_model=8 the role of the channels:

y.transpose(-1, 1):
  (3, 7, 8) → (3, 8, 7)   ← C=8, L=7 (Conv1d layout)

conv1 = Conv1d(8→32, kernel=1):
  (3, 8, 7) → (3, 32, 7)  ← point-wise (kernel=1 is equivalent to Linear)

activation + dropout → (3, 32, 7)

conv2 = Conv1d(32→8, kernel=1):
  (3, 32, 7) → (3, 8, 7)

.transpose(-1, 1):
  (3, 8, 7) → (3, 7, 8)   ← layout restored

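Conv1d(kernel=1) ≡ Point-wise Linear

With kernel_size=1, Conv1d applies the same linear map independently at every position, which is equivalent to nn.Linear acting on the feature dimension (d_model → d_ff for conv1, d_ff → d_model for conv2). The Conv1d form appears to be inherited from the Informer-style code base this Encoder comes from; the original Transformer paper likewise notes that its position-wise FFN can be described as two convolutions with kernel size 1. The result is the same as with Linear, at the cost of two extra transposes.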
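The equivalence can be verified by hand on nested lists. This is a sketch with made-up sizes and weights; `linear`, `conv1d_k1` and `transpose` are hypothetical helpers, and the weight is stored as [C_out][C_in], i.e. the nn.Conv1d weight with its trailing kernel axis of size 1 dropped.

```python
def linear(x, weight, bias):
    """x: [N][C_in] -> [N][C_out]; the same map is applied to every token."""
    return [[sum(w * v for w, v in zip(w_row, token)) + b
             for w_row, b in zip(weight, bias)]
            for token in x]

def conv1d_k1(x_cl, weight, bias):
    """x_cl: [C_in][L] -> [C_out][L]; a kernel_size=1 convolution written out by hand."""
    length = len(x_cl[0])
    return [[sum(w * x_cl[c][pos] for c, w in enumerate(w_row)) + b
             for pos in range(length)]
            for w_row, b in zip(weight, bias)]

def transpose(m):
    """Swap the two axes of a nested list."""
    return [list(col) for col in zip(*m)]

# Hypothetical sizes: 3 tokens, 2 input features, 3 output features.
x = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]
weight = [[1.0, 0.0], [0.5, 0.5], [-1.0, 2.0]]
bias = [0.0, 1.0, -0.5]

via_linear = linear(x, weight, bias)
via_conv = transpose(conv1d_k1(transpose(x), weight, bias))

print(via_linear)
print(via_conv)
```

The two transposes around `conv1d_k1` correspond to the y.transpose(-1, 1) calls in EncoderLayer.forward.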

Second residual connection:

x (the norm1 output, via the chained assignment y = x = norm1(x))
   + y                                  ← FFN output, dropout already applied
→ norm2(x + y)                          → (3, 7, 8)
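What norm2(x + y) does to a single token can be sketched as follows. The numbers are made up, and `layer_norm` is a simplified stand-in for nn.LayerNorm that omits the learnable scale and shift (which start at 1 and 0, so the two agree at initialization).

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize one token to zero mean and unit variance over its features."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)   # biased variance
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

# Hypothetical 8-dim token: x is a norm1 output, y an FFN output.
x = layer_norm([0.5, -1.0, 2.0, 0.0, 1.5, -0.5, 1.0, -2.0])
y = [0.2, 0.0, -0.4, 0.1, 0.3, -0.1, 0.0, 0.5]

# Second residual connection followed by norm2.
out = layer_norm([a + b for a, b in zip(x, y)])

print([round(v, 3) for v in out])
```

The normalization runs over the d_model=8 features of each token separately, so variables are never mixed by LayerNorm; only attention does that.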

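Binding semantics of y = x = self.norm1(x)

This line evaluates norm1(x) once and binds the result to both x (replacing the previous x) and y, so at this point x and y refer to the same tensor. The following y = ... statements rebind y to the FFN output, while x still refers to the norm1 output. The final x + y is therefore norm1 output + FFN output, which is the second residual connection. This is ordinary Python name rebinding, not an in-place modification.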
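The same binding behavior can be observed with plain lists. In this sketch `norm1` and `ffn` are hypothetical stand-ins that return new objects, just as the real modules return new tensors.

```python
def norm1(v):
    """Stand-in for self.norm1: subtracts the mean and returns a new list."""
    mean = sum(v) / len(v)
    return [e - mean for e in v]

def ffn(v):
    """Stand-in for the conv1 / activation / conv2 stack: returns a new list."""
    return [2.0 * e for e in v]

x = [1.0, 2.0, 3.0]
original = x                 # keep a handle on the pre-norm value

y = x = norm1(x)             # one evaluation, two names
same_object = y is x         # both names refer to the norm1 output

y = ffn(y)                   # rebinds y only; x is untouched
out = [a + b for a, b in zip(x, y)]   # norm1 output + FFN output

print(same_object, original, x, y, out)
```

Since nothing is modified in place, the pre-norm value is left intact as well; it simply has no name in the real forward method after this line.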

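Toy trace (a single EncoderLayer; batch 0, token = variable 0):

x[0, 0, :] = [f0, f1, ..., f7]  ← MoE feature of variable 0 (8-dim)

After attention, new_x[0, 0, :] = a mix of the related variables' value vectors, weighted by attention (8-dim)
x → x + dropout(new_x) → norm1 → bound to both x and y

FFN: (8) → Conv1d(k=1) → (32) → activation (ReLU or GELU, per configuration) → (32) → Conv1d(k=1) → (8)
Final norm2(x + y): shape stays (3, 7, 8)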

§5.5 Encoder Loop: e_layers=2

In the DUET configuration, e_layers=2 and conv_layers=None:

python

for attn_layer in self.attn_layers:   # runs 2 times
    x, attn = attn_layer(x, attn_mask=attn_mask, tau=tau, delta=delta)
    attns.append(attn)
if self.norm is not None:
    x = self.norm(x)                  # LayerNorm(8)
return x, attns

The shape stays (3, 7, 8) throughout, and both layers reuse the same attn_mask (3, 1, 7, 7).

attns is a list of length 2 whose entries are all None, because output_attention=0 is falsy. DUETModel.forward() binds this list to the name attention and never uses it.
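The control flow of the loop, including the list of None values, can be mimicked with stand-in layers. Everything in this sketch is hypothetical (`make_layer`, `encoder_forward`, the +1.0 placeholder transform); it only mirrors the structure of the else branch in Encoder.forward.

```python
def make_layer(output_attention=False):
    """Hypothetical stand-in for EncoderLayer: keeps the shape, returns (x, attn)."""
    def layer(x, attn_mask=None):
        new_x = [[v + 1.0 for v in token] for token in x]   # placeholder transform
        attn = "weights" if output_attention else None
        return new_x, attn
    return layer

def encoder_forward(x, layers, attn_mask=None, norm=None):
    """Mirror of the else branch of Encoder.forward (no conv_layers)."""
    attns = []
    for layer in layers:
        x, attn = layer(x, attn_mask=attn_mask)   # the same mask is passed to every layer
        attns.append(attn)
    if norm is not None:
        x = norm(x)
    return x, attns

x = [[0.0] * 8 for _ in range(7)]        # one batch element: (N=7, d_model=8)
layers = [make_layer(), make_layer()]    # e_layers = 2
out, attns = encoder_forward(x, layers)

print(len(out), len(out[0]), attns)
```

Passing output_attention=True to `make_layer` would fill the list with per-layer entries instead, which matches how the real flag switches between returning A and None.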

§6 Drill-Down Subcomponents

All four classes in this layer (Encoder / EncoderLayer / AttentionLayer / FullAttention) are covered in full in §5, so no separate document is needed.

Created: 2026-04-24
