Conv1d and the BCL Layout: Convolutions and FFN in Informer and PatchTST

Abstract

This note covers one key habit:
nn.Conv1d expects input in (B, C, L) layout, not the (B, L, C) layout common in time-series models.

0. File index

Item	Content
Function covered	`nn.Conv1d`
Source files covered	`Embed.py` / `Transformer_EncDec.py`
Models covered	Informer / PatchTST
Core layout	`(B, C, L)`
Common companion operations	`permute(0, 2, 1)` / `transpose(-1, 1)`

1. Level 1: Conv1d input layout

nn.Conv1d expects input of shape:

text

(N, C_in, L)

In this note's time-series context:

Dimension	Meaning
`N`	batch size
`C_in`	channel / feature / d_model
`L`	sequence length

But the main path of the model usually carries:

text

(B, L, C)

So the dimensions usually have to be swapped before entering Conv1d.
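
As a minimal sketch of this swap, the round trip through Conv1d could look like the following. The sizes (B=2, L=6, C=3, 16 output channels) are toy values assumed for illustration and are not taken from any model config.

```python
import torch
import torch.nn as nn

B, L, C = 2, 6, 3                 # toy sizes, assumed for illustration
x = torch.randn(B, L, C)          # (B, L, C): layout used on the model's main path

conv = nn.Conv1d(in_channels=C, out_channels=16, kernel_size=3, padding=1)

x_bcl = x.permute(0, 2, 1)        # (B, L, C) -> (B, C, L), what Conv1d expects
y_bcl = conv(x_bcl)               # (B, 16, L): padding=1 keeps L for kernel_size=3
y = y_bcl.permute(0, 2, 1)        # back to (B, L, 16) for the rest of the model

print(x_bcl.shape, y_bcl.shape, y.shape)
```

The second permute matters as much as the first: everything downstream of the convolution still assumes (B, L, C).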

2. Level 2: Informer's TokenEmbedding

Source:

python

class TokenEmbedding(nn.Module):
    def __init__(self, c_in, d_model):
        super(TokenEmbedding, self).__init__()
        padding = 1 if torch.__version__ >= "1.5.0" else 2
        self.tokenConv = nn.Conv1d(
            in_channels=c_in,
            out_channels=d_model,
            kernel_size=3,
            padding=padding,
            padding_mode="circular",
            bias=False,
        )

    def forward(self, x):
        x = self.tokenConv(x.permute(0, 2, 1)).transpose(1, 2)
        return x

Toy example:

text

x.shape = (B, L, C) = (2, 6, 3)
d_model = 16

Step by step:

text

x.permute(0, 2, 1):
  (2, 6, 3) -> (2, 3, 6)

tokenConv Conv1d(3 -> 16, kernel=3):
  (2, 3, 6) -> (2, 16, 6)

transpose(1, 2):
  (2, 16, 6) -> (2, 6, 16)

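Semantics:

The convolution aggregates the local window around each time step, and the original C=3 variables are projected to a d_model=16 dimensional representation.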
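
The shape trace above, and the effect of `padding_mode="circular"`, can be checked with a small sketch. It builds a standalone convolution with the same settings as `tokenConv` under the assumption that `padding=1` is selected; the perturbation test and the variable names are illustrative additions, not part of the original source.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Same settings as tokenConv, assuming the padding=1 branch is taken.
token_conv = nn.Conv1d(3, 16, kernel_size=3, padding=1,
                       padding_mode="circular", bias=False)

x = torch.randn(2, 6, 3)                               # (B, L, C)
out = token_conv(x.permute(0, 2, 1)).transpose(1, 2)   # (B, L, d_model)

# Perturb only the last time step and record which output positions move.
x2 = x.clone()
x2[:, -1, :] += 1.0
out2 = token_conv(x2.permute(0, 2, 1)).transpose(1, 2)
moved = (out2 - out).abs().amax(dim=(0, 2)) > 1e-6     # one flag per time step
moved_positions = torch.nonzero(moved).flatten().tolist()

print(out.shape)          # torch.Size([2, 6, 16])
print(moved_positions)    # [0, 4, 5]: position 0 is reached through the wrap-around
```

With circular padding the first and last time steps are treated as neighbors, so a change at the end of the window also shows up at position 0.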

3. Level 3: distilling in Informer's ConvLayer

Source:

python

class ConvLayer(nn.Module):
    def __init__(self, c_in):
        super(ConvLayer, self).__init__()
        self.downConv = nn.Conv1d(
            in_channels=c_in,
            out_channels=c_in,
            kernel_size=3,
            padding=2,
            padding_mode="circular",
        )
        self.norm = nn.BatchNorm1d(c_in)
        self.activation = nn.ELU()
        self.maxPool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)

    def forward(self, x):
        x = self.downConv(x.permute(0, 2, 1))
        x = self.norm(x)
        x = self.activation(x)
        x = self.maxPool(x)
        x = x.transpose(1, 2)
        return x

Toy example:

text

x: (B, L, d_model) = (2, 6, 16)

Step by step:

text

permute:
  (2, 6, 16) -> (2, 16, 6)

Conv1d(16 -> 16, kernel=3):
  (2, 16, 6) -> (2, 16, 8)  # kernel=3 with circular padding=2 lengthens the sequence: 6 + 2*2 - 2 = 8

BatchNorm1d + ELU:
  (2, 16, 8) -> (2, 16, 8)

MaxPool1d(kernel=3, stride=2, padding=1):
  (2, 16, 8) -> (2, 16, 4)

transpose:
  (2, 16, 4) -> (2, 4, 16)

The semantics of this module:

It compresses the sequence length between encoder layers, which is the distilling step from the Informer paper.
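
The lengths in the trace above follow from the standard output-length formula shared by Conv1d and MaxPool1d (floor mode). The helper names below (`conv1d_out_len`, `distill_out_len`) are hypothetical and exist only for this sketch.

```python
def conv1d_out_len(length, kernel_size, padding, stride=1, dilation=1):
    """Output length of Conv1d / MaxPool1d along the time axis (floor mode)."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


def distill_out_len(length):
    """Lengths after downConv (k=3, p=2) and after maxPool (k=3, s=2, p=1)."""
    after_conv = conv1d_out_len(length, kernel_size=3, padding=2)
    after_pool = conv1d_out_len(after_conv, kernel_size=3, padding=1, stride=2)
    return after_conv, after_pool


for L in (6, 96):
    print(L, distill_out_len(L))
# 6 (8, 4)
# 96 (98, 49)
```

With these settings the output is roughly, not exactly, half the input length, because the convolution adds two positions before the pooling step.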

4. Level 4: `Conv1d(kernel_size=1)` inside EncoderLayer

Source:

python

class EncoderLayer(nn.Module):
    def __init__(self, attention, d_model, d_ff=None, dropout=0.1, activation="relu"):
        super(EncoderLayer, self).__init__()
        d_ff = d_ff or 4 * d_model
        self.attention = attention
        self.conv1 = nn.Conv1d(in_channels=d_model, out_channels=d_ff, kernel_size=1)
        self.conv2 = nn.Conv1d(in_channels=d_ff, out_channels=d_model, kernel_size=1)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        self.activation = F.relu if activation == "relu" else F.gelu

    def forward(self, x, attn_mask=None, tau=None, delta=None):
        new_x, attn = self.attention(x, x, x, attn_mask=attn_mask, tau=tau, delta=delta)
        x = x + self.dropout(new_x)

        y = x = self.norm1(x)
        y = self.dropout(self.activation(self.conv1(y.transpose(-1, 1))))
        y = self.dropout(self.conv2(y).transpose(-1, 1))

        return self.norm2(x + y), attn

In PatchTST's EncoderLayer, this Conv1d(kernel_size=1) sits in the FFN branch after attention. The figure below is redrawn for this note's toy example; the point is not the full EncoderLayer but the layout chain BLC -> BCL -> Conv1d(kernel=1) -> BLC.

Toy example:

text

x = (B, L, d_model) = (8, 6, 16)
d_ff = 64

FFN part:

text

y.transpose(-1, 1):
  (8, 6, 16) -> (8, 16, 6)

conv1 Conv1d(16 -> 64, kernel=1):
  (8, 16, 6) -> (8, 64, 6)

conv2 Conv1d(64 -> 16, kernel=1):
  (8, 64, 6) -> (8, 16, 6)

transpose(-1, 1):
  (8, 16, 6) -> (8, 6, 16)
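
The same chain can be run as a short sketch. Dropout and the two LayerNorms are left out on purpose so that only the layout changes remain; the layers here are freshly initialized stand-ins, not the trained modules.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, d_ff = 16, 64
conv1 = nn.Conv1d(d_model, d_ff, kernel_size=1)
conv2 = nn.Conv1d(d_ff, d_model, kernel_size=1)

x = torch.randn(8, 6, d_model)            # (B, L, d_model), as after norm1

# For a 3-D tensor, transpose(-1, 1) and permute(0, 2, 1) give the same result.
same_layout = torch.equal(x.transpose(-1, 1), x.permute(0, 2, 1))

y = x.transpose(-1, 1)                    # (8, 16, 6)
y = F.relu(conv1(y))                      # (8, 64, 6)
y = conv2(y).transpose(-1, 1)             # (8, 6, 16)
out = x + y                               # residual add needs matching (B, L, d_model)

print(same_layout, y.shape, out.shape)
```

The final transpose is what makes the residual addition `x + y` valid, since both operands must be in (B, L, d_model).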

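5. Level 5: why `kernel_size=1` is equivalent to a position-wise FFN

A convolution with kernel_size=1 looks only at the current position, not at its left or right neighbors.

So for each time position l: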

text

y[n, :, l] = W @ x[n, :, l] + b    # W: (C_out, C_in), shared by every sample n and position l

It mixes only the channel / hidden dimension, not the time dimension.

This is the same thing as the position-wise feed-forward network in the original Transformer paper:

text

Independently at each position:
d_model -> d_ff -> d_model
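
One way to check this equivalence numerically is to copy the weights of a `Conv1d(kernel_size=1)` into an `nn.Linear` and compare outputs. This is a sketch with toy sizes; the copy step is only for the comparison and does not appear in the original source.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_ff = 16, 64

conv = nn.Conv1d(d_model, d_ff, kernel_size=1)
linear = nn.Linear(d_model, d_ff)

# Conv1d weight is (d_ff, d_model, 1); dropping the kernel axis gives a Linear weight.
with torch.no_grad():
    linear.weight.copy_(conv.weight.squeeze(-1))
    linear.bias.copy_(conv.bias)

x = torch.randn(8, 6, d_model)                          # (B, L, d_model)
y_conv = conv(x.transpose(-1, 1)).transpose(-1, 1)      # BLC -> BCL -> conv -> BLC
y_lin = linear(x)                                       # Linear acts on the last axis

same = torch.allclose(y_conv, y_lin, atol=1e-5)
print(same, y_conv.shape)
```

Both paths agree up to floating-point tolerance, which is why the Conv1d pair in EncoderLayer can be read as an ordinary two-layer MLP applied at every position.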

6. A small worked example

Assume Conv1d(in_channels=1, out_channels=1, kernel_size=3, bias=False) with weights:

text

w = [0.25, 0.50, 0.25]

Single-channel input sequence:

text

x = [2, 4, 6, 8, 10]

Ignoring padding, the first output window:

text

[2, 4, 6] -> 2*0.25 + 4*0.50 + 6*0.25 = 4

The second output window:

text

[4, 6, 8] -> 4*0.25 + 6*0.50 + 8*0.25 = 6

So a convolution can be understood as:

A learnable local window that slides along the time axis to extract local patterns.
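
The arithmetic above can be reproduced in plain Python. The function name `correlate_valid` is hypothetical. It slides the window without flipping the kernel, which matches what Conv1d computes (a cross-correlation), here with no padding.

```python
def correlate_valid(x, w):
    """Slide w over x with stride 1 and no padding; one weighted sum per window."""
    k = len(w)
    return [sum(x[i + j] * w[j] for j in range(k)) for i in range(len(x) - k + 1)]


x = [2, 4, 6, 8, 10]
w = [0.25, 0.50, 0.25]

y = correlate_valid(x, w)
print(y)  # [4.0, 6.0, 8.0]
```

Five inputs and a window of three give three outputs, which is the length shrinkage that padding is meant to compensate for.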

7. Common mistakes

7.1 Feeding `(B, L, C)` directly into Conv1d

Wrong:

python

self.conv(x)  # x: (B, L, C)

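Conv1d treats L as the channel dimension and C as the sequence length. If L differs from in_channels, PyTorch raises a RuntimeError; if the two happen to be equal, the call runs and silently produces wrong results.

Correct: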

python

self.conv(x.permute(0, 2, 1))
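
A short sketch of both cases, using toy sizes assumed for illustration (L=6, C=3), shows the failure and the fix side by side:

```python
import torch
import torch.nn as nn

conv = nn.Conv1d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
x = torch.randn(2, 6, 3)              # (B, L, C) with L=6, C=3

try:
    conv(x)                           # Conv1d reads this as 6 channels, length 3
    raised = False
except RuntimeError as err:
    raised = True
    print("RuntimeError:", err)

y = conv(x.permute(0, 2, 1))          # (B, C, L) -> (2, 16, 6)
print(raised, y.shape)
```

The error only appears because L and in_channels differ here, so a shape assertion before the convolution is a cheap safeguard for the case where they coincide.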

7.2 Confusing `Conv1d(kernel=3)` with `Conv1d(kernel=1)`

Form	Mixes time neighbors?	Typical use
`Conv1d(kernel_size=3)`	yes	local temporal patterns, distilling
`Conv1d(kernel_size=1)`	no	position-wise FFN
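
The "mixes time neighbors" column can be tested empirically: perturb one time step and see which output positions change. The helper `changed_positions` and all sizes are hypothetical, chosen only for this sketch, and zero padding is used so that edge effects do not matter.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


def changed_positions(conv, length=6, channels=4, t=3):
    """Indices of output time steps affected by changing input time step t."""
    x = torch.randn(1, channels, length)      # (B, C, L)
    x2 = x.clone()
    x2[:, :, t] += 1.0
    with torch.no_grad():
        diff = (conv(x2) - conv(x)).abs().amax(dim=(0, 1))
    return torch.nonzero(diff > 1e-6).flatten().tolist()


k3 = nn.Conv1d(4, 4, kernel_size=3, padding=1)
k1 = nn.Conv1d(4, 4, kernel_size=1)

print(changed_positions(k3))  # [2, 3, 4]: the step itself and both neighbors
print(changed_positions(k1))  # [3]: only the perturbed step
```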

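8. One-sentence summary

The core idea behind Conv1d:

The input must be (B, C, L); kernel_size=3 looks at a local time window, while kernel_size=1 only projects the channel dimension at each position.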
