LayerNorm, BatchNorm1d, Dropout: Stabilizing Functions in Transformer Layers

Abstract

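This note covers only three stabilizing functions:
LayerNorm stabilizes the hidden dimension of each token, BatchNorm1d stabilizes each convolution channel, and Dropout randomly zeroes elements during training to reduce overfitting.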

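0. File index

Item	Content
Functions covered	`nn.LayerNorm` / `nn.BatchNorm1d` / `nn.Dropout`
Main source file	`Transformer_EncDec.py`
Models covered	Informer / PatchTST
Key difference	LayerNorm normalizes over the last dimension; BatchNorm1d normalizes per channel; Dropout leaves the shape unchanged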

1. Level 1: Where the three appear in the source

Transformer EncoderLayer:

python

import torch.nn as nn
import torch.nn.functional as F


class EncoderLayer(nn.Module):
    def __init__(self, attention, d_model, d_ff=None, dropout=0.1, activation="relu"):
        super(EncoderLayer, self).__init__()
        d_ff = d_ff or 4 * d_model
        self.attention = attention
        # kernel_size=1 convolutions act as a position-wise feed-forward network
        self.conv1 = nn.Conv1d(in_channels=d_model, out_channels=d_ff, kernel_size=1)
        self.conv2 = nn.Conv1d(in_channels=d_ff, out_channels=d_model, kernel_size=1)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        self.activation = F.relu if activation == "relu" else F.gelu

    def forward(self, x, attn_mask=None, tau=None, delta=None):
        # Self-attention branch, then residual add with dropout
        new_x, attn = self.attention(x, x, x, attn_mask=attn_mask, tau=tau, delta=delta)
        x = x + self.dropout(new_x)

        # LayerNorm over the last dimension; x is (B, L, d_model)
        y = x = self.norm1(x)
        # Conv1d expects (B, C, L), so swap the last two dimensions first
        y = self.dropout(self.activation(self.conv1(y.transpose(-1, 1))))
        # Project back to d_model and restore the (B, L, d_model) layout
        y = self.dropout(self.conv2(y).transpose(-1, 1))

        return self.norm2(x + y), attn
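To make the layout changes in the feed-forward branch concrete, here is a minimal shape trace. It is only a sketch: the sizes (`B=2`, `L=6`, `d_model=16`, `d_ff=64`) are toy values chosen for illustration, and the attention call is left out so that the block stays self-contained.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, L, d_model, d_ff = 2, 6, 16, 64  # toy sizes, assumed for illustration

conv1 = nn.Conv1d(in_channels=d_model, out_channels=d_ff, kernel_size=1)
conv2 = nn.Conv1d(in_channels=d_ff, out_channels=d_model, kernel_size=1)
norm = nn.LayerNorm(d_model)
dropout = nn.Dropout(0.1)

x = torch.randn(B, L, d_model)          # (B, L, d_model), the layout LayerNorm expects
y = x.transpose(-1, 1)                  # (B, d_model, L), the layout Conv1d expects
shape_after_transpose = tuple(y.shape)
y = dropout(F.relu(conv1(y)))           # (B, d_ff, L)
shape_after_conv1 = tuple(y.shape)
y = dropout(conv2(y).transpose(-1, 1))  # back to (B, L, d_model)
out = norm(x + y)                       # residual add, then LayerNorm

print(shape_after_transpose, shape_after_conv1, tuple(out.shape))
```

Neither Dropout nor LayerNorm changes a shape anywhere in this trace; only the transposes and the convolutions do.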

Informer ConvLayer:

python

import torch.nn as nn


class ConvLayer(nn.Module):
    def __init__(self, c_in):
        super(ConvLayer, self).__init__()
        self.downConv = nn.Conv1d(...)  # arguments elided in the original
        self.norm = nn.BatchNorm1d(c_in)  # one mean/variance pair per channel
        self.activation = nn.ELU()
        self.maxPool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)

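In PatchTST, LayerNorm and Dropout appear mainly inside EncoderLayer, while BatchNorm1d is more typical of Informer's ConvLayer. The figure below does not draw the full network; it uses this note's three toy examples to compare which dimensions each function operates on.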

2. Level 2: `LayerNorm(d_model)`

In these models LayerNorm usually receives:

text

x.shape = (B, L, d_model)

Source:

python

self.norm1 = nn.LayerNorm(d_model)
self.norm2 = nn.LayerNorm(d_model)

Meaning:

For every sample and every time step, the d_model-dimensional vector is normalized on its own.

Toy example:

text

x.shape = (2, 6, 16)
LayerNorm(16)
output shape = (2, 6, 16)

The shape does not change.

What changes is the mean and variance across the hidden dimension inside each token.
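The toy case above can be checked directly. The sketch below assumes a freshly constructed `nn.LayerNorm`, whose learnable scale and bias start at 1 and 0, so the output of each token should have mean close to 0 and variance close to 1.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 6, 16) * 3.0 + 5.0   # (B, L, d_model), deliberately off-center
layer_norm = nn.LayerNorm(16)
out = layer_norm(x)

# One statistic per token: reduce over the hidden dimension only.
token_mean = out.mean(dim=-1)                 # shape (2, 6)
token_var = out.var(dim=-1, unbiased=False)   # shape (2, 6)

print(tuple(out.shape))
print(token_mean.abs().max().item(), token_var.min().item(), token_var.max().item())
```

The shape stays `(2, 6, 16)`, while each of the 12 tokens ends up with its own near-zero mean and near-unit variance.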

3. Level 3: A LayerNorm example you can compute by hand

Look at a single token:

text

x[b=0, t=0, :] = [1, 2, 3, 4]

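Mean:

text

mean = (1 + 2 + 3 + 4) / 4 = 2.5

Variance:

text

var = ((1-2.5)^2 + (2-2.5)^2 + (3-2.5)^2 + (4-2.5)^2) / 4
    = 1.25

Normalization:

text

y = (x - mean) / sqrt(var + eps)
  ≈ [-1.342, -0.447, 0.447, 1.342]    (with eps = 1e-5, the nn.LayerNorm default)

So the intuition for LayerNorm is:

Each token is standardized on its own, using only its own hidden values.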
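The hand calculation can be replayed in plain Python. This is a sketch of the arithmetic only: it assumes `eps = 1e-5` (the `nn.LayerNorm` default) and ignores the learnable scale and bias.

```python
import math

x = [1.0, 2.0, 3.0, 4.0]
eps = 1e-5  # assumed value; matches the nn.LayerNorm default

mean = sum(x) / len(x)
var = sum((v - mean) ** 2 for v in x) / len(x)   # divide by N, not N - 1
normalized = [(v - mean) / math.sqrt(var + eps) for v in x]

print(mean, var)
print([round(v, 4) for v in normalized])
```

Rounded to four decimals, the result is `[-1.3416, -0.4472, 0.4472, 1.3416]`. The values are symmetric around zero because the input is evenly spaced.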

4. Level 4: `BatchNorm1d(c_in)`

In Informer's ConvLayer:

python

self.norm = nn.BatchNorm1d(c_in)

At this point the input is already in Conv1d layout:

text

x.shape = (B, C, L)

Toy example:

text

x.shape = (2, 16, 6)
BatchNorm1d(16)
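output shape = (2, 16, 6)

The 16 in BatchNorm1d(16) is the number of channels.

In training mode it computes, for each channel, the mean and variance over the batch and length dimensions; in eval mode it uses the running estimates accumulated during training instead.

Intuition:

For each channel, keep the distribution across the whole batch and time axis stable.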
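The per-channel statistics can be reproduced by hand. The sketch below assumes training mode and a freshly constructed layer (scale 1, bias 0), and compares `nn.BatchNorm1d` against a manual reduction over the batch and length dimensions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 16, 6) * 2.0 + 1.0   # (B, C, L)

batch_norm = nn.BatchNorm1d(16)
batch_norm.train()                       # training mode uses the current batch statistics
out = batch_norm(x)

# Manual version: one mean and one variance per channel, reduced over B and L.
mean = x.mean(dim=(0, 2), keepdim=True)                 # shape (1, 16, 1)
var = x.var(dim=(0, 2), unbiased=False, keepdim=True)   # shape (1, 16, 1)
manual = (x - mean) / torch.sqrt(var + batch_norm.eps)

print(tuple(out.shape), tuple(mean.shape))
print(torch.allclose(out, manual, atol=1e-4))
```

There are 16 means here, one per channel, each computed from 2 × 6 = 12 values.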

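5. Level 5: LayerNorm vs. BatchNorm1d

Function	Typical input	Dimension normalized	Typical location
`LayerNorm(d_model)`	`(B, L, d_model)`	last (hidden) dimension	after a Transformer residual add
`BatchNorm1d(C)`	`(B, C, L)`	per channel, with statistics over batch and length	after Conv1d

The most practical rule of thumb: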

text

If the tensor is (B, L, d_model), think LayerNorm first.
If the tensor is (B, C, L) and has just come out of a Conv1d, it is probably BatchNorm1d.
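Another way to see the difference is to count how many statistics each reduction produces. The sketch below uses plain tensor reductions rather than the layers themselves; the sizes are the same toy values as before.

```python
import torch

B, L, C = 2, 6, 16
x_tokens = torch.randn(B, L, C)          # (B, L, d_model) layout
x_channels = x_tokens.transpose(1, 2)    # (B, C, L) layout, same values

# LayerNorm-style reduction: over the hidden dimension only.
layer_stats = x_tokens.mean(dim=-1)          # one mean per (sample, position)
# BatchNorm1d-style reduction: over batch and length.
batch_stats = x_channels.mean(dim=(0, 2))    # one mean per channel

print(tuple(layer_stats.shape))   # (2, 6)
print(tuple(batch_stats.shape))   # (16,)
```

LayerNorm keeps one statistic per token, so it does not depend on the batch; BatchNorm1d keeps one per channel, so it does.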

6. Level 6: `Dropout(p)`

Source:

python

self.dropout = nn.Dropout(dropout)

In forward:

python

x = x + self.dropout(new_x)
y = self.dropout(self.activation(...))
y = self.dropout(...)

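What Dropout does:

During training it zeroes each element independently with probability p and scales the surviving elements by 1/(1-p); at inference it passes the input through unchanged.

It does not change the shape.

Toy example: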

text

x.shape = (2, 6, 16)
dropout(x).shape = (2, 6, 16)

With p=0.5, some elements are zeroed during training:

text

[1.0, 2.0, 3.0, 4.0]
may become
[0.0, 4.0, 0.0, 8.0]

The 4.0 and 8.0 appear because PyTorch scales the surviving elements during training, here by 1/(1-p) = 2, so that the expected value is unchanged.
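The scaling rule can be checked without guessing which elements get dropped. The sketch below uses `F.dropout` with an explicit `training` flag; the mask is random, so only properties that hold for every mask are checked. The large all-ones tensor is an illustrative assumption used to show that the mean is approximately preserved.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
p = 0.5
x = torch.tensor([1.0, 2.0, 3.0, 4.0])

train_out = F.dropout(x, p=p, training=True)    # random mask, survivors scaled by 1 / (1 - p)
eval_out = F.dropout(x, p=p, training=False)    # identity

# Every training-time element is either dropped (0) or scaled (x / (1 - p)).
is_zero = train_out == 0
is_scaled = train_out == x / (1 - p)
all_zero_or_scaled = bool((is_zero | is_scaled).all())

# On a large tensor the mean stays close to the original mean of 1.0.
big_mean = F.dropout(torch.ones(100_000), p=p, training=True).mean().item()

print(train_out, eval_out)
print(all_zero_or_scaled, big_mean)
```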

7. Common mistakes

7.1 Assuming LayerNorm normalizes across the batch

It does not.

LayerNorm(d_model) normalizes the last dimension of each token on its own.
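A quick way to convince yourself is to change one sample in the batch and confirm that the output for the other sample stays the same. This is a minimal sketch with toy sizes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer_norm = nn.LayerNorm(16)

x = torch.randn(2, 6, 16)
x_changed = x.clone()
x_changed[1] = torch.randn(6, 16) * 100.0   # replace only the second sample

out = layer_norm(x)
out_changed = layer_norm(x_changed)

# Sample 0 is unaffected by what happened to sample 1.
sample0_same = torch.allclose(out[0], out_changed[0])
sample1_same = torch.allclose(out[1], out_changed[1])
print(sample0_same, sample1_same)
```

With BatchNorm1d in training mode the same experiment would change both samples, because the statistics are shared across the batch.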

7.2 Feeding `(B,L,C)` to BatchNorm1d

BatchNorm1d(C) usually expects:

text

(B, C, L)

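If you feed (B,L,C) in directly, it treats L as the channel dimension. When L differs from C this typically fails with a shape error; when they happen to match, it silently normalizes over the wrong axis.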
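The sketch below reproduces the mistake and the usual fix. It assumes a PyTorch version in which the channel-count mismatch surfaces as a `RuntimeError`; the exact error message may differ between versions.

```python
import torch
import torch.nn as nn

B, L, C = 2, 6, 16
batch_norm = nn.BatchNorm1d(C)
x = torch.randn(B, L, C)               # (B, L, C): the layout LayerNorm expects

try:
    batch_norm(x)                      # dim 1 has size L=6, but the layer expects C=16
    raised = False
except RuntimeError as err:
    raised = True
    print("shape mismatch:", err)

out = batch_norm(x.transpose(1, 2))    # (B, C, L): correct layout
out = out.transpose(1, 2)              # back to (B, L, C) for the rest of the model
print(raised, tuple(out.shape))
```

The case L == C deserves the most care, because no error is raised and the result is simply wrong.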

7.3 Forgetting that Dropout differs between training and inference

Dropout in:

python

model.train()

and in:

python

model.eval()

behaves differently.

Training randomly zeroes elements; inference does not.
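The mode switch propagates to every submodule, which is easy to verify. The model below is a hypothetical two-layer toy, used only to show how the `training` flag changes Dropout's behavior.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Hypothetical toy model, only used to show how the mode flag propagates.
model = nn.Sequential(nn.Linear(16, 16), nn.Dropout(0.5))
x = torch.randn(2, 6, 16)

model.train()
train_flag = model[1].training
train_a, train_b = model(x), model(x)      # two passes draw two different masks

model.eval()
eval_flag = model[1].training
with torch.no_grad():
    eval_a, eval_b = model(x), model(x)    # dropout is a no-op, so outputs repeat

print(train_flag, eval_flag)
print(torch.equal(train_a, train_b), torch.equal(eval_a, eval_b))
```

If validation metrics fluctuate between identical runs, a missing `model.eval()` call is one of the first things worth checking.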

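8. One-line summary

A cheat sheet for the three functions:

text

LayerNorm: normalizes the last dimension of each token
BatchNorm1d: normalizes per channel on Conv1d-style input
Dropout: randomly zeroes elements during training; shape unchanged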
