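Embedding and Positional Encoding: Input Representations in Informer and PatchTST
Abstract
This note covers only the input representation layer:
The model does not run attention on the raw values directly. It first maps the values, the time features, and the position information into tokens in the same d_model-dimensional space.
0. File index
| Item | Content |
|---|---|
| Source file | ts_benchmark/baselines/time_series_library/layers/Embed.py |
| Classes covered | DataEmbedding / TokenEmbedding / PositionalEmbedding / TemporalEmbedding / PatchEmbedding |
| Models covered | Informer / PatchTST |
| Core output | (..., d_model) |
1. Level 1: DataEmbedding in Informer
Source code: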
python
import math  # used by PositionalEmbedding further down

import torch
import torch.nn as nn


class DataEmbedding(nn.Module):
    def __init__(self, c_in, d_model, embed_type="fixed", freq="h", dropout=0.1):
        super(DataEmbedding, self).__init__()
        self.value_embedding = TokenEmbedding(c_in=c_in, d_model=d_model)
        self.position_embedding = PositionalEmbedding(d_model=d_model)
        self.temporal_embedding = (
            TemporalEmbedding(d_model=d_model, embed_type=embed_type, freq=freq)
            if embed_type != "timeF"
            else TimeFeatureEmbedding(d_model=d_model, embed_type=embed_type, freq=freq)
        )
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, x, x_mark):
        if x_mark is None:
            # No timestamp features: value + position only.
            x = self.value_embedding(x) + self.position_embedding(x)
        else:
            x = (
                self.value_embedding(x)
                + self.temporal_embedding(x_mark)
                + self.position_embedding(x)
            )
        return self.dropout(x)
Informer's input representation is the sum of three embeddings:
text
value_embedding
+ temporal_embedding
+ position_embedding
Toy example:
text
x: (B, L, C) = (2, 6, 3)
x_mark: (B, L, timedim) = (2, 6, 4)
d_model = 16
Output:
(2, 6, 16)
2. Level 2: TokenEmbedding
Source code:
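python
class TokenEmbedding(nn.Module):
    def __init__(self, c_in, d_model):
        super(TokenEmbedding, self).__init__()
        # As in the source file, this compares version strings lexicographically.
        padding = 1 if torch.__version__ >= "1.5.0" else 2
        self.tokenConv = nn.Conv1d(
            in_channels=c_in,
            out_channels=d_model,
            kernel_size=3,
            padding=padding,
            padding_mode="circular",
            bias=False,
        )

    def forward(self, x):
        # (B, L, C) -> (B, C, L) for Conv1d, then back to (B, L, d_model).
        x = self.tokenConv(x.permute(0, 2, 1)).transpose(1, 2)
        return x
Shape:
text
(B, L, C) -> permute -> (B, C, L)
Conv1d(C -> d_model) -> (B, d_model, L)
transpose -> (B, L, d_model)
This step projects the raw variable values into the model's hidden representation. With kernel_size=3 and circular padding of 1, the sequence length L is unchanged.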
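To make the Conv1d step concrete, here is a dependency-free sketch in plain Python that mimics a kernel-3 convolution with circular padding of 1 on a single sample. The function name `token_embedding`, the toy input, and the hand-picked weights are all hypothetical; in the real layer the weights are learned and the computation runs on tensors.

```python
def token_embedding(x, weight):
    """Kernel-3 circular cross-correlation over time.

    x:      L rows, each with C values        -> shape (L, C)
    weight: d_model filters, each C x 3 taps  -> shape (d_model, C, 3)
    returns L rows, each with d_model values  -> shape (L, d_model)
    """
    length = len(x)
    out = []
    for t in range(length):
        row = []
        for filt in weight:                    # one output channel
            acc = 0.0
            for c, taps in enumerate(filt):    # one input channel
                for k, w in enumerate(taps):   # taps at t-1, t, t+1
                    acc += w * x[(t + k - 1) % length][c]  # wrap around the ends
            row.append(acc)
        out.append(row)
    return out


# Hypothetical toy input: L = 4 time steps, C = 2 variables.
x = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]

# Hypothetical weights with d_model = 3 (real weights are learned).
weight = [
    [[0.0, 1.0, 0.0], [0.0, 0.0, 0.0]],   # copies variable 0
    [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]],   # copies variable 1
    [[1.0, 0.0, -1.0], [0.0, 0.0, 0.0]],  # x[t-1] - x[t+1] of variable 0
]

tokens = token_embedding(x, weight)
print(len(tokens), len(tokens[0]))  # 4 3
print(tokens[0])                    # [1.0, 10.0, 2.0]
```

The third output channel at t = 0 reads the last time step as its left neighbour, which is what circular padding means: the series is treated as if it wrapped around.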
3. Level 3: TemporalEmbedding
Source structure:
python
class TemporalEmbedding(nn.Module):
    def __init__(self, d_model, embed_type="fixed", freq="h"):
        super(TemporalEmbedding, self).__init__()
        minute_size = 4
        hour_size = 24
        weekday_size = 7
        day_size = 32
        month_size = 13
        Embed = FixedEmbedding if embed_type == "fixed" else nn.Embedding
        if freq == "t":
            self.minute_embed = Embed(minute_size, d_model)
        self.hour_embed = Embed(hour_size, d_model)
        self.weekday_embed = Embed(weekday_size, d_model)
        self.day_embed = Embed(day_size, d_model)
        self.month_embed = Embed(month_size, d_model)
It handles the discrete fields of a timestamp:
text
month / day / weekday / hour / minute
For example, hour:
text
hour = 13
nn.Embedding(24, d_model)
-> look up one d_model-dimensional vector
So nn.Embedding(num_embeddings, embedding_dim) can be understood as:
A lookup table that turns a discrete index into a vector. With embed_type="fixed" the table is a FixedEmbedding; otherwise it is a learnable nn.Embedding.
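The lookup idea can be sketched without PyTorch. The snippet below is only an illustration: `make_table` and `temporal_embedding` are hypothetical helpers, the random tables stand in for embedding weights, and summing the per-field vectors is an assumption, since the excerpt above shows only `__init__` and not `forward`.

```python
import random

D_MODEL = 4


def make_table(num_rows, d_model, seed):
    """Stand-in for an embedding table: num_rows vectors of length d_model."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(d_model)] for _ in range(num_rows)]


# Table sizes copied from TemporalEmbedding.__init__ above.
TABLES = {
    "month": make_table(13, D_MODEL, seed=0),
    "day": make_table(32, D_MODEL, seed=1),
    "weekday": make_table(7, D_MODEL, seed=2),
    "hour": make_table(24, D_MODEL, seed=3),
}


def temporal_embedding(stamp):
    """Look up one row per field and add them up (the sum is an assumption)."""
    total = [0.0] * D_MODEL
    for field, index in stamp.items():
        row = TABLES[field][index]  # the lookup: integer index -> vector
        total = [a + b for a, b in zip(total, row)]
    return total


stamp = {"month": 7, "day": 15, "weekday": 2, "hour": 13}
vec = temporal_embedding(stamp)
print(len(vec))  # 4
```

The month and day tables have 13 and 32 rows, presumably so that 1-based month and day values can index the table directly.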
4. Level 4: PositionalEmbedding
Source code:
python
class PositionalEmbedding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super(PositionalEmbedding, self).__init__()
        pe = torch.zeros(max_len, d_model).float()
        # Sic: "require_grad" (not "requires_grad") only sets an unused attribute.
        # It is harmless, because a tensor from torch.zeros does not require grad.
        pe.require_grad = False
        position = torch.arange(0, max_len).float().unsqueeze(1)
        div_term = (
            torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model)
        ).exp()
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer("pe", pe)

    def forward(self, x):
        # Only x.size(1), the sequence length, is used; the values of x are ignored.
        return self.pe[:, : x.size(1)]
Output:
text
self.pe.shape = (1, max_len, d_model)
forward returns (1, L, d_model)
It is added to the value embedding by broadcasting:
text
(B, L, d_model) + (1, L, d_model) -> (B, L, d_model)
register_buffer means:
pe is not a trainable parameter, but it is saved, loaded, and moved to the GPU together with the model.
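The sine/cosine table can be reproduced in plain Python to see the actual numbers. This is a sketch under the assumption that it follows the same formula as the `div_term` line above; `sinusoidal_table` and `positional_embedding` are hypothetical names, and a small `max_len` is used only to keep the example short.

```python
import math


def sinusoidal_table(max_len, d_model):
    """Build the (max_len, d_model) sine/cosine table row by row."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            # Same frequency as div_term: exp(i * -(log(10000) / d_model)).
            freq = math.exp(i * -(math.log(10000.0) / d_model))
            pe[pos][i] = math.sin(pos * freq)          # even columns
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(pos * freq)  # odd columns
    return pe


def positional_embedding(pe, seq_len):
    """Mirror of forward: return the first seq_len rows, independent of the data."""
    return pe[:seq_len]


pe = sinusoidal_table(max_len=8, d_model=4)
rows = positional_embedding(pe, seq_len=3)
print(rows[0])                         # [0.0, 1.0, 0.0, 1.0]
print([round(v, 4) for v in rows[1]])  # [0.8415, 0.5403, 0.01, 1.0]
```

Position 0 is always `[0, 1, 0, 1, ...]` because sin(0) = 0 and cos(0) = 1. Later columns use lower frequencies, so they change more slowly from one position to the next.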
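5. Level 5: PatchEmbedding in PatchTST
PatchTST does not use raw time steps as tokens. It first slices the series into patches.
Note
A dedicated deep dive into ReplicationPad1d and unfold is in: [[../02-PatchTST/01-ReplicationPad1d与unfold-PatchTST-PatchEmbedding|01-ReplicationPad1d与unfold-PatchTST-PatchEmbedding]]
This section follows the order of PatchEmbedding.forward and places both operations back into the main embedding storyline.
Source code: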
python
class PatchEmbedding(nn.Module):
    def __init__(self, d_model, patch_len, stride, padding, dropout):
        super(PatchEmbedding, self).__init__()
        self.patch_len = patch_len
        self.stride = stride
        self.padding_patch_layer = nn.ReplicationPad1d((0, padding))
        self.value_embedding = nn.Linear(patch_len, d_model, bias=False)
        self.position_embedding = PositionalEmbedding(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        n_vars = x.shape[1]                  # x: (B, C, T)
        x = self.padding_patch_layer(x)      # (B, C, T + padding)
        x = x.unfold(dimension=-1, size=self.patch_len, step=self.stride)  # (B, C, patch_num, patch_len)
        x = torch.reshape(x, (x.shape[0] * x.shape[1], x.shape[2], x.shape[3]))  # (B*C, patch_num, patch_len)
        x = self.value_embedding(x) + self.position_embedding(x)  # (B*C, patch_num, d_model)
        return self.dropout(x), n_vars
5.1 Start with a smaller toy example
To make every step visible, this walkthrough uses a small shape instead of a realistic one:
text
B = 1
C = 2
T = 6
patch_len = 3
stride = 2
padding = 2
d_model = 4
On entry to PatchEmbedding.forward(x), x already has the shape:
text
x.shape = (B, C, T) = (1, 2, 6)
Concrete values:
python
x = torch.tensor([
    [
        [1., 2., 3., 4., 5., 6.],        # variable 0
        [10., 20., 30., 40., 50., 60.],  # variable 1
    ]
])
That is:
text
batch 0:
variable 0: [1, 2, 3, 4, 5, 6]
variable 1: [10, 20, 30, 40, 50, 60]
5.2 n_vars = x.shape[1]
Code:
python
n_vars = x.shape[1]
Because:
text
x.shape = (1, 2, 6)
we get:
text
n_vars = 2
Here n_vars is the number of variables, that is, C. PatchTST later merges B and C into a single B*C dimension, and after the encoder it needs n_vars to split that dimension back apart.
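5.3 x = self.padding_patch_layer(x)
In __init__:
python
self.padding_patch_layer = nn.ReplicationPad1d((0, padding))
Here:
text
padding = 2
ReplicationPad1d((0, 2)) means:
text
pad 0 values on the left
pad 2 values on the right
the padded values are not 0; they copy the last value at the right boundary
Input:
text
x.shape = (1, 2, 6)
After padding:
text
x.shape = (1, 2, 8)
Concrete values:
text
variable 0:
[1, 2, 3, 4, 5, 6]
-> [1, 2, 3, 4, 5, 6, 6, 6]
variable 1:
[10, 20, 30, 40, 50, 60]
-> [10, 20, 30, 40, 50, 60, 60, 60]
Why pad only on the right?
PatchTST slices patches from left to right. The earlier patches have all their values; only the last patch can run short, so padding is added only at the right end.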
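The right-side replication padding can be mimicked in a couple of lines of plain Python. This is a sketch for the toy series only; `replication_pad_right` is a hypothetical helper that stands in for `nn.ReplicationPad1d((0, padding))` on a single row.

```python
def replication_pad_right(series, padding):
    """Append `padding` copies of the last value; the left side is untouched."""
    return series + [series[-1]] * padding


# Toy input, shape (C, T) = (2, 6), for the single batch element.
x = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],        # variable 0
    [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],  # variable 1
]

padded = [replication_pad_right(var, padding=2) for var in x]
for var in padded:
    print(var)
# [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 6.0, 6.0]
# [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 60.0, 60.0]
```

Each variable is padded with its own last value, so the two rows never mix.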
5.4 x = x.unfold(dimension=-1, size=patch_len, step=stride)
Code:
python
x = x.unfold(dimension=-1, size=self.patch_len, step=self.stride)
Here:
text
dimension = -1 # the last dimension, i.e. the time dimension
size = patch_len = 3
step = stride = 2
Input:
text
x.shape = (1, 2, 8)
Output:
text
x.shape = (1, 2, 3, 3)
The four dimensions are:
text
(B, C, patch_num, patch_len)
Why is patch_num = 3?
text
patch_num = floor((T_padded - patch_len) / stride) + 1
= floor((8 - 3) / 2) + 1
= floor(2.5) + 1
= 3
A fourth patch would start at index 6 and need indices 6 to 8, but the padded series ends at index 7, so only three patches fit.
The patches that get sliced out:
text
variable 0 padded:
[1, 2, 3, 4, 5, 6, 6, 6]
Patch 0, start 0: [1, 2, 3]
Patch 1, start 2: [3, 4, 5]
Patch 2, start 4: [5, 6, 6]
variable 1 padded:
[10, 20, 30, 40, 50, 60, 60, 60]
Patch 0, start 0: [10, 20, 30]
Patch 1, start 2: [30, 40, 50]
Patch 2, start 4: [50, 60, 60]
So at this point the tensor can be read as:
text
x[batch=0, var=0] =
[
  [1, 2, 3],
  [3, 4, 5],
  [5, 6, 6],
]
x[batch=0, var=1] =
[
  [10, 20, 30],
  [30, 40, 50],
  [50, 60, 60],
]
5.5 torch.reshape: merging the variable dimension into the batch
Code:
python
x = torch.reshape(x, (x.shape[0] * x.shape[1], x.shape[2], x.shape[3]))
Here:
text
x.shape[0] = B = 1
x.shape[1] = C = 2
x.shape[2] = patch_num = 3
x.shape[3] = patch_len = 3
So:
text
(1, 2, 3, 3)
-> (1*2, 3, 3)
-> (2, 3, 3)
After the reshape:
text
x[0] = variable 0 of batch 0:
[
  [1, 2, 3],
  [3, 4, 5],
  [5, 6, 6],
]
x[1] = variable 1 of batch 0:
[
  [10, 20, 30],
  [30, 40, 50],
  [50, 60, 60],
]
This step is the key to PatchTST's channel independence:
Each variable is treated as an independent series, so the Transformer that follows attends only among the patches of the same variable.
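The slicing and the merge of B and C can be traced end to end with nested lists. The sketch below is plain Python, not the PyTorch calls themselves: `unfold` is a hypothetical helper that imitates `Tensor.unfold` on one row, and the list comprehensions stand in for `torch.reshape`. It also shows how `n_vars` later splits the merged dimension back apart.

```python
def unfold(series, size, step):
    """Sliding windows of length `size`, moving `step` positions each time."""
    patch_num = (len(series) - size) // step + 1
    return [series[i * step : i * step + size] for i in range(patch_num)]


# Padded toy input, shape (B, C, T_padded) = (1, 2, 8).
padded = [
    [
        [1, 2, 3, 4, 5, 6, 6, 6],
        [10, 20, 30, 40, 50, 60, 60, 60],
    ]
]

# (B, C, T_padded) -> (B, C, patch_num, patch_len)
patches = [[unfold(var, size=3, step=2) for var in sample] for sample in padded]
patch_num = len(patches[0][0])

# (B, C, patch_num, patch_len) -> (B*C, patch_num, patch_len)
flat = [var for sample in patches for var in sample]

# Later, n_vars splits B*C back into (B, C).
n_vars = 2
restored = [flat[i : i + n_vars] for i in range(0, len(flat), n_vars)]

print(patch_num)            # 3
print(flat[0])              # [[1, 2, 3], [3, 4, 5], [5, 6, 6]]
print(restored == patches)  # True
```

After the merge, `flat[0]` and `flat[1]` are two separate sequences of three patches each, which is why attention in the encoder never mixes variable 0 with variable 1.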
5.6 value_embedding: each patch becomes one token vector
In __init__:
python
self.value_embedding = nn.Linear(patch_len, d_model, bias=False)
Here:
text
patch_len = 3
d_model = 4
So:
text
Linear(3, 4)
Input:
text
x.shape = (B*C, patch_num, patch_len) = (2, 3, 3)
Output:
text
value_embedding(x).shape = (B*C, patch_num, d_model) = (2, 3, 4)
This does more than change the shape: it applies a learnable linear projection.
To see how, assume a made-up Linear weight of shape (d_model, patch_len):
text
W =
[
  [1,   0,   0  ],
  [0,   1,   0  ],
  [0,   0,   1  ],
  [1/3, 1/3, 1/3],
]
For patch 0 of variable 0:
text
patch = [1, 2, 3]
The output is a 4-dimensional token:
text
token[0] = 1*1 + 0*2 + 0*3 = 1
token[1] = 0*1 + 1*2 + 0*3 = 2
token[2] = 0*1 + 0*2 + 1*3 = 3
token[3] = (1 + 2 + 3) / 3 = 2
value_token = [1, 2, 3, 2]
In a real model the weights are not these fixed values; they are learned during training.
The key point:
A local time segment of length patch_len is projected by Linear into a token representation of length d_model.
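The hand calculation above can be checked with a few lines of plain Python. This is a sketch using the same made-up weight matrix; `linear_no_bias` is a hypothetical helper that computes what a bias-free linear layer computes for one patch (one dot product per weight row).

```python
def linear_no_bias(patch, weight):
    """One output per weight row: the dot product of that row with the patch."""
    return [sum(w * v for w, v in zip(row, patch)) for row in weight]


# Made-up weight, shape (d_model, patch_len) = (4, 3). Real weights are learned.
W = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1 / 3, 1 / 3, 1 / 3],
]

# The three patches of variable 0 from the toy example.
patches_var0 = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [5.0, 6.0, 6.0]]

tokens = [linear_no_bias(p, W) for p in patches_var0]
for token in tokens:
    print([round(v, 3) for v in token])
# [1.0, 2.0, 3.0, 2.0]
# [3.0, 4.0, 5.0, 4.0]
# [5.0, 6.0, 6.0, 5.667]
```

The same W is applied to every patch of every variable, so the projection is shared across patch positions and across channels.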
5.7 position_embedding: adding a position to each patch
Code:
python
x = self.value_embedding(x) + self.position_embedding(x)
At this point:
text
self.value_embedding(x).shape = (2, 3, 4)
self.position_embedding(x).shape = (1, 3, 4)
Why is the position embedding (1, 3, 4)?
text
1 = every batch element and variable shares the same positional encoding
3 = patch_num
4 = d_model
It broadcasts to:
text
(2, 3, 4)
Take a made-up positional encoding:
text
pos 0: [0, 1, 0, 1]
pos 1: [0.8, 0.5, 0.1, 0.9]
pos 2: [0.9, -0.4, 0.2, 0.8]
For patch 0 of variable 0:
text
value_token = [1, 2, 3, 2]
pos_0 = [0, 1, 0, 1]
final_token = [1, 3, 3, 3]
This step tells the model:
The token carries not only the content of a patch but also which patch it is in the sequence.
5.8 dropout and the return value
Finally:
python
return self.dropout(x), n_vars
Two things are returned:
text
self.dropout(x): (B*C, patch_num, d_model) = (2, 3, 4)
n_vars: 2
dropout does not change the shape.
n_vars is used later to split B*C back apart:
text
(B*C, patch_num, d_model)
-> reshape(-1, n_vars, patch_num, d_model)
-> (B, C, patch_num, d_model)
5.9 The whole flow in one diagram
text
Input x:
  (B,C,T) = (1,2,6)
ReplicationPad1d((0,2)):
  (1,2,6) -> (1,2,8)
unfold(size=3, step=2):
  (1,2,8) -> (1,2,3,3)
  (B,C,patch_num,patch_len)
reshape(B*C,...):
  (1,2,3,3) -> (2,3,3)
Linear(patch_len=3, d_model=4):
  (2,3,3) -> (2,3,4)
position_embedding:
  (1,3,4), added by broadcasting
dropout:
  (2,3,4)
6. Level 6: why add a position embedding
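Attention itself has no built-in notion of order.
Consider just a set of tokens:
text
[patch0, patch1, patch2]
Without positional encoding, self-attention treats them as an unordered set, so the model has no direct signal for which patch comes earlier and which comes later.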
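This order-blindness can be checked numerically. The sketch below is a deliberately simplified single-head self-attention in plain Python with identity query, key, and value projections (a simplifying assumption; real layers use learned projections). The function name and the toy tokens are hypothetical. Shuffling the input tokens only shuffles the outputs in the same way, so nothing in the result depends on the original order.

```python
import math


def self_attention(tokens):
    """Scaled dot-product self-attention with identity Q/K/V projections."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]  # numerically stable softmax
        total = sum(weights)
        out.append(
            [sum(w * v[j] for w, v in zip(weights, tokens)) / total for j in range(d)]
        )
    return out


# Three toy tokens without any position information.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
perm = [2, 0, 1]
shuffled = [tokens[i] for i in perm]

out = self_attention(tokens)
out_shuffled = self_attention(shuffled)

# Output j of the shuffled run equals output perm[j] of the original run.
same = all(
    abs(a - b) < 1e-12
    for j in range(len(perm))
    for a, b in zip(out_shuffled[j], out[perm[j]])
)
print(same)  # True
```

Once a position vector is added to each token before attention, the shuffled and unshuffled inputs are no longer the same set of vectors, and the model can tell the orders apart.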
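Positional encoding provides:
text
position 0
position 1
position 2
...
So each token ends up as:
text
content information + position information
7. Common mistakes
7.1 Treating embedding as nothing more than an NLP vocabulary lookup
In time-series models, an embedding is not only a vocabulary table lookup.
It can be:
text
Conv1d projection of values
Linear projection of patches
lookup table for time fields
sine/cosine positional encoding
7.2 Forgetting that every embedding must end up aligned to d_model
In Informer:
text
value_embedding: (B,L,d_model)
temporal_embedding: (B,L,d_model)
position_embedding: (1,L,d_model)
They can be added because the last dimension of each is d_model, and the leading 1 of position_embedding broadcasts over B.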
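The three-way addition with broadcasting can be spelled out on nested lists. This is a plain-Python sketch with made-up numbers; `add_embeddings` is a hypothetical helper, and the explicit loop does by hand what tensor broadcasting does automatically for the leading dimension of size 1.

```python
def add_embeddings(value, temporal, position):
    """value, temporal: (B, L, d_model); position: (1, L, d_model)."""
    shared = position[0]  # the single position table, reused for every batch element
    out = []
    for val_seq, tmp_seq in zip(value, temporal):
        seq = []
        for v_row, t_row, p_row in zip(val_seq, tmp_seq, shared):
            # All three rows must have length d_model for the sum to make sense.
            assert len(v_row) == len(t_row) == len(p_row)
            seq.append([v + t + p for v, t, p in zip(v_row, t_row, p_row)])
        out.append(seq)
    return out


# Made-up values with B = 2, L = 2, d_model = 2.
value = [[[1.0, 1.0], [2.0, 2.0]], [[3.0, 3.0], [4.0, 4.0]]]
temporal = [[[10.0, 10.0], [10.0, 10.0]], [[20.0, 20.0], [20.0, 20.0]]]
position = [[[0.0, 1.0], [0.5, 0.5]]]

tokens = add_embeddings(value, temporal, position)
print(tokens[0])  # [[11.0, 12.0], [12.5, 12.5]]
print(tokens[1])  # [[23.0, 24.0], [24.5, 24.5]]
```

Both batch elements receive the same position rows, while the value and temporal parts differ per element.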
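8. One-sentence summary
A unified way to understand the embedding layer:
It turns the raw values, the time fields, and the position indices all into d_model-dimensional tokens, so that the attention that follows can operate in one shared representation space.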