03A-Layer2A-DataEmbedding

Where this file sits

Parent layer: [[02-Layer1-forecast主链]]
Entry code: enc_out = self.enc_embedding(x_enc, x_mark_enc)
Entry function: DataEmbedding.forward(x, x_mark)
Output tensor: enc_out; its shape changes from (B, seq_len, enc_in) to (B, seq_len, d_model)

1. Sequence tree for this layer

1.1 Semantic grouping diagram

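2. Input/output interface

| Variable | Toy shape | Meaning |
| --- | --- | --- |
| x | (3,8,4) | Normalized historical value series |
| x_mark | (3,8,4) | Time features; at hourly frequency the timeF encoding typically yields four continuous features (hour of day, day of week, day of month, day of year) |
| value_embedding(x) | (3,8,6) | Conv1d maps the 4 variables to 6 hidden channels |
| temporal_embedding(x_mark) | (3,8,6) | Linear(4 -> 6) maps the time features to the hidden channels |
| position_embedding(x) | (1,8,6) | Sinusoidal positional encoding; broadcasts across the batch |
| output | (3,8,6) | Embedding obtained by summing the three components |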

3. Comparing against the source

Location: ts_benchmark/baselines/time_series_library/layers/Embed.py

```python
import torch.nn as nn

# TokenEmbedding, PositionalEmbedding, TemporalEmbedding and
# TimeFeatureEmbedding are referenced here and defined separately.


class DataEmbedding(nn.Module):
    def __init__(self, c_in, d_model, embed_type="fixed", freq="h", dropout=0.1):
        super(DataEmbedding, self).__init__()

        # Value path: Conv1d along the time axis, c_in -> d_model.
        self.value_embedding = TokenEmbedding(c_in=c_in, d_model=d_model)
        # Position path: fixed sinusoidal table.
        self.position_embedding = PositionalEmbedding(d_model=d_model)
        # Time path: TimeFeatureEmbedding only when embed_type == "timeF".
        self.temporal_embedding = (
            TemporalEmbedding(d_model=d_model, embed_type=embed_type, freq=freq)
            if embed_type != "timeF"
            else TimeFeatureEmbedding(d_model=d_model, embed_type=embed_type, freq=freq)
        )
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, x, x_mark):
        if x_mark is None:
            # No time features: value + position only.
            x = self.value_embedding(x) + self.position_embedding(x)
        else:
            x = (
                self.value_embedding(x)
                + self.temporal_embedding(x_mark)
                + self.position_embedding(x)
            )
        return self.dropout(x)
```

In TFB's transformer_adapter, TimesNet usually runs with embed_type="timeF", so temporal_embedding is a TimeFeatureEmbedding.

4. value_embedding: values enter the hidden space

Source:

```python
import torch
import torch.nn as nn


class TokenEmbedding(nn.Module):
    def __init__(self, c_in, d_model):
        super(TokenEmbedding, self).__init__()
        # With kernel_size=3, padding=1 keeps the output length equal to seq_len.
        padding = 1 if torch.__version__ >= "1.5.0" else 2
        self.tokenConv = nn.Conv1d(
            in_channels=c_in,
            out_channels=d_model,
            kernel_size=3,
            padding=padding,
            padding_mode="circular",
            bias=False,
        )
        # Kaiming-normal initialization for the convolution weights.
        for m in self.modules():
            if isinstance(m, nn.Conv1d):
                nn.init.kaiming_normal_(
                    m.weight, mode="fan_in", nonlinearity="leaky_relu"
                )

    def forward(self, x):
        # (B, L, C) -> (B, C, L) for Conv1d, then back to (B, L, d_model).
        x = self.tokenConv(x.permute(0, 2, 1)).transpose(1, 2)
        return x
```

Shape pipeline:

```text
x:                  (3,8,4)
x.permute(0,2,1):   (3,4,8)
Conv1d(4 -> 6):     (3,6,8)
transpose(1,2):     (3,8,6)
```
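The time length staying at 8 through the Conv1d step follows from the standard output-length formula for a 1-D convolution. The sketch below is a plain-Python check; the helper name conv1d_out_len is hypothetical and not part of the library.

```python
def conv1d_out_len(l_in, kernel_size, padding, stride=1, dilation=1):
    """Output length of a 1-D convolution (formula from the nn.Conv1d docs)."""
    return (l_in + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


# Toy setting from this note: seq_len=8, kernel_size=3.
same_len = conv1d_out_len(8, kernel_size=3, padding=1)
longer = conv1d_out_len(8, kernel_size=3, padding=2)
print(same_len, longer)
```

With padding=1 the length is preserved, which is what the later element-wise sum with the (3,8,6) temporal embedding requires.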

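What the Conv1d does mathematically: for each time step it forms a local linear combination over the length-3 neighborhood around that step, and at the same time maps the variable dimension 4 to the hidden dimension 6. Because padding_mode is "circular", the neighborhoods at the two ends of the sequence wrap around to the opposite end.

The toy example looks at a single batch element, a single output channel, and time position t=3. The global toy setup has 4 variables and a kernel of length 3; the weights are chosen to make the hand calculation easy: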

```text
Variable 0 at t=2,3,4: [1, 2, 3]
Variable 1 at t=2,3,4: [4, 5, 6]
Variable 2 at t=2,3,4: [7, 8, 9]
Variable 3 at t=2,3,4: [10, 11, 12]

Kernel for out_channel0:
Variable 0 weights: [0.10, 0.10, 0.10]
Variable 1 weights: [0.20, 0.20, 0.20]
Variable 2 weights: [0.05, 0.05, 0.05]
Variable 3 weights: [0.01, 0.01, 0.01]

output =
(1+2+3)*0.10 + (4+5+6)*0.20 + (7+8+9)*0.05 + (10+11+12)*0.01
= 0.60 + 3.00 + 1.20 + 0.33
= 5.13
```

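At run time the weights come from kaiming_normal_ initialization and are updated during training; the computation rule is exactly the same as in the toy above.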
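The hand calculation can be reproduced with a small plain-Python sketch of one output channel of a bias-free, circularly padded Conv1d. The function name circular_conv1d_channel is hypothetical, and the full length-8 sequences are an assumption: the note only gives the values at t=2,3,4, so the rest is filled in with a simple linear pattern that matches them.

```python
def circular_conv1d_channel(x, w):
    """One output channel of a bias-free Conv1d (kernel_size=3, padding=1, circular).

    x: list of C sequences, each of length L (channel-first layout).
    w: list of C kernels, each of length 3.
    """
    seq_len = len(x[0])
    out = []
    for t in range(seq_len):
        acc = 0.0
        for xc, wc in zip(x, w):
            for k, weight in enumerate(wc):
                # Kernel tap k reads position t-1+k; the modulo wraps around the ends.
                acc += weight * xc[(t - 1 + k) % seq_len]
        out.append(acc)
    return out


# Assumed full sequences, chosen so that t=2,3,4 match the toy values.
x = [
    [t - 1 for t in range(8)],  # variable 0: [1, 2, 3] at t=2,3,4
    [t + 2 for t in range(8)],  # variable 1: [4, 5, 6]
    [t + 5 for t in range(8)],  # variable 2: [7, 8, 9]
    [t + 8 for t in range(8)],  # variable 3: [10, 11, 12]
]
w = [[0.10] * 3, [0.20] * 3, [0.05] * 3, [0.01] * 3]

out = circular_conv1d_channel(x, w)
print(round(out[3], 2))  # 5.13
```

Under these assumed sequences, position t=0 reads t=7 as its left neighbor, which is the wrap-around behavior of circular padding.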

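5. temporal_embedding: time features enter the hidden space

Source:

```python
import torch.nn as nn


class TimeFeatureEmbedding(nn.Module):
    def __init__(self, d_model, embed_type="timeF", freq="h"):
        super(TimeFeatureEmbedding, self).__init__()

        # Number of time features supplied for each frequency code.
        freq_map = {"h": 4, "t": 5, "s": 6, "m": 1, "a": 1, "w": 2, "d": 3, "b": 3}
        d_inp = freq_map[freq]
        # Bias-free projection: d_inp time features -> d_model hidden channels.
        self.embed = nn.Linear(d_inp, d_model, bias=False)

    def forward(self, x):
        # (B, L, d_inp) -> (B, L, d_model)
        return self.embed(x)
```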

At hourly frequency, freq="h":

```text
x_mark:             (3,8,4)
Linear(4 -> 6):     (3,8,6)
```

The toy example looks at a single time step:

```text
x_mark[0,3,:] = [0.25, 0.50, 0.75, 1.00]

Suppose the weights of hidden channel 0 are:
w = [1.0, -1.0, 0.5, 2.0]

temporal_embedding[0,3,0]
= 0.25*1.0 + 0.50*(-1.0) + 0.75*0.5 + 1.00*2.0
= 2.125
```
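The same dot product can be written as a minimal plain-Python sketch of a bias-free linear layer. The helper linear_no_bias is hypothetical, and only the weight row for hidden channel 0 is given in the note, so the other five rows are simply left out.

```python
# d_inp per frequency code, copied from the TimeFeatureEmbedding source.
FREQ_MAP = {"h": 4, "t": 5, "s": 6, "m": 1, "a": 1, "w": 2, "d": 3, "b": 3}


def linear_no_bias(features, weight_rows):
    """Bias-free linear layer: one dot product per output channel."""
    return [sum(f * w for f, w in zip(features, row)) for row in weight_rows]


x_mark_t = [0.25, 0.50, 0.75, 1.00]  # x_mark[0,3,:] from the toy example
weight = [[1.0, -1.0, 0.5, 2.0]]     # hidden channel 0 only; other rows omitted

assert len(x_mark_t) == FREQ_MAP["h"]  # hourly data carries 4 time features
temporal = linear_no_bias(x_mark_t, weight)
print(temporal[0])  # 2.125
```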

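6. position_embedding: position information enters the hidden space

Source:

```python
import math

import torch
import torch.nn as nn


class PositionalEmbedding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super(PositionalEmbedding, self).__init__()
        # Compute the positional encodings once in log space.
        pe = torch.zeros(max_len, d_model).float()
        # Spelled `require_grad` (not `requires_grad`) in the source, so this only
        # sets an unused attribute; the tensor already has requires_grad=False.
        pe.require_grad = False

        position = torch.arange(0, max_len).float().unsqueeze(1)  # (max_len, 1)
        div_term = (
            torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model)
        ).exp()  # (d_model/2,)

        pe[:, 0::2] = torch.sin(position * div_term)  # even columns
        pe[:, 1::2] = torch.cos(position * div_term)  # odd columns

        pe = pe.unsqueeze(0)  # (1, max_len, d_model)
        self.register_buffer("pe", pe)

    def forward(self, x):
        # Only the sequence length x.size(1) is used; the values of x are ignored.
        return self.pe[:, : x.size(1)]
```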

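Formula:

$$
\begin{aligned}
PE(pos, 2i) &= \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right) \\
PE(pos, 2i+1) &= \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)
\end{aligned}
$$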

Shapes:

```text
self.pe:            (1,5000,6)
self.pe[:, :8]:     (1,8,6)
broadcast to batch: (3,8,6)
```
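To make the table concrete, the sketch below rebuilds the first 8 rows for d_model=6 in plain Python using the same frequencies as div_term. The function name sinusoidal_pe is hypothetical, and the sketch assumes an even d_model.

```python
import math


def sinusoidal_pe(seq_len, d_model):
    """Sinusoidal table: even columns use sin, odd columns use cos (even d_model)."""
    # Same frequencies as div_term: exp(-(log 10000) * 2i / d_model).
    div_term = [
        math.exp(-math.log(10000.0) * two_i / d_model)
        for two_i in range(0, d_model, 2)
    ]
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for j, freq in enumerate(div_term):
            pe[pos][2 * j] = math.sin(pos * freq)
            pe[pos][2 * j + 1] = math.cos(pos * freq)
    return pe


pe = sinusoidal_pe(seq_len=8, d_model=6)
print(len(pe), len(pe[0]))  # 8 6
print(round(pe[3][0], 3))   # 0.141
```

Row 0 is [0, 1, 0, 1, 0, 1], and the entry at position 3, column 0 is sin(3), because the first frequency is exactly 1.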

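7. Summing the three paths

Corresponding source:

```python
# Excerpt from DataEmbedding.forward (the x_mark is not None branch).
x = (
    self.value_embedding(x)
    + self.temporal_embedding(x_mark)
    + self.position_embedding(x)
)
return self.dropout(x)
```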

The toy example looks only at batch=0, time=3, hidden=0:

```text
value_embedding[0,3,0]    = 5.130
temporal_embedding[0,3,0] = 2.125
position_embedding[0,3,0] = sin(3) ≈ 0.141

sum = 5.130 + 2.125 + 0.141 = 7.396
After dropout:
training mode: the value is either zeroed or scaled by 1/(1-p)
evaluation mode: the value stays 7.396
```
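The two dropout outcomes can be spelled out for this single value with a short plain-Python sketch. The function inverted_dropout and its keep flag are hypothetical: keep stands in for the random mask that nn.Dropout draws, and p=0.1 is the default from the DataEmbedding constructor.

```python
def inverted_dropout(value, p, keep, training):
    """Dropout on one scalar: identity in eval mode, zero-or-rescale in training."""
    if not training:
        return value
    return value / (1.0 - p) if keep else 0.0


total = 5.130 + 2.125 + 0.141  # value + temporal + position at [0,3,0]
p = 0.1                        # default dropout rate of DataEmbedding

eval_out = inverted_dropout(total, p, keep=True, training=False)
kept_out = inverted_dropout(total, p, keep=True, training=True)
dropped_out = inverted_dropout(total, p, keep=False, training=True)
print(round(eval_out, 3), round(kept_out, 3), dropped_out)
```

In training mode a kept element becomes 7.396 / 0.9 ≈ 8.218, so the expected value over the random mask still equals the evaluation-mode value 7.396.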

8. Exit: back to the parent layer

```text
DataEmbedding output enc_out: (3,8,6)
Back to [[02-Layer1-forecast主链]]
Next step: predict_linear extends the time length from 8 to 13
```
