How to Distinguish and Remember the Common Normalization Algorithms

Technology, 2024-05-31

References:

https://zhuanlan.zhihu.com/p/69659844

https://www.cnblogs.com/cxq1126/p/13299859.html


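Neural networks use several normalization algorithms: Batch Normalization (BN), Layer Normalization (LN), Instance Normalization (IN), and Group Normalization (GN). Judging by the formula they are nearly identical, as (1) shows: subtract the mean, divide by the standard deviation, then apply a linear map.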
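A common way to write this shared form is given below. The notation follows the PyTorch documentation for these layers and is an assumption about what equation (1) contained; $\epsilon$ is a small constant added for numerical stability.

```latex
y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} \cdot \gamma + \beta \tag{1}
```

The four algorithms differ only in which elements of the input tensor $\mathrm{E}[x]$ and $\mathrm{Var}[x]$ are computed over.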

Batch Normalization (BN)

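# coding=utf8
import torch
from torch import nn

# track_running_stats=False: compute the true mean and standard deviation of the
# current batch instead of updating the global running mean and standard deviation
# affine=False: only normalize; do not multiply by gamma and add beta
# (those can only be determined by training)
# num_features is the number of channels of the feature map
# eps is set to 0 so the official result and our own result are as close as possible
bn = nn.BatchNorm2d(num_features=3, eps=0, affine=False, track_running_stats=False)

# Multiply by 10000 to enlarge the values, so any mismatch is more visible
x = torch.rand(10, 3, 5, 5) * 10000
official_bn = bn(x)

# Pull out the channel dimension and merge all the other dimensions,
# over which the mean and standard deviation are computed
x1 = x.permute(1, 0, 2, 3).contiguous().view(3, -1)

mu = x1.mean(dim=1).view(1, 3, 1, 1)
# unbiased=False: use the biased estimate (divide by N rather than N-1),
# consistent with the original paper
# In my view the unbiased estimate is mostly a mathematical nicety and
# makes little difference in practice
std = x1.std(dim=1, unbiased=False).view(1, 3, 1, 1)

my_bn = (x - mu) / std

diff = (official_bn - my_bn).sum()
print('diff={}'.format(diff))  # the difference is on the order of 1e-5, so this essentially matches the official version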
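One caveat about the check above: it sums signed differences, and a normalized tensor sums to roughly zero whatever standard deviation it was divided by, so a wrong scale can go unnoticed. The sketch below is an illustration rather than part of the original post. It deliberately uses the unbiased standard deviation and compares the signed sum with the largest absolute difference and with torch.allclose. The tolerance of 1e-4 is an arbitrary choice.

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.rand(10, 3, 5, 5) * 10000
bn = nn.BatchNorm2d(num_features=3, eps=0, affine=False, track_running_stats=False)
official_bn = bn(x)

# Per-channel statistics over the batch and spatial axes (N, H, W).
mu = x.mean(dim=(0, 2, 3), keepdim=True)
std_biased = x.std(dim=(0, 2, 3), unbiased=False, keepdim=True)
std_unbiased = x.std(dim=(0, 2, 3), unbiased=True, keepdim=True)

my_bn = (x - mu) / std_biased          # matches what BatchNorm2d computes
wrong_bn = (x - mu) / std_unbiased     # deliberately wrong scale

# The signed sum stays near zero even for the wrong version ...
signed_sum_wrong = (official_bn - wrong_bn).sum().item()
# ... while the largest absolute difference exposes the mismatch.
max_abs_wrong = (official_bn - wrong_bn).abs().max().item()
max_abs_right = (official_bn - my_bn).abs().max().item()

right_matches = torch.allclose(official_bn, my_bn, atol=1e-4)
wrong_matches = torch.allclose(official_bn, wrong_bn, atol=1e-4)
print('signed_sum_wrong={}'.format(signed_sum_wrong))
print('max_abs_wrong={}, max_abs_right={}'.format(max_abs_wrong, max_abs_right))
print('right_matches={}, wrong_matches={}'.format(right_matches, wrong_matches))
```

For this kind of comparison, an element-wise check such as torch.allclose is generally the safer choice.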

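Example:

The input is 6 samples, each with 3 channels of 784 pixels. Split by channel, each channel holds data of shape [6, 784], from which we compute one mean μ and one standard deviation σ, so there are as many pairs of statistics as there are channels. Subtracting μ from every pixel value and dividing by σ gives data with roughly zero mean and unit variance (loosely, N(0,1)); the parameters β and γ then shift and scale it to mean β and standard deviation γ (loosely, N(β, γ²)).

μ and σ are only statistics of the samples, not trainable parameters: the layer keeps running estimates of them in buffers (running_mean and running_var) that the optimizer never updates. γ and β are parameters to be trained, and they do carry gradients.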
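The 6 x 3 x 784 example can be sketched as follows. The input distribution and the values chosen for γ and β (2.0 and 0.5) are arbitrary and only for illustration; in a real network γ and β are learned.

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.rand(6, 3, 784) * 5 + 2   # 6 samples, 3 channels, 784 pixels
layer = nn.BatchNorm1d(3)           # one (mu, sigma, gamma, beta) per channel

# At initialization gamma=1 and beta=0, so the output is only normalized.
out = layer(x)
mean_before = out.mean(dim=(0, 2))
std_before = out.std(dim=(0, 2), unbiased=False)
print(mean_before)  # close to 0 in every channel
print(std_before)   # close to 1 in every channel

# Illustrative values for gamma and beta (normally set by training).
with torch.no_grad():
    layer.weight.fill_(2.0)  # gamma
    layer.bias.fill_(0.5)    # beta

out_affine = layer(x)
mean_after = out_affine.mean(dim=(0, 2))
std_after = out_affine.std(dim=(0, 2), unbiased=False)
print(mean_after)   # close to beta = 0.5
print(std_after)    # close to gamma = 2.0
```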

import torch
from torch import nn

x = torch.rand(100, 16, 784)  # 100 samples with 16 channels of 784 pixels, uniformly distributed
layer = nn.BatchNorm1d(16)    # pass the number of channels; H and W are already flattened, so use 1d
out = layer(x)

print(layer.running_mean)
# tensor([0.0499, 0.0501, 0.0501, 0.0501, 0.0501, 0.0502, 0.0500, 0.0499, 0.0499,
#         0.0501, 0.0500, 0.0500, 0.0500, 0.0501, 0.0500, 0.0500])
print(layer.running_var)
# tensor([0.9083, 0.9083, 0.9083, 0.9084, 0.9083, 0.9083, 0.9084, 0.9083, 0.9083,
#         0.9083, 0.9083, 0.9083, 0.9084, 0.9084, 0.9083, 0.9083])

import torch
from torch import nn

x = torch.rand(1, 16, 7, 7)   # one 7*7 image with 16 channels
layer = nn.BatchNorm2d(16)    # pass the number of channels (must match the input above)
out = layer(x)

print(out.shape)              # torch.Size([1, 16, 7, 7])
print(layer.running_mean)
print(layer.running_var)
print(layer.weight.shape)     # torch.Size([16]), the gamma above
print(layer.bias.shape)       # torch.Size([16]), the beta above

print(vars(layer))            # inspect all attributes of a layer
# {'training': True,
#  '_parameters':
#  OrderedDict([('weight', Parameter containing:
#                tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.], requires_grad=True)),
#               ('bias', Parameter containing:
#                tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], requires_grad=True))]),
#  '_buffers':
#  OrderedDict([('running_mean', tensor([0.0527, 0.0616, 0.0513, 0.0488, 0.0484, 0.0510, 0.0590, 0.0459, 0.0448, 0.0586, 0.0535, 0.0464, 0.0581, 0.0481, 0.0420, 0.0549])),
#               ('running_var', tensor([0.9089, 0.9075, 0.9082, 0.9079, 0.9096, 0.9098, 0.9079, 0.9086, 0.9081, 0.9075, 0.9052, 0.9081, 0.9093, 0.9075, 0.9086, 0.9073])),
#               ('num_batches_tracked', tensor(1))]),
#  '_backward_hooks': OrderedDict(),
#  '_forward_hooks': OrderedDict(),
#  '_forward_pre_hooks': OrderedDict(),
#  '_state_dict_hooks': OrderedDict(),
#  '_load_state_dict_pre_hooks': OrderedDict(),
#  '_modules': OrderedDict(),
#  'num_features': 16,
#  'eps': 1e-05,
#  'momentum': 0.1,
#  'affine': True,
#  'track_running_stats': True}
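The printed running statistics (about 0.05 and 0.9083) are not the batch mean and variance themselves. After one forward pass they are an exponential moving average, starting from running_mean=0 and running_var=1 with the default momentum of 0.1. A uniform [0, 1) input has mean 0.5 and variance 1/12, which gives 0.1 * 0.5 = 0.05 and 0.9 * 1 + 0.1 / 12 ≈ 0.9083. A minimal sketch that reproduces the update by hand, assuming the default settings of nn.BatchNorm1d:

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.rand(100, 16, 784)
layer = nn.BatchNorm1d(16)   # momentum defaults to 0.1
layer(x)                     # one forward pass in training mode

m = layer.momentum
batch_mean = x.mean(dim=(0, 2))
# The running variance is updated with the unbiased batch variance.
batch_var = x.var(dim=(0, 2), unbiased=True)

# Buffers start at running_mean=0 and running_var=1.
expected_mean = (1 - m) * torch.zeros(16) + m * batch_mean
expected_var = (1 - m) * torch.ones(16) + m * batch_var

mean_ok = torch.allclose(layer.running_mean, expected_mean, atol=1e-4)
var_ok = torch.allclose(layer.running_var, expected_var, atol=1e-4)
print(mean_ok, var_ok)
```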

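Tip:

layer.weight and layer.bias are the learnable γ and β. They are shared across batches and updated by the optimizer; they are not recomputed from the current batch.

If affine=False is passed when the layer is defined, γ=1 and β=0 are fixed and not learned, and layer.weight and layer.bias are then None.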
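A quick way to check which quantities are trainable, sketched with nn.BatchNorm2d (the variable names bn_affine and bn_plain are only illustrative):

```python
import torch
from torch import nn

bn_affine = nn.BatchNorm2d(16)
bn_plain = nn.BatchNorm2d(16, affine=False)

# gamma and beta are registered parameters and require gradients.
param_names = [name for name, _ in bn_affine.named_parameters()]
print(param_names)                            # ['weight', 'bias']
print(bn_affine.weight.requires_grad)         # True

# The running statistics are buffers, not parameters.
buffer_names = [name for name, _ in bn_affine.named_buffers()]
print(buffer_names)                           # ['running_mean', 'running_var', 'num_batches_tracked']
print(bn_affine.running_mean.requires_grad)   # False

# With affine=False there is nothing left to learn.
print(bn_plain.weight is None, bn_plain.bias is None)  # True True
num_plain_params = len(list(bn_plain.parameters()))
print(num_plain_params)                       # 0
```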

Layer Normalization (LN)

import torch
from torch import nn

x = torch.rand(10, 3, 5, 5) * 10000

# normalized_shape in effect tells the program how many pages this book has,
# and how many rows and columns there are on each page
# eps=0 removes that source of discrepancy
# elementwise_affine=False: no affine map
# The affine map here differs from the one in BN and in IN below: it is an elementwise affine,
# i.e. gamma and beta are not vectors over the channel dimension but tensors
# with the same shape as normalized_shape
ln = nn.LayerNorm(normalized_shape=[3, 5, 5], eps=0, elementwise_affine=False)

official_ln = ln(x)

x1 = x.view(10, -1)
mu = x1.mean(dim=1).view(10, 1, 1, 1)
std = x1.std(dim=1, unbiased=False).view(10, 1, 1, 1)

my_ln = (x - mu) / std

diff = (my_ln - official_ln).sum()
print('diff={}'.format(diff))  # the difference from the official version is on the order of 1e-5
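Two properties mentioned in the comments above can be checked directly: the affine parameters of LN have the shape of normalized_shape rather than one value per channel, and LN statistics are per sample, so a sample's output does not depend on the rest of the batch. A small sketch using the default eps and affine settings:

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.rand(10, 3, 5, 5) * 10000

ln = nn.LayerNorm(normalized_shape=[3, 5, 5])
bn = nn.BatchNorm2d(num_features=3)
print(ln.weight.shape)  # torch.Size([3, 5, 5]): one gamma per element
print(bn.weight.shape)  # torch.Size([3]): one gamma per channel

# Normalizing the first sample alone gives the same result as
# normalizing it as part of the whole batch.
whole_batch = ln(x)
first_alone = ln(x[:1])
batch_independent = torch.allclose(whole_batch[:1], first_alone, atol=1e-5)
print(batch_independent)
```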

Instance Normalization (IN)

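import torch
from torch import nn

x = torch.rand(10, 3, 5, 5) * 10000

# track_running_stats=False: compute the true mean and standard deviation of the
# current batch instead of updating the global running mean and standard deviation
# affine=False: only normalize; do not multiply by gamma and add beta
# (those can only be determined by training)
# num_features is the number of channels of the feature map
# eps is set to 0 so the official result and our own result are as close as possible
In = nn.InstanceNorm2d(num_features=3, eps=0, affine=False, track_running_stats=False)

official_in = In(x)

# One row per (sample, channel) pair: 10 * 3 = 30 rows
x1 = x.view(30, -1)
mu = x1.mean(dim=1).view(10, 3, 1, 1)
std = x1.std(dim=1, unbiased=False).view(10, 3, 1, 1)

my_in = (x - mu) / std

diff = (my_in - official_in).sum()
print('diff={}'.format(diff))  # the error is on the order of 1e-5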
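One way to remember IN is as BN applied to each sample separately: with a batch of one, the per-channel statistics are taken over H and W only. The sketch below checks this under the same settings as above (eps=0, no affine map, no running statistics). The tolerance is an arbitrary choice.

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.rand(10, 3, 5, 5) * 10000

inorm = nn.InstanceNorm2d(num_features=3, eps=0, affine=False, track_running_stats=False)
bn = nn.BatchNorm2d(num_features=3, eps=0, affine=False, track_running_stats=False)

official_in = inorm(x)
# Feed the samples to BN one at a time, then stack the results again.
per_sample_bn = torch.cat([bn(x[i:i + 1]) for i in range(x.shape[0])], dim=0)

same = torch.allclose(official_in, per_sample_bn, atol=1e-4)
print(same)
```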

Group Normalization (GN)

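Group Normalization (GN) suits tasks that use a lot of GPU memory, such as image segmentation. For such tasks the batch size may have to stay in the single digits, because anything larger runs out of memory. With a single-digit batch size BN performs poorly, since a handful of samples cannot approximate the population mean and standard deviation. GN is also independent of the batch; it is a compromise between LN and IN, as illustrated in the paper that proposed the algorithm.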

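import torch
from torch import nn

x = torch.rand(10, 20, 5, 5) * 10000

# Split the channels into 4 groups
# The other settings are the same as before
gn = nn.GroupNorm(num_groups=4, num_channels=20, eps=0, affine=False)
official_gn = gn(x)

# Merge the elements that belong to the same group
x1 = x.view(10, 4, -1)
mu = x1.mean(dim=-1).reshape(10, 4, -1)
# unbiased=False, as in the earlier examples, to match the biased estimate GroupNorm uses
std = x1.std(dim=-1, unbiased=False).reshape(10, 4, -1)

x1_norm = (x1 - mu) / std
my_gn = x1_norm.reshape(10, 20, 5, 5)

diff = (my_gn - official_gn).sum()
print('diff={}'.format(diff))  # the difference should be tiny (floating-point error only)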
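The sense in which GN sits between LN and IN can be checked at the two extremes: with a single group GN normalizes over all of (C, H, W) like LN, and with one group per channel it normalizes over (H, W) like IN. A minimal sketch, with the affine maps turned off so that only the normalization is compared:

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.rand(10, 20, 5, 5) * 10000

# One group containing all 20 channels behaves like LN over (C, H, W).
gn_one_group = nn.GroupNorm(num_groups=1, num_channels=20, eps=0, affine=False)
ln = nn.LayerNorm(normalized_shape=[20, 5, 5], eps=0, elementwise_affine=False)
gn_equals_ln = torch.allclose(gn_one_group(x), ln(x), atol=1e-4)

# Twenty groups of one channel each behave like IN.
gn_per_channel = nn.GroupNorm(num_groups=20, num_channels=20, eps=0, affine=False)
inorm = nn.InstanceNorm2d(num_features=20, eps=0, affine=False, track_running_stats=False)
gn_equals_in = torch.allclose(gn_per_channel(x), inorm(x), atol=1e-4)

print(gn_equals_ln, gn_equals_in)
```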

Summary
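For an input of shape (N, C, H, W), the four methods differ only in the axes over which the mean and standard deviation are taken. The sketch below puts them side by side; the helper normalize is hypothetical (not a PyTorch function), and the comparison tolerance is an arbitrary choice.

```python
import torch
from torch import nn


def normalize(x, dims):
    """Subtract the mean and divide by the biased std taken over `dims`."""
    mu = x.mean(dim=dims, keepdim=True)
    std = x.std(dim=dims, unbiased=False, keepdim=True)
    return (x - mu) / std


torch.manual_seed(0)
N, C, H, W, G = 10, 20, 5, 5, 4
x = torch.rand(N, C, H, W) * 10000

mine = {
    'BN': normalize(x, (0, 2, 3)),   # per channel, across the batch
    'LN': normalize(x, (1, 2, 3)),   # per sample, across all channels
    'IN': normalize(x, (2, 3)),      # per sample and per channel
    # GN: per sample and per group of C // G consecutive channels
    'GN': normalize(x.view(N, G, C // G, H, W), (2, 3, 4)).view(N, C, H, W),
}

official = {
    'BN': nn.BatchNorm2d(C, eps=0, affine=False, track_running_stats=False)(x),
    'LN': nn.LayerNorm([C, H, W], eps=0, elementwise_affine=False)(x),
    'IN': nn.InstanceNorm2d(C, eps=0, affine=False, track_running_stats=False)(x),
    'GN': nn.GroupNorm(G, C, eps=0, affine=False)(x),
}

results = {name: torch.allclose(official[name], mine[name], atol=1e-4)
           for name in official}
print(results)
```

Only BN reduces over axis 0 (the batch), which is why it is the only one of the four whose output for a sample depends on the other samples in the batch.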

Beyond the normalization methods above, there are algorithms derived from them, such as Conditional Batch Normalization and AdaIN; see the following blog posts, respectively:

https://zhuanlan.zhihu.com/p/61248211

https://zhuanlan.zhihu.com/p/57875010
