Deep learning today relies on very simple first-order optimization, namely gradient descent; no matter how many adaptive optimizers exist, they are all essentially variants of gradient descent. The initial learning rate therefore plays a decisive role in how a deep network converges. The gradient descent update is

$$\theta_{t+1} = \theta_t - \alpha \, \nabla_\theta L(\theta_t)$$

where $\alpha$ is the learning rate. If the learning rate is too small, the loss decreases very slowly; if it is too large, the parameter updates are so big that the network may end up bouncing around a poor local optimum, fail to converge, or see the loss increase outright, as the figure below illustrates. The choice of learning rate should also change over the course of training: at the beginning the parameters are essentially random, so a relatively large learning rate makes the loss drop faster; after training for a while the updates should become smaller, so the learning rate is usually decayed. There are many decay schemes, for example multiplying the learning rate by 0.1 after a fixed number of steps, or exponential decay.
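As a concrete illustration, here is a minimal sketch (plain NumPy, all values made up) of this update rule combined with a simple step decay that multiplies the learning rate by 0.1 every fixed number of updates:

```python
import numpy as np

def sgd_step(theta, grad, lr):
    """One plain gradient-descent update: theta <- theta - lr * grad."""
    return theta - lr * grad

theta = np.array([5.0])           # start far from the minimum of f(x) = x^2
base_lr, decay_every = 0.3, 20    # hypothetical schedule: lr *= 0.1 every 20 updates

for step in range(60):
    lr = base_lr * 0.1 ** (step // decay_every)
    grad = 2 * theta              # gradient of f(x) = x^2
    theta = sgd_step(theta, grad, lr)

print(theta)                      # close to 0: large steps early, tiny steps late
```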
https://zhuanlan.zhihu.com/p/31424275
The learning rate is a parameter of the optimizer class (optimizer); setting the learning rate simply means passing this parameter to the optimizer. There are two broad approaches:
One is a static, constant learning rate, passed in as a constant when constructing the optimizer. The other is a dynamically adjusted learning rate: mxnet provides the lr_scheduler module for this. You construct a scheduler object (lrs) and pass it to the optimizer as its lr_scheduler argument.
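For the static case, the constant is simply placed in the optimizer's parameter dictionary when the Trainer is built. A minimal sketch, assuming `net` is an already constructed Gluon network:

```python
from mxnet import gluon

# Static learning rate: the optimizer uses 0.01 for the whole run.
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.01})
```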
```python
lrs  # you need to construct the learning-rate schedule yourself (a sketch follows below)
trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        {'learning_rate': learning_rate, 'wd': 0.001, 'lr_scheduler': lrs})
```
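The `lrs` placeholder above could, for instance, be one of the schedulers from `mxnet.lr_scheduler` described in the rest of this post; a sketch with arbitrary numbers:

```python
import mxnet as mx

# Multiply the learning rate by 0.1 once every 1000 updates.
lrs = mx.lr_scheduler.FactorScheduler(step=1000, factor=0.1)
```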
For the base class LRScheduler the parameters keep their default values; FactorScheduler adds the following parameters on top of it:

step: the learning rate is adjusted once every `step` updates
factor: the adjustment factor; each adjustment multiplies the current learning rate by this factor to get the new learning rate
stop_factor_lr: the minimum learning rate; once it is reached the learning rate is no longer updated, which avoids an excessively small learning rate
base_lr: 1

With this configuration we get the schedule shown in the figure below. From the class docstring: it reduces the learning rate by a factor for every n steps, and returns a new learning rate computed as base_lr * pow(factor, floor(num_update/step)).

Approximate plot:

Source code:

```python
def __init__(self, step, factor=1, stop_factor_lr=1e-8, base_lr=0.01,
             warmup_steps=0, warmup_begin_lr=0, warmup_mode='linear'):
    super(FactorScheduler, self).__init__(base_lr, warmup_steps, warmup_begin_lr, warmup_mode)
    if step < 1:
        raise ValueError("Schedule step must be greater or equal than 1 round")
    if factor > 1.0:
        raise ValueError("Factor must be no more than 1 to make lr reduce")
    self.step = step
    self.factor = factor
    self.stop_factor_lr = stop_factor_lr
    self.count = 0

def __call__(self, num_update):
    if num_update < self.warmup_steps:
        return self.get_warmup_lr(num_update)

    # NOTE: use while rather than if (for continuing training via load_epoch)
    while num_update > self.count + self.step:
        self.count += self.step
        self.base_lr *= self.factor
        if self.base_lr < self.stop_factor_lr:
            self.base_lr = self.stop_factor_lr
            logging.info("Update[%d]: now learning rate arrived at %0.5e, will not "
                         "change in the future", num_update, self.base_lr)
        else:
            logging.info("Update[%d]: Change learning rate to %0.5e",
                         num_update, self.base_lr)
    return self.base_lr
```
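To see the formula base_lr * pow(factor, floor(num_update/step)) in action, a scheduler object can simply be called with an update count. A small sketch (the step and factor values here are arbitrary):

```python
import mxnet as mx

sched = mx.lr_scheduler.FactorScheduler(step=250, factor=0.5, base_lr=1.0)
for n in (0, 100, 300, 600, 900):
    print(n, sched(n))   # 1.0, 1.0, 0.5, 0.25, 0.125 -- halved once every 250 updates
```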
Factor decay adjusts the learning rate every fixed number of updates; multi-factor decay adjusts it only when the update count reaches one of the values in a list. For the base class LRScheduler the parameters again keep their default values; MultiFactorScheduler adds the following parameters on top of it:
step: compared with factor decay, the biggest difference is that step is no longer a single integer but a list of integers; the learning rate is adjusted once each time the update count exceeds a value in the list, so only the timing of the adjustments differs
factor: the adjustment factor; each adjustment multiplies the current learning rate by this factor to get the new learning rate
base_lr: 1

With this configuration we get roughly the schedule in the figure below.

Approximate plot:

Source code:

```python
def __init__(self, step, factor=1, base_lr=0.01, warmup_steps=0, warmup_begin_lr=0, warmup_mode='linear'):
    super(MultiFactorScheduler, self).__init__(base_lr, warmup_steps, warmup_begin_lr, warmup_mode)
    assert isinstance(step, list) and len(step) >= 1
    for i, _step in enumerate(step):
        if i != 0 and step[i] <= step[i-1]:
            raise ValueError("Schedule step must be an increasing integer list")
        if _step < 1:
            raise ValueError("Schedule step must be greater or equal than 1 round")
    if factor > 1.0:
        raise ValueError("Factor must be no more than 1 to make lr reduce")
    self.step = step
    self.cur_step_ind = 0
    self.factor = factor
    self.count = 0

def __call__(self, num_update):
    if num_update < self.warmup_steps:
        return self.get_warmup_lr(num_update)

    # NOTE: use while rather than if (for continuing training via load_epoch)
    while self.cur_step_ind <= len(self.step)-1:
        if num_update > self.step[self.cur_step_ind]:
            self.count = self.step[self.cur_step_ind]
            self.cur_step_ind += 1
            self.base_lr *= self.factor
            logging.info("Update[%d]: Change learning rate to %0.5e",
                         num_update, self.base_lr)
        else:
            return self.base_lr
    return self.base_lr
```
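The same kind of sketch for MultiFactorScheduler shows that the learning rate only drops when the update count passes one of the entries in step (the values here are chosen arbitrarily):

```python
import mxnet as mx

sched = mx.lr_scheduler.MultiFactorScheduler(step=[250, 750], factor=0.1, base_lr=1.0)
for n in (100, 300, 600, 800, 1000):
    print(n, sched(n))   # 1.0, 0.1, 0.1, 0.01, 0.01 -- one adjustment per entry in step
```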
Factor-based schedules adjust the learning rate at discrete points, so it is natural to use a continuous function instead. Since the learning rate should change by larger amounts early on and smaller amounts later, a power function with base in (0, 1) is a good fit; the polynomial strategy (PolyScheduler) uses exactly such a power function. For the base class LRScheduler the parameters keep their default values; warmup_steps = 0, meaning warmup is not used, and the related parameters (such as warmup_begin_lr and warmup_mode) keep their defaults. On top of LRScheduler, the following parameters are added:
base_lr: set to 1; adjustment starts from update 0, and after the maximum number of updates (max_update = 1000) the learning rate no longer changes, as in the figure below
final_lr: the final learning rate

Approximate plot:

Source code:

```python
def __init__(self, max_update, base_lr=0.01, pwr=2, final_lr=0, warmup_steps=0, warmup_begin_lr=0, warmup_mode='linear'):
    super(PolyScheduler, self).__init__(base_lr, warmup_steps, warmup_begin_lr, warmup_mode)
    assert isinstance(max_update, int)
    if max_update < 1:
        raise ValueError("maximum number of updates must be strictly positive")
    self.power = pwr
    self.base_lr_orig = self.base_lr
    self.max_update = max_update
    self.final_lr = final_lr
    self.max_steps = self.max_update - self.warmup_steps

def __call__(self, num_update):
    if num_update < self.warmup_steps:
        return self.get_warmup_lr(num_update)
    if num_update <= self.max_update:
        self.base_lr = self.final_lr + (self.base_lr_orig - self.final_lr) * \
            pow(1 - float(num_update - self.warmup_steps) / float(self.max_steps), self.power)
    return self.base_lr
```
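A quick sketch of the values PolyScheduler produces with roughly the configuration above (base_lr = 1, max_update = 1000, quadratic power, final_lr = 0):

```python
import mxnet as mx

sched = mx.lr_scheduler.PolyScheduler(max_update=1000, base_lr=1.0, pwr=2, final_lr=0)
for n in (0, 250, 500, 750, 1000):
    print(n, sched(n))   # 1.0, 0.5625, 0.25, 0.0625, 0.0
```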
Besides a polynomial, we can also exploit the monotonicity of the cosine function to adjust the learning rate. The parameters are exactly the same as for the polynomial scheduler; only the internal function is replaced by a cosine:

base_lr (initial learning rate): 1
final_lr (final learning rate): 0.1
max_update (maximum number of updates): 1000
warmup_steps: 0

Approximate plot:

Source code:
```python
def __init__(self, max_update, base_lr=0.01, final_lr=0, warmup_steps=0, warmup_begin_lr=0, warmup_mode='linear'):
    super(CosineScheduler, self).__init__(base_lr, warmup_steps, warmup_begin_lr, warmup_mode)
    assert isinstance(max_update, int)
    if max_update < 1:
        raise ValueError("maximum number of updates must be strictly positive")
    self.base_lr_orig = base_lr
    self.max_update = max_update
    self.final_lr = final_lr
    self.max_steps = self.max_update - self.warmup_steps

def __call__(self, num_update):
    if num_update < self.warmup_steps:
        return self.get_warmup_lr(num_update)
    if num_update <= self.max_update:
        self.base_lr = self.final_lr + (self.base_lr_orig - self.final_lr) * \
            (1 + cos(pi * (num_update - self.warmup_steps) / self.max_steps)) / 2
    return self.base_lr
```

https://blog.csdn.net/gaussrieman123/article/details/99441123
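As a quick check of the cosine shape, the scheduler can be evaluated at a few update counts with the configuration listed above (base_lr = 1, final_lr = 0.1, max_update = 1000); a sketch:

```python
import mxnet as mx

sched = mx.lr_scheduler.CosineScheduler(max_update=1000, base_lr=1.0, final_lr=0.1)
for n in (0, 250, 500, 750, 1000):
    print(n, sched(n))   # 1.0, ~0.87, 0.55, ~0.23, 0.1 -- a smooth half-cosine decay
```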
In transfer learning (finetuning) we often need to freeze the learning rate of the pretrained layers, or set it lower than that of the layers added on top. This requires setting different learning rates for different layers, which can be done with net.collect_params('re').setattr('lr_mult', ratio), where net is a model object and 're' is a regular expression selecting the parameters. net.collect_params() returns a ParameterDict containing all parameters of the network.
Function prototype:

```python
def collect_params(self, select=None)
```

The select argument can be a regular expression, so that collect_params() only picks up the parameters matched by it; it can also be the exact name of a specific layer:
```python
model.collect_params('conv1_weight|conv1_bias|fc_weight|fc_bias')
model.collect_params('.*weight|.*bias')
```

When we want to set the learning rate of certain parameters separately, we need a regular expression that matches exactly those parameters. Take the ResNet50 below as an example; all of its parameters are listed here (some layers in the middle are omitted):
```python
print(net.collect_params())
```

```
resnet50v1 (
  Parameter resnet50v1batchnorm0_gamma (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1batchnorm0_beta (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1batchnorm0_running_mean (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1batchnorm0_running_var (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1conv0_weight (shape=(64, 0, 5, 5), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_batchnorm0_gamma (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_batchnorm0_beta (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_batchnorm0_running_mean (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_batchnorm0_running_var (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_conv0_weight (shape=(64, 0, 1, 1), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_batchnorm1_gamma (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_batchnorm1_beta (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_batchnorm1_running_mean (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_batchnorm1_running_var (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_conv1_weight (shape=(64, 0, 3, 3), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_batchnorm2_gamma (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_batchnorm2_beta (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_batchnorm2_running_mean (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_batchnorm2_running_var (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_conv2_weight (shape=(256, 0, 1, 1), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1conv1_weight (shape=(256, 0, 1, 1), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1batchnorm1_gamma (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1batchnorm1_beta (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1batchnorm1_running_mean (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1batchnorm1_running_var (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_batchnorm3_gamma (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_batchnorm3_beta (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_batchnorm3_running_mean (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_batchnorm3_running_var (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_conv3_weight (shape=(64, 0, 1, 1), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_batchnorm4_gamma (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_batchnorm4_beta (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_batchnorm4_running_mean (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_batchnorm4_running_var (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_conv4_weight (shape=(64, 0, 3, 3), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_batchnorm5_gamma (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_batchnorm5_beta (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_batchnorm5_running_mean (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_batchnorm5_running_var (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_conv5_weight (shape=(256, 0, 1, 1), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_batchnorm6_gamma (shape=(0,), dtype=<class 'numpy.float32'>)
  .....
  .....
  Parameter resnet50v1layer4_batchnorm7_running_var (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer4_conv7_weight (shape=(512, 0, 3, 3), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer4_batchnorm8_gamma (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer4_batchnorm8_beta (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer4_batchnorm8_running_mean (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer4_batchnorm8_running_var (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer4_conv8_weight (shape=(2048, 0, 1, 1), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1dense0_weight (shape=(10, 2048), dtype=float32)
  Parameter resnet50v1dense0_bias (shape=(10,), dtype=float32)
)
```

Suppose we want to increase the learning rate of the final fully connected layer. We can select its parameters with a regular expression, net.collect_params('.*dense'), which gives the following result:
```python
print(net.collect_params('.*dense'))
```

```
resnet50v1 (
  Parameter resnet50v1dense0_weight (shape=(10, 2048), dtype=float32)
  Parameter resnet50v1dense0_bias (shape=(10,), dtype=float32)
)
```

Once the parameters we want to configure are selected, all that is left is to set their lr_mult attribute. The effective learning rate of that layer is lr * lr_mult, and when lr_mult = 0 the parameters of that layer are not updated at all.
```python
trainer = mx.gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1})
net.collect_params('.*dense').setattr('lr_mult', 3)
```

In net.collect_params() we use a regular expression to match the parameters that need their own setting, and then the setattr() method sets their learning-rate multiplier lr_mult, which effectively sets the learning rate of that layer. So when designing a network it pays to give each layer a specific prefix_name, which makes it easy to match the parameters of any given layer with a regular expression. In the same way we can also initialize the parameters of different layers separately, as sketched below.
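For the per-layer initialization just mentioned, the same regex selection works with initialize(). A minimal sketch, assuming `net` is the network defined earlier and we only want to re-initialize the final dense layer:

```python
import mxnet as mx

# Re-initialize only the final dense layer; the pretrained layers keep their weights.
net.collect_params('.*dense').initialize(mx.init.Xavier(), force_reinit=True)
```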
https://blog.csdn.net/xiangjiaojun_/article/details/85812248