[MXNet Learning 12] Setting the Learning Rate


1. The learning rate

Deep learning today relies on a very simple first-order method: gradient descent. However many adaptive optimizers exist, they are all essentially variants of gradient descent, so the initial learning rate plays a decisive role in whether a deep network converges. The gradient descent update is

$\theta \leftarrow \theta - \alpha \nabla_\theta L(\theta)$

where $\alpha$ is the learning rate. If the learning rate is too small, the loss decreases very slowly; if it is too large, the parameter updates are so big that the network may settle at a poor local optimum, fail to converge, or even see the loss start to increase.

The choice of learning rate also changes over the course of training. At the beginning, the parameters are essentially random, so a relatively large learning rate makes the loss drop faster; after training for a while, the updates should become smaller, so the learning rate is usually decayed. There are many decay schemes, for example multiplying the learning rate by 0.1 after a fixed number of steps, or exponential decay.
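As a minimal, framework-free sketch of the two ideas above (the loss, the decay interval and the decay factor are illustrative assumptions, not from the original post):

import numpy as np

def sgd_step(theta, grad, lr):
    """One gradient descent update: theta <- theta - lr * grad."""
    return theta - lr * grad

def step_decay(base_lr, step, decay_every=1000, factor=0.1):
    """Multiply the learning rate by `factor` every `decay_every` steps (illustrative values)."""
    return base_lr * (factor ** (step // decay_every))

theta = np.zeros(3)
for step in range(3000):
    grad = 2 * theta - 1.0          # gradient of a toy quadratic loss
    lr = step_decay(0.1, step)      # 0.1 -> 0.01 -> 0.001
    theta = sgd_step(theta, grad, lr)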

    https://zhuanlan.zhihu.com/p/31424275

2. How to set the learning rate in MXNet

The learning rate is a parameter of the optimizer class; setting the learning rate simply means passing this parameter to the optimizer. There are two broad approaches:

One is a static constant learning rate: just pass a constant value when constructing the optimizer. The other is a dynamically scheduled learning rate: MXNet provides the lr_scheduler module for this.

2.1 Static constant learning rate

sgd_optimizer = mx.optimizer.SGD(learning_rate=0.03)   # a fixed, constant learning rate
trainer = mx.gluon.Trainer(params=net.collect_params(), optimizer=sgd_optimizer)

2.2 Dynamic scheduling

Construct a scheduler object lrs and pass it to the optimizer as its lr_scheduler argument.

lrs = ...   # build your own learning-rate schedule (see the schedulers below)
trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        {'learning_rate': learning_rate, 'wd': 0.001, 'lr_scheduler': lrs})

2.2.1 Base class: LRScheduler

class mxnet.lr_scheduler.LRScheduler(base_lr=0.01, warmup_steps=0, warmup_begin_lr=0, warmup_mode='linear')

- base_lr: the base learning rate
- warmup_steps: how many steps are used to grow the learning rate from warmup_begin_lr up to base_lr. (An epoch means every sample in the training set has been processed once; a step, or iteration, means the network weights have been updated once.)
- warmup_begin_lr: the initial learning rate of the warmup phase
- warmup_mode: either 'linear' or 'constant'. 'linear' keeps increasing the learning rate by a fixed increment; 'constant' keeps it at warmup_begin_lr (presumably jumping to base_lr once warmup_steps is reached).

def get_warmup_lr(self, num_update):
    assert num_update < self.warmup_steps
    if self.warmup_mode == 'linear':
        increase = (self.warmup_final_lr - self.warmup_begin_lr) \
                   * float(num_update) / float(self.warmup_steps)
        return self.warmup_begin_lr + increase
    elif self.warmup_mode == 'constant':
        return self.warmup_begin_lr
    else:
        raise ValueError("Invalid warmup mode %s" % self.warmup_mode)

Here warmup_final_lr equals base_lr. My understanding is that num_update is the number of weight updates performed so far during training (its maximum is the number of batches per epoch times the total number of epochs), while warmup_steps is the preset length of the warmup phase. Because the base class implements only get_warmup_lr(self, num_update) and leaves __call__(self, num_update) unimplemented, it cannot be passed directly to an optimizer as lr_scheduler; one of its subclasses must be used.
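Before moving on to the built-in subclasses, here is a minimal sketch of what such a subclass looks like; the class name and the halving rule are made up purely for illustration:

from mxnet import lr_scheduler

class HalveEvery1000(lr_scheduler.LRScheduler):
    """Hypothetical scheduler: halve the learning rate every 1000 updates."""
    def __init__(self, base_lr=0.1, **kwargs):
        super(HalveEvery1000, self).__init__(base_lr=base_lr, **kwargs)

    def __call__(self, num_update):
        if num_update < self.warmup_steps:
            return self.get_warmup_lr(num_update)   # reuse the base-class warmup logic
        return self.base_lr * (0.5 ** (num_update // 1000))

lrs = HalveEvery1000(base_lr=0.1)
print(lrs(0), lrs(2500))   # 0.1 0.025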

2.2.2 Factor decay: FactorScheduler

The parameters inherited from the base class LRScheduler keep their default values; on top of LRScheduler, the following parameters are added:

- step: adjust the learning rate once every step updates
- factor: the decay factor; each adjustment multiplies the current learning rate by this factor
- stop_factor_lr: the minimum learning rate; once it is reached the learning rate stops decreasing, which avoids an excessively small learning rate
- base_lr: set to 1; with this configuration the schedule plotted below is obtained

# Reduce the learning rate by a factor for every n steps.
# It returns a new learning rate by: base_lr * pow(factor, floor(num_update/step))

(approximate plot of the resulting step-wise decay omitted)

Source:

def __init__(self, step, factor=1, stop_factor_lr=1e-8, base_lr=0.01,
             warmup_steps=0, warmup_begin_lr=0, warmup_mode='linear'):
    super(FactorScheduler, self).__init__(base_lr, warmup_steps, warmup_begin_lr, warmup_mode)
    if step < 1:
        raise ValueError("Schedule step must be greater or equal than 1 round")
    if factor > 1.0:
        raise ValueError("Factor must be no more than 1 to make lr reduce")
    self.step = step
    self.factor = factor
    self.stop_factor_lr = stop_factor_lr
    self.count = 0

def __call__(self, num_update):
    if num_update < self.warmup_steps:
        return self.get_warmup_lr(num_update)
    # NOTE: use while rather than if (for continuing training via load_epoch)
    while num_update > self.count + self.step:
        self.count += self.step
        self.base_lr *= self.factor
        if self.base_lr < self.stop_factor_lr:
            self.base_lr = self.stop_factor_lr
            logging.info("Update[%d]: now learning rate arrived at %0.5e, will not "
                         "change in the future", num_update, self.base_lr)
        else:
            logging.info("Update[%d]: Change learning rate to %0.5e",
                         num_update, self.base_lr)
    return self.base_lr
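A short usage sketch; the step, factor and learning-rate values are arbitrary illustrations, and net is assumed to be an existing Gluon model:

import mxnet as mx

# halve the learning rate every 250 updates, never going below 1e-3
lrs = mx.lr_scheduler.FactorScheduler(step=250, factor=0.5, stop_factor_lr=1e-3, base_lr=0.1)
trainer = mx.gluon.Trainer(net.collect_params(), 'sgd',
                           {'learning_rate': 0.1, 'wd': 0.001, 'lr_scheduler': lrs})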

2.2.3 Multi-factor decay: MultiFactorScheduler

Factor decay adjusts the learning rate at a fixed interval; multi-factor decay adjusts it only when the update count reaches each of the values in a list.

The parameters inherited from the base class LRScheduler keep their default values; on top of LRScheduler, the following parameters are added:

- step: the biggest difference from factor decay is that step is no longer a single integer but a list of integers; the learning rate is adjusted once each time the update count exceeds the next value in the list. The only difference is therefore the timing of the adjustments.
- factor: the decay factor; each adjustment multiplies the current learning rate by this factor
- base_lr: set to 1; with this configuration the schedule plotted below is obtained

(approximate plot of the resulting schedule omitted)

Source:

def __init__(self, step, factor=1, base_lr=0.01, warmup_steps=0,
             warmup_begin_lr=0, warmup_mode='linear'):
    super(MultiFactorScheduler, self).__init__(base_lr, warmup_steps,
                                               warmup_begin_lr, warmup_mode)
    assert isinstance(step, list) and len(step) >= 1
    for i, _step in enumerate(step):
        if i != 0 and step[i] <= step[i-1]:
            raise ValueError("Schedule step must be an increasing integer list")
        if _step < 1:
            raise ValueError("Schedule step must be greater or equal than 1 round")
    if factor > 1.0:
        raise ValueError("Factor must be no more than 1 to make lr reduce")
    self.step = step
    self.cur_step_ind = 0
    self.factor = factor
    self.count = 0

def __call__(self, num_update):
    if num_update < self.warmup_steps:
        return self.get_warmup_lr(num_update)
    # NOTE: use while rather than if (for continuing training via load_epoch)
    while self.cur_step_ind <= len(self.step)-1:
        if num_update > self.step[self.cur_step_ind]:
            self.count = self.step[self.cur_step_ind]
            self.cur_step_ind += 1
            self.base_lr *= self.factor
            logging.info("Update[%d]: Change learning rate to %0.5e",
                         num_update, self.base_lr)
        else:
            return self.base_lr
    return self.base_lr
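A usage sketch; the milestone list and factor are illustrative, and net is assumed to be an existing Gluon model:

import mxnet as mx

# multiply the learning rate by 0.1 after updates 300, 600 and 900
lrs = mx.lr_scheduler.MultiFactorScheduler(step=[300, 600, 900], factor=0.1, base_lr=0.1)
trainer = mx.gluon.Trainer(net.collect_params(), 'sgd',
                           {'learning_rate': 0.1, 'lr_scheduler': lrs})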

2.2.4 Polynomial decay: PolyScheduler

Factor-based schedules adjust the learning rate at discrete points; it is natural to use a continuous function instead, and since the adjustments should be large at the beginning and small towards the end, a power function whose base lies in the interval (0, 1) is a natural fit. The polynomial schedule uses exactly such a power function to adjust the learning rate. The parameters inherited from the base class LRScheduler keep their default values,

with warmup_steps = 0, meaning warmup is not used; the related parameters (warmup_begin_lr, warmup_mode) keep their default values.

On top of LRScheduler, the following parameters are added:

- base_lr: set to 1; the adjustment starts from update 0, and after the maximum number of updates (max_update = 1000) the learning rate no longer changes, as plotted below
- final_lr: the final learning rate

With warmup disabled, the learning rate follows lr = final_lr + (base_lr - final_lr) * (1 - num_update / max_update) ** pwr.

(approximate plot of the resulting polynomial decay omitted)

Source:

def __init__(self, max_update, base_lr=0.01, pwr=2, final_lr=0,
             warmup_steps=0, warmup_begin_lr=0, warmup_mode='linear'):
    super(PolyScheduler, self).__init__(base_lr, warmup_steps, warmup_begin_lr, warmup_mode)
    assert isinstance(max_update, int)
    if max_update < 1:
        raise ValueError("maximum number of updates must be strictly positive")
    self.power = pwr
    self.base_lr_orig = self.base_lr
    self.max_update = max_update
    self.final_lr = final_lr
    self.max_steps = self.max_update - self.warmup_steps

def __call__(self, num_update):
    if num_update < self.warmup_steps:
        return self.get_warmup_lr(num_update)
    if num_update <= self.max_update:
        self.base_lr = self.final_lr + (self.base_lr_orig - self.final_lr) * \
            pow(1 - float(num_update - self.warmup_steps) / float(self.max_steps), self.power)
    return self.base_lr
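A usage sketch; the values are illustrative and net is assumed to be an existing Gluon model:

import mxnet as mx

# learning rate decays from 0.1 towards 0 following (1 - t/1000)**2
lrs = mx.lr_scheduler.PolyScheduler(max_update=1000, base_lr=0.1, pwr=2, final_lr=0)
trainer = mx.gluon.Trainer(net.collect_params(), 'sgd',
                           {'learning_rate': 0.1, 'lr_scheduler': lrs})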

2.2.5 Cosine decay: CosineScheduler

Besides a polynomial, we can also use the monotonically decreasing part of the cosine function to schedule the learning rate. The parameters are the same as for the polynomial scheduler; only the internal function is replaced by a cosine:

- base_lr (initial learning rate): 1
- final_lr (final learning rate): 0.1
- max_update (maximum number of updates): 1000
- warmup_steps: 0

(approximate plot of the resulting cosine decay omitted)

Source:

# cos and pi come from Python's math module
def __init__(self, max_update, base_lr=0.01, final_lr=0,
             warmup_steps=0, warmup_begin_lr=0, warmup_mode='linear'):
    super(CosineScheduler, self).__init__(base_lr, warmup_steps, warmup_begin_lr, warmup_mode)
    assert isinstance(max_update, int)
    if max_update < 1:
        raise ValueError("maximum number of updates must be strictly positive")
    self.base_lr_orig = base_lr
    self.max_update = max_update
    self.final_lr = final_lr
    self.max_steps = self.max_update - self.warmup_steps

def __call__(self, num_update):
    if num_update < self.warmup_steps:
        return self.get_warmup_lr(num_update)
    if num_update <= self.max_update:
        self.base_lr = self.final_lr + (self.base_lr_orig - self.final_lr) * \
            (1 + cos(pi * (num_update - self.warmup_steps) / self.max_steps)) / 2
    return self.base_lr
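A usage sketch; the values are illustrative and net is assumed to be an existing Gluon model:

import mxnet as mx

# cosine decay from 0.1 down to 0.001 over 1000 updates
lrs = mx.lr_scheduler.CosineScheduler(max_update=1000, base_lr=0.1, final_lr=0.001)
trainer = mx.gluon.Trainer(net.collect_params(), 'sgd',
                           {'learning_rate': 0.1, 'lr_scheduler': lrs})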

    https://blog.csdn.net/gaussrieman123/article/details/99441123

3. Setting a separate learning rate for individual layers

In transfer learning (finetuning) we often want to freeze the learning rate of the pretrained layers, or make it smaller than that of the layers added afterwards. This requires setting different learning rates for different layers, which can be done with net.collect_params('re').setattr('lr_mult', ratio), where 're' stands for a regular expression and net is a model object. net.collect_params() returns a ParameterDict containing all of the network's parameters.

# function signature
def collect_params(self, select=None)

In this signature, select can be a regular expression, in which case collect_params() only selects the parameters matched by it; it can also be the concrete name of a specific layer:

model.collect_params('conv1_weight|conv1_bias|fc_weight|fc_bias')
model.collect_params('.*weight|.*bias')

When we want to set a separate learning rate, we first match out all of the relevant parameters with a regular expression. Take the ResNet50 below as an example; all of its parameters are listed as follows (some layers omitted in the middle):

print(net.collect_params())

resnet50v1 (
  Parameter resnet50v1batchnorm0_gamma (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1batchnorm0_beta (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1batchnorm0_running_mean (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1batchnorm0_running_var (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1conv0_weight (shape=(64, 0, 5, 5), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_batchnorm0_gamma (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_batchnorm0_beta (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_batchnorm0_running_mean (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_batchnorm0_running_var (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_conv0_weight (shape=(64, 0, 1, 1), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_batchnorm1_gamma (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_batchnorm1_beta (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_batchnorm1_running_mean (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_batchnorm1_running_var (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_conv1_weight (shape=(64, 0, 3, 3), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_batchnorm2_gamma (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_batchnorm2_beta (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_batchnorm2_running_mean (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_batchnorm2_running_var (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_conv2_weight (shape=(256, 0, 1, 1), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1conv1_weight (shape=(256, 0, 1, 1), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1batchnorm1_gamma (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1batchnorm1_beta (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1batchnorm1_running_mean (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1batchnorm1_running_var (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_batchnorm3_gamma (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_batchnorm3_beta (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_batchnorm3_running_mean (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_batchnorm3_running_var (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_conv3_weight (shape=(64, 0, 1, 1), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_batchnorm4_gamma (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_batchnorm4_beta (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_batchnorm4_running_mean (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_batchnorm4_running_var (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_conv4_weight (shape=(64, 0, 3, 3), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_batchnorm5_gamma (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_batchnorm5_beta (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_batchnorm5_running_mean (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_batchnorm5_running_var (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_conv5_weight (shape=(256, 0, 1, 1), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer1_batchnorm6_gamma (shape=(0,), dtype=<class 'numpy.float32'>)
  ..... ..... ..... ..... .....
  Parameter resnet50v1layer4_batchnorm7_running_var (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer4_conv7_weight (shape=(512, 0, 3, 3), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer4_batchnorm8_gamma (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer4_batchnorm8_beta (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer4_batchnorm8_running_mean (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer4_batchnorm8_running_var (shape=(0,), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1layer4_conv8_weight (shape=(2048, 0, 1, 1), dtype=<class 'numpy.float32'>)
  Parameter resnet50v1dense0_weight (shape=(10, 2048), dtype=float32)
  Parameter resnet50v1dense0_bias (shape=(10,), dtype=float32)
)

Suppose we want to increase the learning rate of the final fully connected layer. We can select its parameters with a regular expression; net.collect_params('.*dense') gives:

print(net.collect_params('.*dense'))

resnet50v1 (
  Parameter resnet50v1dense0_weight (shape=(10, 2048), dtype=float32)
  Parameter resnet50v1dense0_bias (shape=(10,), dtype=float32)
)

Once the parameters have been selected, all that remains is to set their lr_mult attribute. The effective learning rate of those parameters is lr * lr_mult; when lr_mult = 0, the parameters are not updated at all.

trainer = mx.gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1})
net.collect_params('.*dense').setattr('lr_mult', 3)

In net.collect_params(), a regular expression matches out the parameters whose learning rate we want to set separately; setattr() then sets their lr_mult factor, which effectively sets the learning rate of that layer. For this reason, when designing a network it pays to give each layer a distinctive prefix name, so that its parameters can be conveniently matched by a regular expression. The same approach can also be used to initialize the parameters of different layers separately.
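For example, in a finetuning setting the same mechanism can freeze the pretrained backbone and train only the new output layer. A minimal sketch, assuming net is a Gluon model whose output-layer parameters match '.*dense':

import mxnet as mx

# freeze every parameter, then re-enable learning only for the dense output layer
net.collect_params().setattr('lr_mult', 0)           # lr_mult = 0: these weights are not updated
net.collect_params('.*dense').setattr('lr_mult', 1)  # the new head trains at the base learning rate

trainer = mx.gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.01})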

    https://blog.csdn.net/xiangjiaojun_/article/details/85812248
