Contents
1. Parameters of train_supervised in FastText
2. Choosing parameters: grid search + cross-validation
2.1 The my_gridsearch_cv main method
2.2 get_gridsearch_params
2.3 get_KFold_scores
2.4 Usage example
3. Complete code
The complete code used in this post is available on my GitHub: https://github.com/qingyujean/eda-for-text-classification. Likes, stars and encouragement are very welcome!
1. Parameters of train_supervised in FastText
input              path to the training file (required)
model              skipgram or cbow, default skipgram (this one belongs to train_unsupervised; supervised training does not use it)
lr                 learning rate, default 0.1
dim                word vector dimension, default 100
ws                 context window size, default 5
epoch              number of epochs, default 5
minCount           minimum number of word occurrences, default 1
minCountLabel      minimum number of label occurrences, default 0
minn               minimum length of character n-grams, default 0
maxn               maximum length of character n-grams, default 0
neg                number of negatives sampled, default 5
wordNgrams         maximum length of word n-grams, default 1
loss               loss function, one of {ns, hs, softmax}, default softmax
bucket             number of buckets, default 2000000
thread             number of threads, default multiprocessing.cpu_count() - 1
lrUpdateRate       rate at which the learning rate is updated, default 100
t                  sampling threshold, default 0.0001
label              label prefix, default '__label__'
verbose            verbosity level, default 2
pretrainedVectors  pretrained word vectors (.vec file) for supervised learning, default ""
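Because `label` defaults to `__label__`, each line of the training file is expected to carry its class as a token with that prefix. The sketch below builds such lines in the same two-column layout (text, tab, label) that the cross-validation code in this post writes to disk. The helper name `to_fasttext_line` and the sample sentences are invented for illustration, and the whitespace clean-up is an assumption about what the data needs, not something the original code does.

```python
LABEL_PREFIX = "__label__"  # fastText's default value for the `label` parameter


def to_fasttext_line(text, class_id, prefix=LABEL_PREFIX):
    """Return one training line: the text, a tab, then the prefixed label.

    Hypothetical helper. Newlines and tabs inside the text are collapsed to
    single spaces so that every sample stays on one line.
    """
    clean = " ".join(text.split())
    return f"{clean}\t{prefix}{class_id}"


# Made-up samples: (text, integer class id)
samples = [("the battery lasts all day", 1), ("screen cracked\nafter a week", 0)]
lines = [to_fasttext_line(text, class_id) for text, class_id in samples]
print("\n".join(lines))
```

Keeping each sample on a single line matters because the training file is read line by line.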
2. Choosing parameters: grid search + cross-validation
2.1 The my_gridsearch_cv main method
def my_gridsearch_cv(df, param_grid, metrics, kfold=10):
    # Column 1 of df holds the labels; count the distinct classes.
    n_classes = len(np.unique(df[1]))
    print('n_classes', n_classes)

    kf = KFold(n_splits=kfold)  # k-fold splitter; rows are split in order (no shuffling)

    # Every combination of the values listed in param_grid.
    params_combination = get_gridsearch_params(param_grid)

    best_score = 0.0
    best_params = dict()
    for params in params_combination:
        # Mean score of this combination over the k validation folds.
        avg_score = get_KFold_scores(df, params, kf, metrics, n_classes)
        if avg_score > best_score:
            best_score = avg_score
            best_params = copy.deepcopy(params)

    return best_score, best_params
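Two helpers do the work here: get_gridsearch_params enumerates every combination of the parameter values, and get_KFold_scores computes the score of each combination on the cross-validation folds.
2.2 get_gridsearch_params
Build every combination of the candidate values of each parameter.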
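def get_gridsearch_params(param_grid):
    params_combination = [dict()]  # start with one empty combination
    for k, v_list in param_grid.items():
        tmp = [{k: v} for v in v_list]  # one single-key dict per candidate value
        n = len(params_combination)
        # Duplicate the combinations built so far: one block of n copies per candidate value.
        copy_params = [copy.deepcopy(params_combination) for _ in range(len(tmp))]
        params_combination = sum(copy_params, [])
        # Block i (positions i*n .. i*n+n-1) receives the i-th candidate value.
        for j in range(n):
            for i in range(len(tmp)):
                params_combination[i * n + j].update(tmp[i])
    return params_combination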
For example, suppose tuned_parameters is:
tuned_parameters = {
    'lr': [0.1],
    'epoch': [15, 20, 25],
    'dim': [50, 100, 150],
    'wordNgrams': [2],
}
This grid yields 1 x 3 x 3 x 1 = 9 combinations, which the following call confirms:
print(get_gridsearch_params(tuned_parameters))
Output:
[
{'lr': 0.1, 'epoch': 15, 'dim': 50, 'wordNgrams': 2},
{'lr': 0.1, 'epoch': 20, 'dim': 50, 'wordNgrams': 2},
{'lr': 0.1, 'epoch': 25, 'dim': 50, 'wordNgrams': 2},
{'lr': 0.1, 'epoch': 15, 'dim': 100, 'wordNgrams': 2},
{'lr': 0.1, 'epoch': 20, 'dim': 100, 'wordNgrams': 2},
{'lr': 0.1, 'epoch': 25, 'dim': 100, 'wordNgrams': 2},
{'lr': 0.1, 'epoch': 15, 'dim': 150, 'wordNgrams': 2},
{'lr': 0.1, 'epoch': 20, 'dim': 150, 'wordNgrams': 2},
{'lr': 0.1, 'epoch': 25, 'dim': 150, 'wordNgrams': 2}
]
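The same expansion can also be written with the standard library's `itertools.product`. This is an alternative sketch rather than the code used in this post: the function name `grid_with_product` is invented here, and it yields the same nine combinations in a different order (the last parameter varies fastest).

```python
from itertools import product


def grid_with_product(param_grid):
    """Expand a {name: [values]} grid into a list of {name: value} dicts."""
    names = list(param_grid)
    return [dict(zip(names, values)) for values in product(*param_grid.values())]


tuned_parameters = {
    'lr': [0.1],
    'epoch': [15, 20, 25],
    'dim': [50, 100, 150],
    'wordNgrams': [2],
}
combos = grid_with_product(tuned_parameters)
print(len(combos))  # 9
print(combos[0])
```

The order does not affect which combination wins the search, except when two combinations tie on score, because only a strictly better score replaces the current best.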
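2.3 get_KFold_scores
Run k-fold cross-validation for one parameter combination and return its average score. The caller, my_gridsearch_cv, keeps the best score and the parameters that produced it.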
def get_KFold_scores(df, params, kf, metric, n_classes):
    metric_score = 0.0
    for train_idx, val_idx in kf.split(df):
        df_train = df.iloc[train_idx]
        df_val = df.iloc[val_idx]

        # fastText trains from a file, so write this fold's training rows to a temporary directory.
        tmpdir = tempfile.mkdtemp()
        tmp_train_file = tmpdir + '/train.txt'
        df_train.to_csv(tmp_train_file, sep='\t', index=False, header=False, encoding='UTF-8')

        fast_model = fasttext.train_supervised(tmp_train_file, label_prefix='__label__', thread=3, **params)

        # predict() returns (labels, probabilities); labels has one entry per input text.
        predicted = fast_model.predict(df_val[0].tolist())
        # Only the last character of each label is parsed, so class ids must be single digits.
        y_val_pred = [int(label[0][-1:]) for label in predicted[0]]
        y_val = [int(cls[-1:]) for cls in df_val[1]]

        score = get_metrics(y_val, y_val_pred, n_classes)[metric]
        metric_score += score

        shutil.rmtree(tmpdir, ignore_errors=True)  # remove this fold's temporary files

    print('Average score:', metric_score / kf.n_splits)
    return metric_score / kf.n_splits
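It helps to see what the splitter hands to this loop. With its default settings, `KFold` does not shuffle, so each validation fold is a contiguous block of rows. The pure-Python sketch below mimics that behaviour for illustration only; `kfold_indices`, `average_score` and the `score_fold` callback are hypothetical names, not part of scikit-learn or of this post's code.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, val_idx) lists for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)  # the first `extra` folds get one more row
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


def average_score(n_samples, n_splits, score_fold):
    """Average a per-fold score; `score_fold` is a hypothetical callback."""
    scores = [score_fold(train_idx, val_idx)
              for train_idx, val_idx in kfold_indices(n_samples, n_splits)]
    return sum(scores) / n_splits


folds = list(kfold_indices(7, 3))
for train_idx, val_idx in folds:
    print('val:', val_idx, 'train:', train_idx)
```

If the training file happens to be sorted by label, contiguous folds can leave a validation fold dominated by a single class. Passing `shuffle=True` to `KFold`, or switching to `StratifiedKFold` (already imported in the usage example), are two ways to avoid that.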
2.4 Usage example
import fasttext
from sklearn.model_selection import KFold, StratifiedKFold
import numpy as np
import pandas as pd
import copy
import tempfile
import shutil
from fast import get_metrics  # project-local module, not shown in this post

DATA_PATH = '../data/'

tuned_parameters = {
    'lr': [0.1, 0.05],
    'epoch': [15, 20, 25, 30],
    'dim': [50, 100, 150, 200],
    'wordNgrams': [2, 3],
}

if __name__ == '__main__':
    filepath = DATA_PATH + 'fast/augmented/js_pd_tagged_train.txt'
    # Tab-separated file without a header row: column 0 is the text, column 1 the label.
    df = pd.read_csv(filepath, encoding='UTF-8', sep='\t', header=None, index_col=False, usecols=[0, 1])
    print(df.head())
    print(df.shape)

    best_score, best_params = my_gridsearch_cv(df, tuned_parameters, 'accuracy', kfold=5)
    print('best_score', best_score)
    print('best_params', best_params)
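`get_metrics` comes from the project's `fast` module and is not listed in this post. Judging only from the call site, it takes the true labels, the predicted labels and the number of classes, and returns something that can be indexed by a metric name such as 'accuracy'. The following is a hypothetical stand-in written under that assumption; the name `get_metrics_sketch` and the 'macro_f1' key are invented here, and the real implementation in the repository may differ.

```python
def get_metrics_sketch(y_true, y_pred, n_classes):
    """Return accuracy and macro-averaged F1 for integer class ids 0..n_classes-1."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)

    f1_per_class = []
    for c in range(n_classes):
        tp = sum(1 for t, p in pairs if t == c and p == c)  # true positives
        fp = sum(1 for t, p in pairs if t != c and p == c)  # false positives
        fn = sum(1 for t, p in pairs if t == c and p != c)  # false negatives
        denom = 2 * tp + fp + fn
        f1_per_class.append(2 * tp / denom if denom else 0.0)

    return {'accuracy': accuracy, 'macro_f1': sum(f1_per_class) / n_classes}


# Toy labels, made up for the example.
scores = get_metrics_sketch([0, 1, 1, 0], [0, 1, 0, 0], n_classes=2)
print(scores['accuracy'])
```

Returning a dict keyed by metric name is what lets the search code pick a metric with a plain string, as in the call `my_gridsearch_cv(df, tuned_parameters, 'accuracy', kfold=5)`.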
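3. Complete code
The complete code is on my GitHub: https://github.com/qingyujean/eda-for-text-classification. Likes, stars and encouragement are very welcome!
Finally, if you spot any mistakes in this post, please do point them out. Thank you!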