TASK 3: Feature Engineering

    Technology 2022-09-07  151

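    3.1 Learning objectives

    Learn feature-processing methods: preprocessing, missing-value handling, outlier handling, and data binning.
    Learn the corresponding methods for feature interaction, encoding, and selection.
    Complete the corresponding study check-in task.

    3.2 Content overview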

    Data preprocessing
        Filling missing values
        Handling time-format features
        Converting object-type features to numeric
    Outlier handling
        Based on the 3-sigma rule
        Based on box plots
    Data binning
        Fixed-width binning
        Quantile binning
            Binning discrete numeric data
            Binning continuous numeric data
        Chi-square binning (optional exercise)
    Feature interaction
        Combining features with each other
        Deriving features from other features
        Other derived-feature experiments (optional exercise)
    Feature encoding
        one-hot encoding
        label-encode encoding
    Feature selection
        1 Filter
        2 Wrapper (RFE)
        3 Embedded

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    import datetime
    from tqdm import tqdm
    from sklearn.preprocessing import LabelEncoder
    from sklearn.feature_selection import SelectKBest
    from sklearn.feature_selection import chi2
    from sklearn.preprocessing import MinMaxScaler
    import xgboost as xgb
    import lightgbm as lgb
    from catboost import CatBoostRegressor
    import warnings
    from sklearn.model_selection import StratifiedKFold, KFold
    from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, log_loss
    warnings.filterwarnings('ignore')

    data_train = pd.read_csv('train.csv')
    data_test_a = pd.read_csv('testA.csv')

    PS: the imports above include a lot of packages I was seeing for the first time, so here are some brief notes on what they are and how to use them.

    tqdm is a fast, extensible progress bar for Python. It adds progress information to long-running loops; you only need to wrap any iterator as tqdm(iterator).

    import time
    from tqdm import tqdm
    for i in tqdm(range(0, 10)):
        time.sleep(0.1)

    100%|██████████| 10/10 [00:01<00:00, 9.89it/s]

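    When processing feature vectors we often need to turn categorical variables into numeric ones. For example, a feature holding the three place names 'amsterdam', 'paris', 'tokyo' cannot be fed to a model directly and has to be converted to numbers, say amsterdam to 0, paris to 1, and tokyo to 2. LabelEncoder performs this encoding.

    It has two main uses.

    LabelEncoder can be used to normalize labels:

    from sklearn.preprocessing import LabelEncoder
    le = LabelEncoder()
    le.fit([1, 2, 2, 6])                 # fit the encoder on the list [1, 2, 2, 6]
    le.classes_                          # the learned classes
    le.transform([1, 1, 2, 6])           # encode labels as consecutive integers
    le.inverse_transform([0, 0, 1, 2])   # map the integer codes back to the original labels

    array([1, 1, 2, 6])

    It can also convert non-numeric labels to numeric labels:

    le.fit(["paris", "paris", "tokyo", "amsterdam"])         # fit the encoder on the string labels
    le.classes_                                              # the learned classes
    le.transform(["paris", "paris", "tokyo", "amsterdam"])   # encode labels as consecutive integers
    le.inverse_transform([1, 1, 2, 0])                       # map the integer codes back to the original labels

    array(['paris', 'paris', 'tokyo', 'amsterdam'], dtype='<U9')

    Further reading on the other imports:
    from sklearn.feature_selection import SelectKBest / from sklearn.feature_selection import chi2: https://blog.csdn.net/weixin_46072771/article/details/106182410 (I don't fully understand this yet; saving it to read later)
    MinMaxScaler: https://blog.csdn.net/weixin_45850137/article/details/105936893
    xgboost: https://blog.csdn.net/sb19931201/article/details/52557382
    CatBoostRegressor: https://blog.csdn.net/linxid/article/details/80723811 and https://blog.csdn.net/fengdu78/article/details/103908062
    An overall framework worth borrowing from: https://blog.csdn.net/weixin_41358871/article/details/81350546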

    3.3.2 Feature preprocessing

    The EDA part already gave us a rough picture of the data and of some feature distributions. Preprocessing usually deals with the problems found during EDA; this section covers filling missing values, converting time-format features, and handling some object-type categorical features. First we separate the object features from the numeric features.

    data_train.info()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 800000 entries, 0 to 799999
    Data columns (total 47 columns):
     #   Column              Non-Null Count   Dtype
    ---  ------              --------------   -----
     0   id                  800000 non-null  int64
     1   loanAmnt            800000 non-null  float64
     2   term                800000 non-null  int64
     3   interestRate        800000 non-null  float64
     4   installment         800000 non-null  float64
     5   grade               800000 non-null  object
     6   subGrade            800000 non-null  object
     7   employmentTitle     799999 non-null  float64
     8   employmentLength    753201 non-null  object
     9   homeOwnership       800000 non-null  int64
     10  annualIncome        800000 non-null  float64
     11  verificationStatus  800000 non-null  int64
     12  issueDate           800000 non-null  object
     13  isDefault           800000 non-null  int64
     14  purpose             800000 non-null  int64
     15  postCode            799999 non-null  float64
     16  regionCode          800000 non-null  int64
     17  dti                 799761 non-null  float64
     18  delinquency_2years  800000 non-null  float64
     19  ficoRangeLow        800000 non-null  float64
     20  ficoRangeHigh       800000 non-null  float64
     21  openAcc             800000 non-null  float64
     22  pubRec              800000 non-null  float64
     23  pubRecBankruptcies  799595 non-null  float64
     24  revolBal            800000 non-null  float64
     25  revolUtil           799469 non-null  float64
     26  totalAcc            800000 non-null  float64
     27  initialListStatus   800000 non-null  int64
     28  applicationType     800000 non-null  int64
     29  earliesCreditLine   800000 non-null  object
     30  title               799999 non-null  float64
     31  policyCode          800000 non-null  float64
     32  n0                  759730 non-null  float64
     33  n1                  759730 non-null  float64
     34  n2                  759730 non-null  float64
     35  n3                  759730 non-null  float64
     36  n4                  766761 non-null  float64
     37  n5                  759730 non-null  float64
     38  n6                  759730 non-null  float64
     39  n7                  759730 non-null  float64
     40  n8                  759729 non-null  float64
     41  n9                  759730 non-null  float64
     42  n10                 766761 non-null  float64
     43  n11                 730248 non-null  float64
     44  n12                 759730 non-null  float64
     45  n13                 759730 non-null  float64
     46  n14                 759730 non-null  float64
    dtypes: float64(33), int64(9), object(5)
    memory usage: 286.9+ MB

    numerical_fea = list(data_train.select_dtypes(exclude=['object']).columns)
    category_fea = list(data_train.select_dtypes(include=['object']).columns)
    label = 'isDefault'
    numerical_fea.remove(label)

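    Preprocessing is an indispensable part of a competition. How missing values are filled often affects the final result, so it is worth trying several fill strategies, comparing the scores, and keeping the best one. Competition data is relatively "clean" compared with data from real scenarios, but some "dirty" data remains, and cleaning out some outliers often brings surprisingly good results.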

    numerical_fea

    ['id', 'loanAmnt', 'term', 'interestRate', 'installment', 'employmentTitle', 'homeOwnership', 'annualIncome', 'verificationStatus', 'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years', 'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec', 'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc', 'initialListStatus', 'applicationType', 'title', 'policyCode', 'n0', 'n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7', 'n8', 'n9', 'n10', 'n11', 'n12', 'n13', 'n14']

    Filling missing values

    Replace every missing value with the specified value 0: data_train = data_train.fillna(0)
    Fill downwards, replacing each missing value with the value above it: data_train = data_train.fillna(axis=0, method='ffill')
    Fill upwards, replacing each missing value with the value below it, filling at most two consecutive missing values: data_train = data_train.fillna(axis=0, method='bfill', limit=2)

    data_train.head()

       id  loanAmnt  term  interestRate  installment  grade  subGrade  employmentTitle  employmentLength  homeOwnership  ...    n5    n6    n7    n8   n9   n10  n11  n12  n13  n14
    0   0   35000.0     5         19.52       917.97      E        E2            320.0           2 years              2  ...   9.0   8.0   4.0  12.0  2.0   7.0  0.0  0.0  0.0  2.0
    1   1   18000.0     5         18.49       461.90      D        D2         219843.0           5 years              0  ...   NaN   NaN   NaN   NaN  NaN  13.0  NaN  NaN  NaN  NaN
    2   2   12000.0     5         16.99       298.17      D        D3          31698.0           8 years              0  ...   0.0  21.0   4.0   5.0  3.0  11.0  0.0  0.0  0.0  4.0
    3   3   11000.0     3          7.26       340.96      A        A4          46854.0         10+ years              1  ...  16.0   4.0   7.0  21.0  6.0   9.0  0.0  0.0  0.0  1.0
    4   4    3000.0     3         12.99       101.07      C        C2             54.0               NaN              1  ...   4.0   9.0  10.0  15.0  7.0  12.0  0.0  0.0  0.0  4.0

    5 rows × 47 columns
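To see what the three fill strategies actually do, here is a minimal sketch on a made-up Series (the values are for illustration only, not taken from the dataset). It assumes a reasonably recent pandas, where `fillna(method=...)` is deprecated in favour of the equivalent `.ffill()` and `.bfill()` methods.

```python
import numpy as np
import pandas as pd

# Toy column with a run of three missing values (illustrative data only).
s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

filled_zero = s.fillna(0)        # constant fill
filled_ffill = s.ffill()         # copy the previous valid value downwards
filled_bfill = s.bfill(limit=2)  # copy the next valid value upwards, at most 2 in a row

print(filled_zero.tolist())
print(filled_ffill.tolist())
print(filled_bfill.tolist())
```

With `limit=2` the backward fill reaches only the two gaps nearest to the next valid value, so the first gap of the run stays missing.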

    # Check the missing values
    data_train.isnull().sum()

    id                        0
    loanAmnt                  0
    term                      0
    interestRate              0
    installment               0
    grade                     0
    subGrade                  0
    employmentTitle           1
    employmentLength      46799
    homeOwnership             0
    annualIncome              0
    verificationStatus        0
    issueDate                 0
    isDefault                 0
    purpose                   0
    postCode                  1
    regionCode                0
    dti                     239
    delinquency_2years        0
    ficoRangeLow              0
    ficoRangeHigh             0
    openAcc                   0
    pubRec                    0
    pubRecBankruptcies      405
    revolBal                  0
    revolUtil               531
    totalAcc                  0
    initialListStatus         0
    applicationType           0
    earliesCreditLine         0
    title                     1
    policyCode                0
    n0                    40270
    n1                    40270
    n2                    40270
    n3                    40270
    n4                    33239
    n5                    40270
    n6                    40270
    n7                    40270
    n8                    40271
    n9                    40270
    n10                   33239
    n11                   69752
    n12                   40270
    n13                   40270
    n14                   40270
    dtype: int64

    # Fill numeric features with the median
    data_train[numerical_fea] = data_train[numerical_fea].fillna(data_train[numerical_fea].median())
    data_test_a[numerical_fea] = data_test_a[numerical_fea].fillna(data_test_a[numerical_fea].median())
    # Fill categorical features with the mode
    # Note: DataFrame.mode() returns a DataFrame, and fillna aligns it on the row index,
    # so this call leaves almost all missing values unfilled (see employmentLength in the
    # output below). Filling every row with the column mode needs .mode().iloc[0].
    data_train[category_fea] = data_train[category_fea].fillna(data_train[category_fea].mode())
    data_test_a[category_fea] = data_test_a[category_fea].fillna(data_test_a[category_fea].mode())

    data_train.isnull().sum()

    id                        0
    loanAmnt                  0
    term                      0
    interestRate              0
    installment               0
    grade                     0
    subGrade                  0
    employmentTitle           0
    employmentLength      46799
    homeOwnership             0
    annualIncome              0
    verificationStatus        0
    issueDate                 0
    isDefault                 0
    purpose                   0
    postCode                  0
    regionCode                0
    dti                       0
    delinquency_2years        0
    ficoRangeLow              0
    ficoRangeHigh             0
    openAcc                   0
    pubRec                    0
    pubRecBankruptcies        0
    revolBal                  0
    revolUtil                 0
    totalAcc                  0
    initialListStatus         0
    applicationType           0
    earliesCreditLine         0
    title                     0
    policyCode                0
    n0                        0
    n1                        0
    n2                        0
    n3                        0
    n4                        0
    n5                        0
    n6                        0
    n7                        0
    n8                        0
    n9                        0
    n10                       0
    n11                       0
    n12                       0
    n13                       0
    n14                       0
    dtype: int64

    # Look at the categorical features
    category_fea

    ['grade', 'subGrade', 'employmentLength', 'issueDate', 'earliesCreditLine']

    category_fea: the object-type categorical features need preprocessing; among them, ['issueDate'] is a time-format feature.
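The second `isnull().sum()` output still shows 46799 missing values in employmentLength, which suggests the mode fill did not take effect. A minimal sketch of why, on a made-up two-column frame (the data is illustrative only): `DataFrame.mode()` returns a DataFrame, and `fillna` aligns a DataFrame argument on the row index, whereas `.mode().iloc[0]` is a Series keyed by column name and fills each column with its own mode.

```python
import numpy as np
import pandas as pd

# Toy categorical frame (illustrative data only).
df = pd.DataFrame({
    "grade": ["A", "B", np.nan, "B", np.nan],
    "employmentLength": ["2 years", np.nan, "2 years", "5 years", np.nan],
})

# mode() returns a DataFrame; fillna aligns it on the row index, so only rows
# whose index also appears in the mode frame (here row 0) can be filled.
aligned = df.fillna(df.mode())

# Taking the first row gives a Series indexed by column name, which fillna
# applies column by column.
by_column = df.fillna(df.mode().iloc[0])

print(aligned.isnull().sum().to_dict())
print(by_column.isnull().sum().to_dict())
```

Whether to fill employmentLength at all is a separate decision: the later conversion step in these notes keeps its missing values as NaN.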

    Handling the time format

    # Convert to datetime
    for data in [data_train, data_test_a]:
        data['issueDate'] = pd.to_datetime(data['issueDate'], format='%Y-%m-%d')
        startdate = datetime.datetime.strptime('2007-06-01', '%Y-%m-%d')
        # Build a time feature: number of days since startdate
        data['issueDateDT'] = data['issueDate'].apply(lambda x: x - startdate).dt.days

    data    # after the loop, data refers to data_test_a

                id  loanAmnt  term  interestRate  installment  grade  subGrade  employmentTitle  employmentLength  homeOwnership  ...    n6    n7    n8    n9   n10  n11  n12  n13  n14  issueDateDT
    0       800000   14000.0     3         10.99       458.28      B        B3           7027.0         10+ years              0  ...   4.0  15.0  19.0   6.0  17.0  0.0  0.0  1.0  3.0         2587
    1       800001   20000.0     5         14.65       472.14      C        C5          60426.0         10+ years              0  ...   3.0   3.0   9.0   3.0   5.0  0.0  0.0  2.0  2.0         2952
    2       800002   12000.0     3         19.99       445.91      D        D4          23547.0           2 years              1  ...  36.0   5.0   6.0   4.0  12.0  0.0  0.0  0.0  7.0         3410
    3       800003   17500.0     5         14.31       410.02      C        C4            636.0           4 years              0  ...   2.0   8.0  14.0   2.0  10.0  0.0  0.0  0.0  3.0         2710
    4       800004   35000.0     3         17.09      1249.42      D        D1         368446.0          < 1 year              1  ...   3.0  16.0  18.0  11.0  19.0  0.0  0.0  0.0  1.0         3775
    ...        ...       ...   ...           ...          ...    ...       ...              ...               ...            ...  ...   ...   ...   ...   ...   ...  ...  ...  ...  ...          ...
    199995  999995    7000.0     3         11.14       229.64      B        B2         330967.0           7 years              1  ...  11.0   2.0   6.0   2.0   8.0  0.0  0.0  0.0  4.0         1949
    199996  999996    6000.0     3          6.24       183.19      A        A2          38930.0            1 year              1  ...  14.0  12.0  13.0   6.0  25.0  0.0  0.0  0.0  0.0         3044
    199997  999997   14000.0     5         15.88       339.57      C        C4         282016.0           8 years              2  ...  18.0  21.0  42.0  13.0  21.0  0.0  0.0  0.0  0.0         2222
    199998  999998    8000.0     3         18.06       289.47      D        D2             97.0           4 years              1  ...   5.0   8.0  19.0   6.0  11.0  0.0  0.0  0.0  2.0         3775
    199999  999999    8000.0     3          6.68       245.85      A        A3            320.0           7 years              1  ...   4.0   3.0   4.0   2.0   4.0  0.0  0.0  0.0  0.0         2802

    200000 rows × 47 columns

    Reference: https://blog.csdn.net/littlehaes/article/details/103892187

    data_train['employmentLength'].value_counts(dropna=False).sort_index()

    1 year        52489
    10+ years    262753
    2 years       72358
    3 years       64152
    4 years       47985
    5 years       50102
    6 years       37254
    7 years       35407
    8 years       36192
    9 years       30272
    < 1 year      64237
    NaN           46799
    Name: employmentLength, dtype: int64

    Reference: https://blog.csdn.net/kai123wen/article/details/99321824

    Converting object-type features to numeric

    def employmentLength_to_int(s):
        # Keep missing values as they are; otherwise take the leading number of years
        if pd.isnull(s):
            return s
        else:
            return np.int8(s.split()[0])

    for data in [data_train, data_test_a]:
        # Assign the result back instead of calling replace(..., inplace=True) on the attribute
        data['employmentLength'] = data['employmentLength'].replace({'10+ years': '10 years', '< 1 year': '0 years'})
        data['employmentLength'] = data['employmentLength'].apply(employmentLength_to_int)

    data    # after the loop, data refers to data_test_a

                id  loanAmnt  term  interestRate  installment  grade  subGrade  employmentTitle  employmentLength  homeOwnership  ...    n6    n7    n8    n9   n10  n11  n12  n13  n14  issueDateDT
    0       800000   14000.0     3         10.99       458.28      B        B3           7027.0              10.0              0  ...   4.0  15.0  19.0   6.0  17.0  0.0  0.0  1.0  3.0         2587
    1       800001   20000.0     5         14.65       472.14      C        C5          60426.0              10.0              0  ...   3.0   3.0   9.0   3.0   5.0  0.0  0.0  2.0  2.0         2952
    2       800002   12000.0     3         19.99       445.91      D        D4          23547.0               2.0              1  ...  36.0   5.0   6.0   4.0  12.0  0.0  0.0  0.0  7.0         3410
    3       800003   17500.0     5         14.31       410.02      C        C4            636.0               4.0              0  ...   2.0   8.0  14.0   2.0  10.0  0.0  0.0  0.0  3.0         2710
    4       800004   35000.0     3         17.09      1249.42      D        D1         368446.0               0.0              1  ...   3.0  16.0  18.0  11.0  19.0  0.0  0.0  0.0  1.0         3775
    ...        ...       ...   ...           ...          ...    ...       ...              ...               ...            ...  ...   ...   ...   ...   ...   ...  ...  ...  ...  ...          ...
    199995  999995    7000.0     3         11.14       229.64      B        B2         330967.0               7.0              1  ...  11.0   2.0   6.0   2.0   8.0  0.0  0.0  0.0  4.0         1949
    199996  999996    6000.0     3          6.24       183.19      A        A2          38930.0               1.0              1  ...  14.0  12.0  13.0   6.0  25.0  0.0  0.0  0.0  0.0         3044
    199997  999997   14000.0     5         15.88       339.57      C        C4         282016.0               8.0              2  ...  18.0  21.0  42.0  13.0  21.0  0.0  0.0  0.0  0.0         2222
    199998  999998    8000.0     3         18.06       289.47      D        D2             97.0               4.0              1  ...   5.0   8.0  19.0   6.0  11.0  0.0  0.0  0.0  2.0         3775
    199999  999999    8000.0     3          6.68       245.85      A        A3            320.0               7.0              1  ...   4.0   3.0   4.0   2.0   4.0  0.0  0.0  0.0  0.0         2802

    200000 rows × 47 columns

    data.employmentLength.value_counts(dropna=False).sort_index()    # data is still data_test_a here

    0.0     15989
    1.0     13182
    2.0     18207
    3.0     16011
    4.0     11833
    5.0     12543
    6.0      9328
    7.0      8823
    8.0      8976
    9.0      7594
    10.0    65772
    NaN     11742
    Name: employmentLength, dtype: int64

    Preprocessing earliesCreditLine

    data_train['earliesCreditLine'].sample(5)

    513127    Aug-1999
    4329      Feb-2000
    719226    Dec-1984
    316424    Oct-1999
    157586    Sep-2002
    Name: earliesCreditLine, dtype: object

    Reference: https://blog.csdn.net/marraybug/article/details/84972816

    # Keep only the year: the last four characters of strings such as 'Aug-1999'
    for data in [data_train, data_test_a]:
        data['earliesCreditLine'] = data['earliesCreditLine'].apply(lambda x: int(x[-4:]))

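    Handling categorical features

    # Some of the categorical features
    cate_features = ['grade', 'subGrade', 'employmentTitle', 'homeOwnership', 'verificationStatus', 'purpose', 'postCode', 'regionCode',
                     'applicationType', 'initialListStatus', 'title', 'policyCode']
    for f in cate_features:
        print(f, 'unique values:', data[f].nunique())    # data is still data_test_a here

    grade unique values: 7
    subGrade unique values: 35
    employmentTitle unique values: 79282
    homeOwnership unique values: 6
    verificationStatus unique values: 3
    purpose unique values: 14
    postCode unique values: 889
    regionCode unique values: 51
    applicationType unique values: 2
    initialListStatus unique values: 2
    title unique values: 12058
    policyCode unique values: 1

    A categorical feature such as grade has a natural order, so it can be label-encoded or mapped by hand:

    for data in [data_train, data_test_a]:
        data['grade'] = data['grade'].map({'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5, 'F': 6, 'G': 7})

    # Purely categorical features with more than 2 values that are not high-dimensional and sparse
    # Note: get_dummies returns a new DataFrame, and assigning it to the loop variable only
    # rebinds the name data; data_train and data_test_a themselves are left unchanged.
    for data in [data_train, data_test_a]:
        data = pd.get_dummies(data, columns=['subGrade', 'homeOwnership', 'verificationStatus', 'purpose', 'regionCode'], drop_first=True)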
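The one-hot step above assigns the result of `pd.get_dummies` to the loop variable, which does not change the original frames. A minimal sketch of the pitfall and of one possible way around it, on made-up frames named `train` and `test` (hypothetical stand-ins for data_train and data_test_a): concatenating before encoding also keeps the dummy columns of both frames identical. This is one option under those assumptions, not necessarily what the original notebook intended.

```python
import pandas as pd

# Toy stand-ins for data_train / data_test_a (illustrative data only).
train = pd.DataFrame({"homeOwnership": [0, 1, 2], "loanAmnt": [1000.0, 2000.0, 3000.0]})
test = pd.DataFrame({"homeOwnership": [2, 0, 1], "loanAmnt": [1500.0, 2500.0, 3500.0]})

# Rebinding the loop variable leaves the original frames untouched.
for data in [train, test]:
    data = pd.get_dummies(data, columns=["homeOwnership"], drop_first=True)
print("homeOwnership" in train.columns)  # the raw column is still there

# Encoding train and test together keeps their dummy columns identical.
combined = pd.concat([train, test], keys=["train", "test"])
combined = pd.get_dummies(combined, columns=["homeOwnership"], drop_first=True)
train_enc = combined.loc["train"]
test_enc = combined.loc["test"]
print(list(train_enc.columns))
```

Note that the later target-mean and label-encoding steps in these notes still read the raw subGrade column, so actually applying the one-hot encoding would require adjusting them.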

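    3.3.3 Outlier handling

    Once you find outliers, first work out what caused them, and only then decide how to handle them. If an outlier does not represent a regular pattern but a purely accidental event, or you simply do not want to study such accidental events, it can be deleted. If the outlier is real and represents a phenomenon that genuinely exists, it must not be deleted casually. In current fraud scenarios, fraudulent records are often outliers relative to normal records; such points should be kept, the model refitted, and their pattern studied. Use a supervised model where possible; where that is not possible, consider anomaly-detection algorithms. Note that rows in the test data must not be deleted.

    Outlier detection method 1: standard deviation

    In statistics, if a distribution is approximately normal, about 68% of the values lie within one standard deviation of the mean, about 95% within two standard deviations, and about 99.7% within three standard deviations.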

    def find_outliers_by_3segama(data, fea):
        data_std = np.std(data[fea])
        data_mean = np.mean(data[fea])
        outliers_cut_off = data_std * 3
        lower_rule = data_mean - outliers_cut_off
        upper_rule = data_mean + outliers_cut_off
        # Flag values more than three standard deviations away from the mean
        data[fea + '_outliers'] = data[fea].apply(lambda x: 'outlier' if x > upper_rule or x < lower_rule else 'normal')
        return data

    With the outliers of each feature flagged, we can go on to analyse how a variable's outliers relate to the target variable:

    data_train = data_train.copy()
    for fea in numerical_fea:
        data_train = find_outliers_by_3segama(data_train, fea)
        print(data_train[fea + '_outliers'].value_counts())
        print(data_train.groupby(fea + '_outliers')['isDefault'].sum())
        print('*' * 10)

    Output, condensed to one row per feature: the value_counts of the flag column and the sum of isDefault in each group ("-" means no row was flagged as an outlier):

    feature               normal   outlier   isDefault sum (outlier)   isDefault sum (normal)
    id                    800000         -                         -                   159610
    loanAmnt              800000         -                         -                   159610
    term                  800000         -                         -                   159610
    interestRate          794259      5741                      2916                   156694
    installment           792046      7954                      2152                   157458
    employmentTitle       800000         -                         -                   159610
    homeOwnership         799701       299                        62                   159548
    annualIncome          793973      6027                       756                   158854
    verificationStatus    800000         -                         -                   159610
    purpose               783003     16997                      3635                   155975
    postCode              798931      1069                       221                   159389
    regionCode            799994         6                         1                   159609
    dti                   798440      1560                       466                   159144
    delinquency_2years    778245     21755                      5089                   154521
    ficoRangeLow          788261     11739                       778                   158832
    ficoRangeHigh         788261     11739                       778                   158832
    openAcc               790889      9111                      2195                   157415
    pubRec                792471      7529                      1701                   157909
    pubRecBankruptcies    794120      5880                      1423                   158187
    revolBal              790001      9999                      1359                   158251
    revolUtil             799948        52                        23                   159587
    totalAcc              791663      8337                      1668                   157942
    initialListStatus     800000         -                         -                   159610
    applicationType       784586     15414                      3875                   155735
    title                 775134     24866                      3900                   155710
    policyCode            800000         -                         -                   159610
    n0                    782773     17227                      3485                   156125
    n1                    790500      9500                      2491                   157119
    n2                    789067     10933                      3205                   156405
    n3                    789067     10933                      3205                   156405
    n4                    788660     11340                      2476                   157134
    n5                    790355      9645                      1858                   157752
    n6                    786006     13994                      3182                   156428
    n7                    788430     11570                      2746                   156864
    n8                    789625     10375                      2131                   157479
    n9                    786384     13616                      3953                   155657
    n10                   788979     11021                      2639                   156971
    n11                   799434       566                       112                   159498
    n12                   797585      2415                       545                   159065
    n13                   788907     11093                      2482                   157128
    n14                   788884     11116                      3364                   156246

    For example, we can see that the split of the outliers between the two target classes roughly matches the overall split. What would it mean if the outliers all belonged to users with isDefault equal to 1?

    # Delete the outliers: a row is dropped if it is flagged in any numeric feature
    for fea in numerical_fea:
        data_train = data_train[data_train[fea + '_outliers'] == 'normal']
        data_train = data_train.reset_index(drop=True)

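    Outlier detection method 2: box plot

    In one sentence: the quartiles are three points that split the data into four intervals; IQR = Q3 - Q1, lower whisker = Q1 - 1.5 * IQR, upper whisker = Q3 + 1.5 * IQR. Values beyond the whiskers are treated as outliers.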
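The notes give no code for the box-plot method, so here is a minimal sketch of the rule just stated. The function name `find_outliers_by_iqr`, the multiplier parameter `k`, and the toy income values are my own assumptions for illustration.

```python
import pandas as pd

def find_outliers_by_iqr(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside the whiskers."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr   # lower whisker
    upper = q3 + k * iqr   # upper whisker
    return (values < lower) | (values > upper)

# Toy data with one obviously extreme value (illustrative only).
income = pd.Series([30, 35, 40, 42, 45, 50, 55, 60, 1000], dtype=float)
mask = find_outliers_by_iqr(income)
print(income[mask].tolist())
```

Unlike the 3-sigma rule, the quartiles are barely moved by the extreme value itself, so this rule tends to be more robust on skewed features.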

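    Data binning

    Purpose of feature binning: in terms of model performance, binning mainly reduces the complexity of a variable, lessens the impact of noise in the variable on the model, and strengthens the relationship between the independent variable and the dependent variable, which makes the model more stable.

    What gets binned: continuous variables are discretized, and discrete variables with many states are merged into fewer states.

    Why bin: the values of a feature may span a wide range. In both supervised and unsupervised methods, for example k-means clustering, which uses Euclidean distance as the similarity measure between data points, large values swamp small ones. One remedy is to quantize the values into intervals, which is data bucketing (also called data binning), and then use the quantized result.

    Advantages of binning:
    Handling missing values: when the data source may contain missing values, null can be given a bin of its own.
    Handling outliers: when the data contains outliers, discretizing them through binning makes the variable more robust (resistant to interference). For example, an abnormal age of 200 falls into the "age > 60" bin, which removes its influence.
    Business interpretability: we are used to judging a variable's effect linearly, expecting y to grow as x grows. In practice the relationship between x and y is often nonlinear, in which case a WOE transformation can be applied.

    Pay special attention to the basic principles of binning: (1) the smallest bin should hold no less than 5% of the samples; (2) a bin must not contain good customers only; (3) consecutive bins should be monotonic.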
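A small sketch of the three advantages listed above, on made-up data: missing ages get their own bin, the implausible age of 200 is absorbed by an open-ended last bin, and a WOE value is computed per bin. WOE sign conventions vary between references; this sketch assumes WOE = ln(share of all good customers in the bin / share of all bad customers in the bin), with isDefault = 1 counted as bad. The bin edges and labels are illustrative choices.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative only): two missing ages and one impossible age of 200.
df = pd.DataFrame({
    "age": [22, 25, 31, 38, 45, 52, 200, np.nan, 29, 41, 63, np.nan],
    "isDefault": [1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0],
})

# The open-ended last bin absorbs the implausible value 200.
edges = [0, 30, 60, np.inf]
labels = ["<=30", "30-60", ">60"]
binned = pd.cut(df["age"], bins=edges, labels=labels)
# Missing ages get a bin of their own.
df["age_bin"] = binned.cat.add_categories("missing").fillna("missing")

stats = df.groupby("age_bin", observed=True)["isDefault"].agg(bad="sum", total="count")
stats["good"] = stats["total"] - stats["bad"]
# WOE under the convention assumed above: ln(good share / bad share).
stats["woe"] = np.log((stats["good"] / stats["good"].sum()) / (stats["bad"] / stats["bad"].sum()))
print(stats)
```

A bin containing only good or only bad customers would make the ratio zero or infinite, which is one practical reason behind principle (2) above.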

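    1. Fixed-width binning

    When values span several orders of magnitude, it is best to group them by powers of 10 (or powers of any constant): 0~9, 10~99, 100~999, 1000~9999, and so on. Fixed-width binning is very easy to compute, but if there are large gaps in the values it produces many empty bins that contain no data.

    # Map to evenly spaced bins by integer division; each bin covers a range of 1000 in loanAmnt
    data['loanAmnt_bin1'] = np.floor_divide(data['loanAmnt'], 1000)

    # Map to exponentially wide bins with the logarithm
    data['loanAmnt_bin2'] = np.floor(np.log10(data['loanAmnt']))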

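    2. Quantile binning

    Reference: https://zhuanlan.zhihu.com/p/144234097?from_voters_page=true

    # Split loanAmnt into 10 bins of roughly equal size; labels=False returns the bin index
    data['loanAmnt_bin3'] = pd.qcut(data['loanAmnt'], 10, labels=False)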

    3. Chi-square binning and experiments with other binning methods

    This part is advanced material.
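Since the notes leave chi-square binning as an exercise, here is a rough sketch of the bottom-up ChiMerge idea: start with one bin per distinct value, then repeatedly merge the adjacent pair of bins whose class distributions are most alike (smallest chi-square statistic). It is simplified on purpose: it stops at a fixed number of bins (`max_bins`, my own parameter) instead of using a significance threshold, and the function names and toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def chi2_of_pair(counts: np.ndarray) -> float:
    """Chi-square statistic of a 2 x n_classes table (two adjacent bins)."""
    row_tot = counts.sum(axis=1, keepdims=True)
    col_tot = counts.sum(axis=0, keepdims=True)
    expected = row_tot * col_tot / counts.sum()
    # Cells with zero expected count contribute nothing.
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(expected > 0, (counts - expected) ** 2 / expected, 0.0)
    return float(terms.sum())

def chi_merge(x: pd.Series, y: pd.Series, max_bins: int = 3) -> list:
    """Return the lower edges of the bins left after bottom-up merging."""
    # One initial bin per distinct value: rows are values, columns are class counts.
    table = pd.crosstab(x, y).sort_index()
    edges = list(table.index)
    counts = table.to_numpy(dtype=float)
    while len(edges) > max_bins:
        chi = [chi2_of_pair(counts[i:i + 2]) for i in range(len(edges) - 1)]
        i = int(np.argmin(chi))       # most similar adjacent pair
        counts[i] += counts[i + 1]    # merge bin i+1 into bin i
        counts = np.delete(counts, i + 1, axis=0)
        del edges[i + 1]
    return edges

# Toy feature and binary label (illustrative only).
x = pd.Series([1, 1, 2, 2, 3, 3, 10, 10, 11, 11, 20, 20])
y = pd.Series([0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1])
edges = chi_merge(x, y, max_bins=3)
codes = pd.cut(x, bins=edges + [np.inf], right=False, labels=False)
print(edges)
print(codes.tolist())
```

On a feature with many distinct values, such as loanAmnt, it would probably be cheaper to start from quantile bins rather than from one bin per value.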

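    3.3.5 Feature interaction

    Interaction features are very simple to construct but expensive to use. If a linear model includes pairwise interaction features, its training time and scoring time grow from O(n) to O(n^2), where n is the number of single features.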
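To make the quadratic growth concrete, a minimal sketch that builds every pairwise product of a few numeric columns. The helper name `add_pairwise_products` and the toy values are assumptions for illustration; n columns produce n(n-1)/2 new columns.

```python
from itertools import combinations

import pandas as pd

def add_pairwise_products(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    """Return a copy of df with one product column per unordered pair of cols."""
    out = df.copy()
    for a, b in combinations(cols, 2):
        out[f"{a}_x_{b}"] = out[a] * out[b]
    return out

# Toy frame with 4 numeric features (illustrative values only).
toy = pd.DataFrame({
    "loanAmnt": [1000.0, 2000.0],
    "interestRate": [10.0, 12.5],
    "dti": [15.0, 20.0],
    "openAcc": [5.0, 8.0],
})
expanded = add_pairwise_products(toy, list(toy.columns))
print(expanded.shape[1] - toy.shape[1])  # number of new columns: 4 * 3 / 2
```

With the roughly 40 numeric features of this dataset the same call would add 780 columns, which is why interactions are usually limited to a hand-picked subset.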

    This part is so, so hard!!!!

    # Target-mean encoding: map each category to the mean of isDefault within that category
    for col in ['grade', 'subGrade']:
        temp_dict = data_train.groupby([col])['isDefault'].agg(['mean']).reset_index().rename(columns={'mean': col + '_target_mean'})
        temp_dict.index = temp_dict[col].values
        temp_dict = temp_dict[col + '_target_mean'].to_dict()

        data_train[col + '_target_mean'] = data_train[col].map(temp_dict)
        data_test_a[col + '_target_mean'] = data_test_a[col].map(temp_dict)

    # Other derived variables: grade relative to the mean and std of grade within each n-feature group
    for df in [data_train, data_test_a]:
        for item in ['n0', 'n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7', 'n8', 'n9', 'n10', 'n11', 'n12', 'n13', 'n14']:
            df['grade_to_mean_' + item] = df['grade'] / df.groupby([item])['grade'].transform('mean')
            df['grade_to_std_' + item] = df['grade'] / df.groupby([item])['grade'].transform('std')
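One thing to watch with the target-mean features above: each training row's own label contributes to the mean it is mapped to, so the feature can leak the target during cross-validation. A common remedy, which the original notes do not use, is out-of-fold encoding. The sketch below is my own illustration: the function name `oof_target_mean`, the fold construction with a NumPy permutation, and the fallback to the global mean for unseen categories are all assumptions.

```python
import numpy as np
import pandas as pd

def oof_target_mean(train, test, col, target, n_splits=5, seed=2020):
    """Encode col by the target mean, computing each training row's value
    from the other folds only; unseen categories fall back to the global mean."""
    prior = train[target].mean()
    train_enc = pd.Series(np.nan, index=train.index, dtype=float)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(train)), n_splits)
    for enc_idx in folds:
        fit_mask = np.ones(len(train), dtype=bool)
        fit_mask[enc_idx] = False
        fold_means = train[fit_mask].groupby(col)[target].mean()
        train_enc.iloc[enc_idx] = train[col].iloc[enc_idx].map(fold_means).to_numpy()
    # The test set has no labels, so it can use the means from the full training set.
    test_enc = test[col].map(train.groupby(col)[target].mean())
    return train_enc.fillna(prior), test_enc.fillna(prior)

# Toy data (illustrative only); grade "G" never appears in train.
train = pd.DataFrame({"grade": list("AABBCCAABB"), "isDefault": [0, 1, 0, 0, 1, 1, 0, 0, 1, 0]})
test = pd.DataFrame({"grade": ["A", "C", "G"]})
prior = train["isDefault"].mean()
train_enc, test_enc = oof_target_mean(train, test, "grade", "isDefault")
print(train_enc.round(3).tolist())
print(test_enc.round(3).tolist())
```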

    3.3.6 Feature encoding

    Label-encoded features can be fed directly into tree models

    # label-encode: subGrade, postCode, title
    # High-cardinality categorical features need to be converted
    for col in tqdm(['employmentTitle', 'postCode', 'title', 'subGrade']):
        le = LabelEncoder()
        # Fit on train and test together so that both share the same codes
        le.fit(list(data_train[col].astype(str).values) + list(data_test_a[col].astype(str).values))
        data_train[col] = le.transform(list(data_train[col].astype(str).values))
        data_test_a[col] = le.transform(list(data_test_a[col].astype(str).values))
    print('Label Encoding done')

    100%|██████████| 4/4 [00:05<00:00, 1.31s/it]
    Label Encoding done

    # le is the encoder of the last column in the loop (subGrade)
    a = le.inverse_transform(list(data_train[col]))
    a

    data_train[col]

    0         21
    1         16
    2         17
    3          3
    4         12
              ..
    612737    21
    612738    13
    612739    12
    612740     3
    612741     7
    Name: subGrade, Length: 612742, dtype: int64

    As a reminder of the LabelEncoder calls:

    le.fit(["paris", "paris", "tokyo", "amsterdam"])         # fit the encoder on the string labels
    le.classes_                                              # the learned classes
    le.transform(["paris", "paris", "tokyo", "amsterdam"])   # encode labels as consecutive integers
    le.inverse_transform([1, 1, 2, 0])                       # map the integer codes back to the original labels

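    Extra feature engineering needed for models such as logistic regression

    Normalize the features and remove highly correlated features.
    Normalization makes training converge better and faster and prevents features with large values from swamping those with small values.
    Removing correlated features makes the model easier to interpret and speeds up prediction.

    # Example of the normalization step (page 175 of the book)
    # Pseudocode: features_to_scale stands for the list of feature names to normalize
    for fea in features_to_scale:
        data[fea] = (data[fea] - np.min(data[fea])) / (np.max(data[fea]) - np.min(data[fea]))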
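A runnable version of the min-max idea, with one design choice that goes beyond the pseudocode: the minimum and maximum are learned from the training data only and then reused for the test data, so both are scaled consistently. The helper names `min_max_fit` and `min_max_apply` and the toy frames are hypothetical; sklearn's MinMaxScaler, imported at the top of these notes, follows the same fit-then-transform pattern.

```python
import pandas as pd

def min_max_fit(df: pd.DataFrame, cols: list):
    """Learn per-column minimum and maximum from the training data."""
    return df[cols].min(), df[cols].max()

def min_max_apply(df: pd.DataFrame, cols: list, lo: pd.Series, hi: pd.Series) -> pd.DataFrame:
    """Scale cols with previously learned bounds; constant columns map to 0."""
    out = df.copy()
    span = (hi - lo).replace(0, 1.0)  # avoid dividing by zero for constant columns
    out[cols] = (df[cols] - lo) / span
    return out

# Toy frames (illustrative values only); policyCode is constant, as in the dataset.
train = pd.DataFrame({"loanAmnt": [1000.0, 2000.0, 3000.0], "policyCode": [1.0, 1.0, 1.0]})
test = pd.DataFrame({"loanAmnt": [1500.0, 4000.0], "policyCode": [1.0, 1.0]})

cols = ["loanAmnt", "policyCode"]
lo, hi = min_max_fit(train, cols)
train_scaled = min_max_apply(train, cols, lo, hi)
test_scaled = min_max_apply(test, cols, lo, hi)
print(train_scaled["loanAmnt"].tolist())
print(test_scaled["loanAmnt"].tolist())
```

Test values outside the training range land outside [0, 1]; whether to clip them is a modelling decision.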

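    3.3.7 Feature selection

    Feature selection techniques prune useless features to reduce the complexity of the final model. The ultimate goal is a parsimonious model that computes faster with little or no loss in predictive accuracy. Feature selection is not about reducing training time (some techniques actually increase total training time); it is about reducing model scoring time.

    Feature selection methods:

    Filter
        Variance threshold
        Correlation coefficient (Pearson correlation coefficient)
        Chi-square test
        Mutual information
    Wrapper (RFE)
        Recursive feature elimination
    Embedded
        Penalty-based feature selection
        Tree-model-based feature selection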

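    Filter: select features by scoring each one, for example by its variance or by its relationship with the target

    In the snippets of this section, train is the feature matrix and target_train the label vector; neither is defined in these notes.

    Variance threshold: compute the variance of each feature, then keep the features whose variance exceeds a chosen threshold.

    from sklearn.feature_selection import VarianceThreshold
    # threshold is the variance cutoff
    VarianceThreshold(threshold=3).fit_transform(train, target_train)

    Correlation coefficient: Pearson correlation coefficient

    The Pearson correlation coefficient is one of the simplest ways to understand the relationship between a feature and the response variable. It measures the linear correlation between variables.
    The result lies in [-1, 1]: -1 means perfect negative correlation, +1 means perfect positive correlation, and 0 means no linear correlation.

    import numpy as np
    from sklearn.feature_selection import SelectKBest
    from scipy.stats import pearsonr
    # Select the k best features and return the data restricted to them.
    # The first argument is the scoring function: it takes the feature matrix and the target vector
    # and returns a pair (scores, p-values) whose i-th entries belong to the i-th feature.
    # Here it is defined to compute the correlation coefficient (ranked by absolute value).
    # Without a scoring function, SelectKBest falls back to its default, f_classif.
    def pearson_score(X, y):
        results = [pearsonr(x, y) for x in np.asarray(X).T]
        return np.array([abs(r) for r, _ in results]), np.array([p for _, p in results])
    # k is the number of features to select
    SelectKBest(pearson_score, k=5).fit_transform(train, target_train)

    Chi-square test: the classic chi-square test checks how strongly an independent variable is associated with the dependent variable.
    Suppose the independent variable takes N values and the dependent variable takes M values. Consider the gap between the observed and the expected number of samples for which the independent variable equals i and the dependent variable equals j.
    The statistic is $\chi^2 = \sum \frac{(A - T)^2}{T}$, where A is the observed value and T is the theoretical (expected) value.

    from sklearn.feature_selection import SelectKBest
    from sklearn.feature_selection import chi2
    # k is the number of features to select
    SelectKBest(chi2, k=5).fit_transform(train, target_train)

    Mutual information: classic mutual information also measures how strongly an independent variable is associated with the dependent variable.
    The SelectKBest class of the feature_selection module can be combined with the maximal information coefficient to select features.

    import numpy as np
    from sklearn.feature_selection import SelectKBest
    from minepy import MINE
    # MINE is not designed in a functional style, so wrap it in a function mic
    # that returns a pair whose second item is a fixed p-value of 0.5
    def mic(x, y):
        m = MINE()
        m.compute_score(x, y)
        return (m.mic(), 0.5)
    # Score every column and return (scores, p-values) as two arrays
    def mic_score(X, y):
        scores, pvalues = zip(*[mic(x, y) for x in np.asarray(X).T])
        return np.array(scores), np.array(pvalues)
    # k is the number of features to select
    SelectKBest(mic_score, k=2).fit_transform(train, target_train)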
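The Filter snippets above depend on an undefined `train` and `target_train`, so here is a self-contained sketch on synthetic data. Everything about the data is made up: one column is built from the label, two are noise, and one is constant. It swaps in sklearn's `mutual_info_classif` for the minepy-based MIC, because it ships with scikit-learn; that is a substitution on my part, not the method described in the notes.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, VarianceThreshold, chi2, mutual_info_classif

# Synthetic data (illustrative only).
rng = np.random.default_rng(0)
n = 500
y = rng.integers(0, 2, size=n)
X = np.column_stack([
    y * 3 + rng.integers(0, 2, size=n),  # informative and non-negative
    rng.integers(0, 5, size=n),          # noise
    np.ones(n),                          # constant, zero variance
    rng.integers(0, 5, size=n),          # noise
])

# Step 1: drop zero-variance columns.
var_sel = VarianceThreshold(threshold=0.0).fit(X)
X_kept = var_sel.transform(X)
print(var_sel.get_support())

# Step 2: keep the single best remaining column under two different scores.
chi_sel = SelectKBest(chi2, k=1).fit(X_kept, y)  # chi2 needs non-negative features
mi_sel = SelectKBest(partial(mutual_info_classif, random_state=0), k=1).fit(X_kept, y)
print(chi_sel.get_support())
print(mi_sel.get_support())
```

`get_support()` returns a boolean mask over the input columns, which is handy for mapping the selection back to column names of a DataFrame.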

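    Wrapper (Recursive feature elimination, RFE)

    Recursive feature elimination: a base model is trained for several rounds. After each round, the features with the smallest weight coefficients are eliminated, and the next round is trained on the reduced feature set.
    The RFE class of the feature_selection module can be used to select features. The code looks like this (using logistic regression as the example):

    from sklearn.feature_selection import RFE
    from sklearn.linear_model import LogisticRegression
    # Recursive feature elimination; returns the data restricted to the selected features
    # estimator is the base model
    # n_features_to_select is the number of features to keep
    RFE(estimator=LogisticRegression(), n_features_to_select=2).fit_transform(train, target_train)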
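To watch RFE work end to end, a small sketch on synthetic data in which only two of five columns drive the label. The data, the coefficients, and the settings `max_iter=1000` and `step=1` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data (illustrative only): columns 0 and 3 determine the label.
rng = np.random.default_rng(0)
n = 400
X = rng.normal(size=(n, 5))
logits = 3.0 * X[:, 0] - 3.0 * X[:, 3]
y = (logits + rng.normal(scale=0.5, size=n) > 0).astype(int)

# Remove one feature per round until two remain.
selector = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2, step=1)
selector.fit(X, y)
print(selector.support_)   # boolean mask of the kept features
print(selector.ranking_)   # 1 for kept features, larger numbers were dropped earlier
```

Because the base model is refitted in every round, RFE costs roughly one model fit per eliminated feature, which is the extra training time mentioned at the start of this section.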

    Embedded

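    Penalty-based feature selection: using a base model with a penalty term selects features and reduces dimensionality at the same time.
    The SelectFromModel class of the feature_selection module can be combined with a logistic regression model to select features. The code looks like this:

    from sklearn.feature_selection import SelectFromModel
    from sklearn.linear_model import LogisticRegression
    # Feature selection with L1-penalized logistic regression as the base model
    # (the default lbfgs solver does not support the L1 penalty, so liblinear is specified)
    SelectFromModel(LogisticRegression(penalty="l1", C=0.1, solver="liblinear")).fit_transform(train, target_train)

    Tree-model-based feature selection: among tree models, GBDT can also serve as the base model for feature selection.
    The SelectFromModel class of the feature_selection module can be combined with a GBDT model to select features. The code looks like this:

    from sklearn.feature_selection import SelectFromModel
    from sklearn.ensemble import GradientBoostingClassifier
    # Feature selection with GBDT as the base model
    SelectFromModel(GradientBoostingClassifier()).fit_transform(train, target_train)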

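    Data processing

    For this dataset we drop the features that do not go into the model, fill the missing values, look at the correlation between features by computing correlation coefficients, and then train the model.

    # Drop the columns we do not need
    # Reference: https://www.jb51.net/article/143040.htm
    for data in [data_train, data_test_a]:
        data.drop(['issueDate', 'id'], axis=1, inplace=True)

    # Fill downwards, replacing each missing value with the value above it
    # (fill data_train itself; after the loop above, data refers to data_test_a)
    # Reference: https://blog.csdn.net/lwgkzl/article/details/80948548
    data_train = data_train.ffill()

    x_train = data_train.drop(['isDefault'], axis=1)
    # Correlation of every feature with the target
    data_corr = x_train.corrwith(data_train.isDefault)
    # result = pd.DataFrame(columns=['features', 'corr'])
    # result['features'] = data_corr.index
    # result['corr'] = data_corr.values
    data_corr = data_corr.reset_index()
    data_corr.columns = ['features', 'corr']

    # Visualize the correlation between numeric features (numerical_fea[1:] skips the dropped id column)
    data_numeric = data_train[numerical_fea[1:]]
    correlation = data_numeric.corr()

    plt.figure(figsize=(7, 7))
    plt.title('Correlation of Numeric Features', y=1, size=16)
    sns.heatmap(correlation, square=True, vmax=0.8)
    plt.show()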
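The correlation matrix can also be put to work for the "remove highly correlated features" step mentioned earlier. A minimal sketch on made-up data: for every pair of features whose absolute correlation exceeds a threshold, the later column of the pair is marked for removal. The function name `drop_highly_correlated` and the threshold of 0.95 are my own assumptions.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df: pd.DataFrame, threshold: float = 0.95) -> list:
    """Return columns that correlate above threshold with an earlier column."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so that each pair is checked once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [c for c in upper.columns if (upper[c] > threshold).any()]

# Toy data (illustrative only): ficoRangeHigh is always ficoRangeLow + 4.
toy = pd.DataFrame({
    "ficoRangeLow": [660.0, 700.0, 720.0, 680.0, 750.0, 690.0],
    "ficoRangeHigh": [664.0, 704.0, 724.0, 684.0, 754.0, 694.0],
    "dti": [20.1, 5.3, 17.8, 9.9, 12.4, 30.2],
})
y = pd.Series([1, 0, 0, 1, 0, 1], name="isDefault")

to_drop = drop_highly_correlated(toy)
print(to_drop)
print(toy.drop(columns=to_drop).corrwith(y).round(3).to_dict())
```

In the outlier table earlier, ficoRangeLow and ficoRangeHigh show identical counts, which hints that the pair may be redundant in the real data as well.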

    features = [f for f in data_train.columns if f not in ['id', 'issueDate', 'isDefault'] and '_outliers' not in f]
    x_train = data_train[features]
    x_test = data_test_a[features]
    y_train = data_train['isDefault']

    def cv_model(clf, train_x, train_y, test_x, clf_name):
        folds = 5
        seed = 2020
        kf = KFold(n_splits=folds, shuffle=True, random_state=seed)

        train = np.zeros(train_x.shape[0])   # out-of-fold predictions on the training set
        test = np.zeros(test_x.shape[0])     # test predictions averaged over the folds

        cv_scores = []

        for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
            print('************************************ {} ************************************'.format(str(i+1)))
            trn_x, trn_y, val_x, val_y = train_x.iloc[train_index], train_y[train_index], train_x.iloc[valid_index], train_y[valid_index]

            if clf_name == "lgb":
                train_matrix = clf.Dataset(trn_x, label=trn_y)
                valid_matrix = clf.Dataset(val_x, label=val_y)

                params = {
                    'boosting_type': 'gbdt',
                    'objective': 'binary',
                    'metric': 'auc',
                    'min_child_weight': 5,
                    'num_leaves': 2 ** 5,
                    'lambda_l2': 10,
                    'feature_fraction': 0.8,
                    'bagging_fraction': 0.8,
                    'bagging_freq': 4,
                    'learning_rate': 0.1,
                    'seed': 2020,
                    'nthread': 28,
                    'n_jobs': 24,
                    'silent': True,
                    'verbose': -1,
                }

                # verbose_eval / early_stopping_rounds are arguments of older LightGBM releases;
                # newer releases expect callbacks instead.
                model = clf.train(params, train_matrix, 50000, valid_sets=[train_matrix, valid_matrix], verbose_eval=200, early_stopping_rounds=200)
                val_pred = model.predict(val_x, num_iteration=model.best_iteration)
                test_pred = model.predict(test_x, num_iteration=model.best_iteration)

                # print(list(sorted(zip(features, model.feature_importance("gain")), key=lambda x: x[1], reverse=True))[:20])

            if clf_name == "xgb":
                train_matrix = clf.DMatrix(trn_x, label=trn_y)
                valid_matrix = clf.DMatrix(val_x, label=val_y)
                test_matrix = clf.DMatrix(test_x)   # Booster.predict needs a DMatrix, not a DataFrame

                params = {'booster': 'gbtree',
                          'objective': 'binary:logistic',
                          'eval_metric': 'auc',
                          'gamma': 1,
                          'min_child_weight': 1.5,
                          'max_depth': 5,
                          'lambda': 10,
                          'subsample': 0.7,
                          'colsample_bytree': 0.7,
                          'colsample_bylevel': 0.7,
                          'eta': 0.04,
                          'tree_method': 'exact',
                          'seed': 2020,
                          'nthread': 36,
                          "silent": True,
                          }

                watchlist = [(train_matrix, 'train'), (valid_matrix, 'eval')]

                # ntree_limit / best_ntree_limit belong to older XGBoost releases.
                model = clf.train(params, train_matrix, num_boost_round=50000, evals=watchlist, verbose_eval=200, early_stopping_rounds=200)
                val_pred = model.predict(valid_matrix, ntree_limit=model.best_ntree_limit)
                test_pred = model.predict(test_matrix, ntree_limit=model.best_ntree_limit)

            if clf_name == "cat":
                params = {'learning_rate': 0.05, 'depth': 5, 'l2_leaf_reg': 10, 'bootstrap_type': 'Bernoulli',
                          'od_type': 'Iter', 'od_wait': 50, 'random_seed': 11, 'allow_writing_files': False}

                model = clf(iterations=20000, **params)
                model.fit(trn_x, trn_y, eval_set=(val_x, val_y),
                          cat_features=[], use_best_model=True, verbose=500)

                val_pred = model.predict(val_x)
                test_pred = model.predict(test_x)

            train[valid_index] = val_pred
            test += test_pred / kf.n_splits   # accumulate the fold average (the original overwrote test on every fold)
            cv_scores.append(roc_auc_score(val_y, val_pred))

            print(cv_scores)

        print("%s_scotrainre_list:" % clf_name, cv_scores)
        print("%s_score_mean:" % clf_name, np.mean(cv_scores))
        print("%s_score_std:" % clf_name, np.std(cv_scores))
        return train, test

    def lgb_model(x_train, y_train, x_test):
        lgb_train, lgb_test = cv_model(lgb, x_train, y_train, x_test, "lgb")
        return lgb_train, lgb_test

    def xgb_model(x_train, y_train, x_test):
        xgb_train, xgb_test = cv_model(xgb, x_train, y_train, x_test, "xgb")
        return xgb_train, xgb_test

    def cat_model(x_train, y_train, x_test):
        cat_train, cat_test = cv_model(CatBoostRegressor, x_train, y_train, x_test, "cat")
        return cat_train, cat_test

    lgb_train, lgb_test = lgb_model(x_train, y_train, x_test)

    ************************************ 1 ************************************
    [LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
    [LightGBM] [Warning] Unknown parameter: silent
    Training until validation scores don't improve for 200 rounds
    [200]   training's auc: 0.74923    valid_1's auc: 0.729754
    [400]   training's auc: 0.76495    valid_1's auc: 0.730429
    Early stopping, best iteration is:
    [387]   training's auc: 0.763868   valid_1's auc: 0.730485
    [0.7304850956718165]
    ************************************ 2 ************************************
    [LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
    [LightGBM] [Warning] Unknown parameter: silent
    Training until validation scores don't improve for 200 rounds
    [200]   training's auc: 0.749181   valid_1's auc: 0.731532
    [400]   training's auc: 0.764625   valid_1's auc: 0.731788
    Early stopping, best iteration is:
    [325]   training's auc: 0.759049   valid_1's auc: 0.731971
    [0.7304850956718165, 0.73197059577665]
    ************************************ 3 ************************************
    [LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
    [LightGBM] [Warning] Unknown parameter: silent
    Training until validation scores don't improve for 200 rounds
    [200]   training's auc: 0.74861    valid_1's auc: 0.732765
    [400]   training's auc: 0.764095   valid_1's auc: 0.733776
    [600]   training's auc: 0.77777    valid_1's auc: 0.733716
    Early stopping, best iteration is:
    [459]   training's auc: 0.768269   valid_1's auc: 0.733943
    [0.7304850956718165, 0.73197059577665, 0.7339429680725573]
    ************************************ 4 ************************************
    [LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
    [LightGBM] [Warning] Unknown parameter: silent
    Training until validation scores don't improve for 200 rounds
    [200]   training's auc: 0.749754   valid_1's auc: 0.728774
    [400]   training's auc: 0.765111   valid_1's auc: 0.729632
    [600]   training's auc: 0.77826    valid_1's auc: 0.729179
    Early stopping, best iteration is:
    [401]   training's auc: 0.765161   valid_1's auc: 0.729646
    [0.7304850956718165, 0.73197059577665, 0.7339429680725573, 0.7296461716819063]
    ************************************ 5 ************************************
    [LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
    [LightGBM] [Warning] Unknown parameter: silent
    Training until validation scores don't improve for 200 rounds
    [200]   training's auc: 0.74861    valid_1's auc: 0.732897
    [400]   training's auc: 0.764348   valid_1's auc: 0.733471
    [600]   training's auc: 0.778164   valid_1's auc: 0.733496
    Early stopping, best iteration is:
    [475]   training's auc: 0.769796   valid_1's auc: 0.733656
    [0.7304850956718165, 0.73197059577665, 0.7339429680725573, 0.7296461716819063, 0.7336557307068068]
    lgb_scotrainre_list: [0.7304850956718165, 0.73197059577665, 0.7339429680725573, 0.7296461716819063, 0.7336557307068068]
    lgb_score_mean: 0.7319401123819473
    lgb_score_std: 0.0016932184710438153

    # Save the processed data
    data_train.to_csv('train_data_v1.csv', index=False)
    data_test_a.to_csv('test_data_v1.csv', index=False)
    data_train.to_csv('data_for_model.csv', index=None)