导读:本文使用泰坦尼克数据集,完整介绍数据清洗 7 步的具体操作过程。
作者:zglg
来源:Python与算法社区(ID:alg-channel)
数据清洗 (data cleaning) 是机器学习和深度学习进入算法步前的一项重要任务,我平时比较习惯使用的 7 个步骤,总结如下:
Step1 : read csv
Step2 : preview data
Step3: check null value for every column
Step4: complete null value
Step5: feature engineering
Step 5.1: delete some features
Step 5.2: create new feature
Step6: encode for categories columns
Step 6.1: Sklearn LabelEncode
Step 6.2: Pandas get_dummies
Step 7: check for data cleaning
这不废话吗,第一步就是读入数据。
data_raw = pd.read_csv('../input/titanicdataset-traincsv/train.csv') data_raw结果:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked 0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S 1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C 2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S 4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S ... ... ... ... ... ... ... ... ... ... ... ... ... 886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S 887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S 888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S 889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C 890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q 891 rows × 12 columns结果:
<class 'pandas.core.frame.DataFrame'> RangeIndex:891 entries, 0 to 890 Data columns (total 12 columns): PassengerId 891 non-null int64 Survived 891 non-null int64 Pclass 891 non-null int64 Name 891 non-null object Sex 891 non-null object Age 714 non-null float64 SibSp 891 non-null int64 Parch 891 non-null int64 Ticket 891 non-null object Fare 891 non-null float64 Cabin 204 non-null object Embarked 889 non-null object dtypes: float64(2), int64(5), object(5) memory usage: 83.7+ KB PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked count 891.000000891.000000891.000000891891714.000000891.000000891.000000891891.000000204889 unique NaN NaN NaN 8912 NaN NaN NaN 681 NaN 1473 top NaN NaN NaN Hakkarainen, Mr. Pekka Pietari male NaN NaN NaN 1601 NaN G6 S freq NaN NaN NaN 1577 NaN NaN NaN 7 NaN 4644 mean 446.0000000.3838382.308642 NaN NaN 29.6991180.5230080.381594 NaN 32.204208 NaN NaN std 257.3538420.4865920.836071 NaN NaN 14.5264971.1027430.806057 NaN 49.693429 NaN NaN min 1.0000000.0000001.000000 NaN NaN 0.4200000.0000000.000000 NaN 0.000000 NaN NaN 25% 223.5000000.0000002.000000 NaN NaN 20.1250000.0000000.000000 NaN 7.910400 NaN NaN 50% 446.0000000.0000003.000000 NaN NaN 28.0000000.0000000.000000 NaN 14.454200 NaN NaN 75% 668.5000001.0000003.000000 NaN NaN 38.0000001.0000000.000000 NaN 31.000000 NaN NaN max 891.0000001.0000003.000000 NaN NaN 80.0000008.0000006.000000 NaN 512.329200 NaN N结果:
PassengerId 0 Survived 0 Pclass 0 Name 0 Sex 0 Age 177 SibSp 0 Parch 0 Ticket 0 Fare 0 Cabin 687 Embarked 2 dtype: int64Age 列 177 个空值,Cabin 687 个空值,一共才 891 行,估计没啥价值了!Embarked 2 个。
补全操作check:
PassengerId 0 Survived 0 Pclass 0 Name 0 Sex 0 Age 0 SibSp 0 Parch 0 Ticket 0 Fare 0 Cabin 687 Embarked 0 dtype: int64增加一列 FamilySize
data1['FamilySize'] = data1 ['SibSp'] + data1['Parch'] + 1 data1打印结果:
Survived Pclass Name Sex Age SibSp Parch Fare Embarked FamilySize 003 Braund, Mr. Owen Harris male 22.0107.2500 S 2 111 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.01071.2833 C 2 213 Heikkinen, Miss. Laina female 26.0007.9250 S 1 311 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.01053.1000 S 2 403 Allen, Mr. William Henry male 35.0008.0500 S 1 ... ... ... ... ... ... ... ... ... ... ... 88602 Montvila, Rev. Juozas male 27.00013.0000 S 1 88711 Graham, Miss. Margaret Edith female 19.00030.0000 S 1 88803 Johnston, Miss. Catherine Helen "Carrie" female 28.01223.4500 S 4 88911 Behr, Mr. Karl Howell male 26.00030.0000 C 1 89003 Dooley, Mr. Patrick male 32.0007.7500 Q 1 891 rows × 10 columns再创建一列:
data1['IsAlone'] = np.where(data1['FamilySize'] > 1,0,1)再创建一列:
data1['Title'] = data1['Name'].str.split(", ", expand=True)[1].str.split(".", expand=True)[0] data1结果:
Survived Pclass Name Sex Age SibSp Parch Fare Embarked FamilySize IsAlone Title 003 Braund, Mr. Owen Harris male 22.0107.2500 S 20 Mr 111 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.01071.2833 C 20 Mrs 213 Heikkinen, Miss. Laina female 26.0007.9250 S 11 Miss 311 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.01053.1000 S 20 Mrs 403 Allen, Mr. William Henry male 35.0008.0500 S 11 Mr ... ... ... ... ... ... ... ... ... ... ... ... ... 88602 Montvila, Rev. Juozas male 27.00013.0000 S 11 Rev 88711 Graham, Miss. Margaret Edith female 19.00030.0000 S 11 Miss 88803 Johnston, Miss. Catherine Helen "Carrie" female 28.01223.4500 S 40 Miss 88911 Behr, Mr. Karl Howell male 26.00030.0000 C 11 Mr 89003 Dooley, Mr. Patrick male 32.0007.7500 Q 11 Mr 891 rows × 12 columns结果:
Survived Pclass Name Sex Age SibSp Parch Fare Embarked FamilySize IsAlone Title FareCut AgeCut 0 0 3 Braund, Mr. Owen Harris male 22.0 1 0 7.2500 S 2 0 Mr (-0.001, 7.91] (13.333, 26.667] 1 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 71.2833 C 2 0 Mrs (31.0, 512.329] (26.667, 40.0] 2 1 3 Heikkinen, Miss. Laina female 26.0 0 0 7.9250 S 1 1 Miss (7.91, 14.454] (13.333, 26.667] 3 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 53.1000 S 2 0 Mrs (31.0, 512.329] (26.667, 40.0] 4 0 3 Allen, Mr. William Henry male 35.0 0 0 8.0500 S 1 1 Mr (7.91, 14.454] (26.667, 40.0] ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 886 0 2 Montvila, Rev. Juozas male 27.0 0 0 13.0000 S 1 1 Rev (7.91, 14.454] (26.667, 40.0] 887 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 30.0000 S 1 1 Miss (14.454, 31.0] (13.333, 26.667] 888 0 3 Johnston, Miss. Catherine Helen "Carrie" female 28.0 1 2 23.4500 S 4 0 Miss (14.454, 31.0] (26.667, 40.0] 889 1 1 Behr, Mr. Karl Howell male 26.0 0 0 30.0000 C 1 1 Mr (14.454, 31.0] (13.333, 26.667] 890 0 3 Dooley, Mr. Patrick male 32.0 0 0 7.7500 Q 1 1 Mr (-0.001, 7.91] (26.667, 40.0] 891 rows × 14 columns使用 Sklearn 的 LabelEncoder
from sklearn.preprocessing import LabelEncoder label = LabelEncoder() data1['Sex_Code'] = label.fit_transform(data1['Sex']) data1['Embarked_Code'] = label.fit_transform(data1['Embarked']) data1['Title_Code'] = label.fit_transform(data1['Title']) data1['AgeBin_Code'] = label.fit_transform(data1['AgeCut']) data1['FareBin_Code'] = label.fit_transform(data1['FareCut']) data1结果 data1 选取某些列,算法模型终于能认出它们了,多不容易!
get_dummies 将长 DataFrame 变为宽 DataFrame:
pd.get_dummies(data1['Sex'])结果:
female male 0 0 1 1 1 0 2 1 0 3 1 0 4 0 1 ... ... ... 886 0 1 887 1 0 888 1 0 889 0 1 890 0 1 891 rows × 2 columns而 LabelEncoder 编码后,仅仅是把 Female 编码为 0, male 编码为 1.
label.fit_transform(data1['Sex']) 0 1 1 0 2 0 3 0 4 1 .. 886 1 887 0 888 0 889 1 890 1 Name: Sex_Code, Length: 891, dtype: int64结果:
<class 'pandas.core.frame.DataFrame'> RangeIndex:891 entries, 0 to 890 Data columns (total 8 columns): Sex_Code 891 non-null int64 Pclass 891 non-null int64 Embarked_Code 891 non-null int64 Title_Code 891 non-null int64 SibSp 891 non-null int64 Parch 891 non-null int64 Age 891 non-null float64 Fare 891 non-null float64 dtypes: float64(2), int64(6) memory usage: 55.8 KB -------------------------------------------------- <class 'pandas.core.frame.DataFrame'> RangeIndex:891 entries, 0 to 890 Data columns (total 29 columns): Pclass 891 non-null int64 SibSp 891 non-null int64 Parch 891 non-null int64 Age 891 non-null float64 Fare 891 non-null float64 FamilySize 891 non-null int64 IsAlone 891 non-null int64 Sex_female 891 non-null uint8 Sex_male 891 non-null uint8 Embarked_C 891 non-null uint8 Embarked_Q 891 non-null uint8 Embarked_S 891 non-null uint8 Title_Capt 891 non-null uint8 Title_Col 891 non-null uint8 Title_Don 891 non-null uint8 Title_Dr 891 non-null uint8 Title_Jonkheer 891 non-null uint8 Title_Lady 891 non-null uint8 Title_Major 891 non-null uint8 Title_Master 891 non-null uint8 Title_Miss 891 non-null uint8 Title_Mlle 891 non-null uint8 Title_Mme 891 non-null uint8 Title_Mr 891 non-null uint8 Title_Mrs 891 non-null uint8 Title_Ms 891 non-null uint8 Title_Rev 891 non-null uint8 Title_Sir 891 non-null uint8 Title_the Countess 891 non-null uint8 dtypes: float64(2), int64(5), uint8(22) memory usage: 68.0 KBGreat !
Done~
划重点????
干货直达????
数据分析必读干货:简单而实用的3大分析方法
国资委发文!10本书讲透数字化时代新机遇
豆瓣8.5以上!这10本书,值得每个人读一遍
6大准则+10道习题,终于有人把怎样选择图表讲明白了
更多精彩????
在公众号对话框输入以下关键词
查看更多优质内容!
PPT | 读书 | 书单 | 硬核 | 干货 | 讲明白 | 神操作
大数据 | 云计算 | 数据库 | Python | 可视化
AI | 人工智能 | 机器学习 | 深度学习 | NLP
5G | 中台 | 用户画像 | 1024 | 数学 | 算法 | 数字孪生
据统计,99%的大咖都完成了这个神操作
????