⼀、透视表
piovt(功能较少)
展开
df
=pd
.read_csv
(r
'data\table.csv')
df
.pivot
(index
='ID',columns
='Gender',values
='Height').head
()
Gender F M
ID
1101 NaN
173.0
1102 192.0 NaN
1103 NaN
186.0
1104 167.0 NaN
1105 159.0 NaN
pivot_table
功能更多,速度更慢
pd
.pivot_table
(df
,index
='ID',columns
='Gender',values
='Height').head
()
Gender F M
ID
1101 NaN
173.0
1102 192.0 NaN
1103 NaN
186.0
1104 167.0 NaN
1105 159.0 NaN
1.aggfunc
对组内进行聚合统计
pd
.pivot_table
(df
,index
='School',columns
='Gender',values
='Height',aggfunc
=['mean','sum']).head
()
mean
sum
Gender F M F M
School
S_1
173.125000 178.714286 1385 1251
S_2
173.727273 172.000000 1911 1548
2.margins
汇总边际状态
pd
.pivot_table
(df
,index
='School',columns
='Gender',values
='Height',aggfunc
=['mean','sum'],margins
=True).head
()
mean
sum
Gender F M All F M All
School
S_1
173.125000 178.714286 175.733333 1385 1251 2636
S_2
173.727273 172.000000 172.950000 1911 1548 3459
All
173.473684 174.937500 174.142857 3296 2799 6095
3.行、列、值都可为多级索引
pd
.pivot_table
(df
,index
=['School','Class'],
columns
=['Gender','Address'],
values
=['Height','Weight'])
Height
... Weight
Gender F M
... F M
Address street_1 street_2 street_4 street_5 street_6 street_7 street_1 street_2
... street_6 street_7 street_1 street_2 street_4 street_5 street_6 street_7
School Class
...
S_1 C_1 NaN
179.5 159.0 NaN NaN NaN
173.0 186.0 ... NaN NaN
63.0 82.0 NaN NaN NaN NaN
C_2 NaN NaN
176.0 162.0 167.0 NaN NaN NaN
... 63.0 NaN NaN NaN NaN
68.0 53.0 NaN
C_3
175.0 NaN NaN
187.0 NaN NaN NaN
195.0 ... NaN NaN NaN
70.0 68.0 NaN NaN
82.0
S_2 C_1 NaN NaN NaN
159.0 161.0 NaN NaN NaN
... 61.0 NaN NaN NaN
71.0 NaN NaN
84.0
C_2 NaN NaN NaN NaN NaN
188.5 175.0 NaN
... NaN
76.5 74.0 NaN
91.0 100.0 NaN NaN
C_3 NaN NaN
157.0 NaN
164.0 190.0 NaN NaN
... 81.0 99.0 NaN NaN
73.0 88.0 NaN NaN
C_4 NaN
176.0 NaN NaN
175.5 NaN NaN NaN
... 57.0 NaN NaN NaN NaN NaN NaN
82.0
[7 rows x
24 columns
]
crosstab(交叉表)
典型用途分组统计(暂时不支持分组统计)
pd
.crosstab
(index
=df
['Address'],columns
=df
['Gender'])
Gender F M
Address
street_1
1 2
street_2
4 2
street_4
3 5
street_5
3 3
street_6
5 1
street_7
3 3
1.values和aggfunc
分组对数据进行聚合操作,这两个参数必须成对出现
pd
.crosstab
(index
=df
['Address'],columns
=df
['Gender'],values
=1,aggfunc
='count')
Gender F M
Address
street_1
1 2
street_2
4 2
street_4
3 5
street_5
3 3
street_6
5 1
street_7
3 3
二、其他变形方法
melt
将展开的数据压缩
pivoted
=df
.pivot
(index
='ID',columns
='Gender',values
='Math').head
()
stacked
=pivoted
.reset_index
().melt
(id_vars
=['ID'],value_vars
=['F','M'],value_name
='Math').dropna
().set_index
('ID').sort_index
()
Gender Math
ID
1101 M
34.0
1102 F
32.5
1103 M
87.2
1104 F
80.4
1105 F
84.8
压缩与展开
1.stack
压缩,两个参数level和dropna 可看做将横向的索引放到纵向,类似melt,参数level可指定变化的列索引是哪⼀层
df_s
= pd
.pivot_table
(df
,index
=['Class','ID'],columns
='Gender',values
=['Height','Weight'])
df_stacked
= df_s
.stack
()
df_stacked
.groupby
('Class').head
(2)
Height Weight
Class
ID Gender
C_1
1101 M
173.0 63.0
1102 F
192.0 73.0
C_2
1201 M
188.0 68.0
1202 F
176.0 94.0
C_3
1301 M
161.0 68.0
1302 F
175.0 57.0
C_4
2401 F
192.0 62.0
2402 M
166.0 82.0
2. unstack
功能类似pivot_table
df_stacked
.unstack
()
df_stacked
.unstack
().equals
(df_s
)
True
三、哑变量与因子化
1. Dummy Variable(哑变量)
df_d
= df
[['Class','Gender','Weight']]
将上面的表格前两列转为哑变量
pd
.get_dummies
(df_d
[['Class','Gender']]).head
()
Class_C_1 Class_C_2 Class_C_3 Class_C_4 Gender_F Gender_M
0 1 0 0 0 0 1
1 1 0 0 0 1 0
2 1 0 0 0 0 1
3 1 0 0 0 1 0
4 1 0 0 0 1 0
2. factorize方法
用于自然数编码并且缺失值会被记做-1,其中sort参数表示是否排序后赋值
codes
, uniques
= pd
.factorize
(['b', None, 'a', 'c', 'b'], sort
=True)
display
(codes
)
display
(uniques
)
[ 1 -1 0 2 1]
['a' 'b' 'c']
练习一
a
res
=pd
.pivot_table
(df
,index
=['State','COUNTY','SubstanceName'],columns
='YYYY',values
='DrugReports',fill_value
='-')
YYYY
2010 2011 2012 2013 2014 2015 2016 2017
State COUNTY SubstanceName
KY ADAIR Buprenorphine
- 3 5 4 27 5 7 10
Codeine
- - 1 - - - - 1
Fentanyl
- - 1 - - - - -
Heroin
- - 1 2 - 1 - 2
Hydrocodone
6 9 10 10 9 7 11 3
res
.reset_index
().rename_axis
(columns
={'YYYY':''})
State COUNTY SubstanceName
2010 2011 2012 2013 2014 2015 2016 2017
0 KY ADAIR Buprenorphine
- 3 5 4 27 5 7 10
1 KY ADAIR Codeine
- - 1 - - - - 1
2 KY ADAIR Fentanyl
- - 1 - - - - -
3 KY ADAIR Heroin
- - 1 2 - 1 - 2
4 KY ADAIR Hydrocodone
6 9 10 10 9 7 11 3
b
melted
= result
.melt
(id_vars
=result
.columns
[:3],value_vars
=result
.columns
[-8:],var_name
='YYYY',value_name
='DrugReports').query
('DrugReports != "-"')
res
= melted
.sort_values
(by
=['State','COUNTY','YYYY','SubstanceName']).reset_index
().drop
(columns
='index')
State COUNTY SubstanceName YYYY DrugReports
0 KY ADAIR Hydrocodone
2010 6
1 KY ADAIR Methadone
2010 1
2 KY ADAIR Buprenorphine
2011 3
3 KY ADAIR Hydrocodone
2011 9
4 KY ADAIR Morphine
2011 2
练习二
a
pd
.pivot_table
(df
,index
=['日期','时间','维度','经度'],columns
='方向',values
=['烈度','深度','距离'],fill_value
='-').stack
(level
=0).rename_axis
(index
={None:'地震参数'})
方向 east north north_east north_west south south_east south_west west
日期 时间 维度 经度 参数
1912.08.09 12:29:00 AM
40.6 27.2 深度
- - - - - 16 - -
烈度
- - - - - 6.7 - -
距离
- - - - - 4.3 - -
1912.08.10 12:23:00 AM
40.6 27.1 深度
- - - - - - 15 -
烈度
- - - - - - 6 -
b
df_result
= res
.unstack
().stack
(0)[(~(res
.unstack
().stack
(0)=='-')).any(1)].reset_index
()
df_result
.columns
.name
=None
df_result
= df_result
.sort_index
(axis
=1).astype
({'深度':'float64','烈度':'float64','距离':'float64'})
方向 日期 时间 深度 烈度 经度 维度 距离
0 south_east
1912.08.09 12:29:00 AM
16.0 6.7 27.2 40.6 4.3
1 south_west
1912.08.10 12:23:00 AM
15.0 6.0 27.1 40.6 2.0
2 south_west
1912.08.10 12:30:00 AM
15.0 5.2 27.1 40.6 2.0
3 south_east
1912.08.11 12:19:04 AM
30.0 4.9 27.2 40.6 4.3
4 south_west
1912.08.11 12:20:00 AM
15.0 4.5 27.1 40.6 2.0