pd.concat(obj,axis = 0, join = ‘outer’) axis = 0 列, axis = 1 行 concat函数可按照其他轴的逻辑关系进行合并,默认join=‘outer’,还有一个可取的值是’inner’
字段的含义
cand_nm – 接受捐赠的候选人姓名 contbr_nm – 捐赠人姓名 contbr_st – 捐赠人所在州 contbr_employer – 捐赠人所在公司 contbr_occupation – 捐赠人职业 contb_receipt_amt – 捐赠数额(美元) contb_receipt_dt – 收到捐款的日期
<class ‘pandas.core.frame.DataFrame’> Int64Index: 1001732 entries, 0 to 1730 Data columns (total 7 columns): cand_nm 1001732 non-null object contbr_nm 1001732 non-null object contbr_st 1001728 non-null object contbr_employer 988003 non-null object contbr_occupation 993302 non-null object contb_receipt_amt 1001732 non-null float64 contb_receipt_dt 1001732 non-null object dtypes: float64(1), object(6) memory usage: 61.1+ MB
#查看数值类型字段的行数、均值、标准差、最小值、分位数、最大值 data.describe()– contb_receipt_amt count 1.001732e+06 mean 2.982358e+02 std 3.749665e+03 min -3.080000e+04 25% 3.500000e+01 50% 1.000000e+02 75% 2.500000e+02 max 2.014491e+06
<class ‘pandas.core.frame.DataFrame’> Int64Index: 1001732 entries, 0 to 1730 Data columns (total 7 columns): cand_nm 1001732 non-null object contbr_nm 1001732 non-null object contbr_st 1001728 non-null object contbr_employer 1001732 non-null object contbr_occupation 1001732 non-null object contb_receipt_amt 1001732 non-null float64 contb_receipt_dt 1001732 non-null object dtypes: float64(1), object(6) memory usage: 61.1+ MB
空值填充函数 fillna(method,inplace,limit ,axis)
inplace 默认为Falsemethod:{pad, fiill, backfill, bfill None,dict} pad / ffill - 用前一个非缺失元素填充空值 :df.fillna(method=‘ffill’) backfill / bfil - 用后一个非缺失元素填充空值 :df.fillna(method=‘bfill’) None - 不指定任何参数: df.fillna(100) dict - 字典填充(指定对应列空值填充的内容):df.fiillna({0:10,1:20})limit 指定每列空值填充元素个数: df.fillna(method=‘bfill’,limit = 2)axis 指定填充方向 axis = 0 列, axis = 1 行美国大选一般是民主党和共和党之争,虽然数据中没有党派这个字段,但是通过候选人名称即cand_nm,可以得到对应的党派信息, 利用字典映射进行数据转换
#利用字典映射进行转换:党派分析 data['cand_nm'].unique() parties = {'Bachmann, Michelle': 'Republican', 'Cain, Herman': 'Republican', 'Gingrich, Newt': 'Republican', 'Huntsman, Jon': 'Republican', 'Johnson, Gary Earl': 'Republican', 'McCotter, Thaddeus G': 'Republican', 'Obama, Barack': 'Democrat', 'Paul, Ron': 'Republican', 'Pawlenty, Timothy': 'Republican', 'Perry, Rick': 'Republican', "Roemer, Charles E. 'Buddy' III": 'Republican', 'Romney, Mitt': 'Republican', 'Santorum, Rick': 'Republican'} # 增加一列party存储党派信息 data['party'] = data['cand_nm'].map(parties) data['party'].value_counts()Out[20]: array([‘Bachmann, Michelle’, ‘Romney, Mitt’, ‘Obama, Barack’, “Roemer, Charles E. ‘Buddy’ III”, ‘Pawlenty, Timothy’, ‘Johnson, Gary Earl’, ‘Paul, Ron’, ‘Santorum, Rick’, ‘Cain, Herman’, ‘Gingrich, Newt’, ‘McCotter, Thaddeus G’, ‘Huntsman, Jon’, ‘Perry, Rick’], dtype=object)
Out[21]: Democrat 593747 Republican 407985 Name: party, dtype: int64
groupby groupby函数主要的作用是进行数据的分组以及分组后地组内运算 1.df.groupby([df[col1],df[col2])…,df[coln]).func() eg. df.goupby(‘性别’)[‘身高’].describe().unstack() **根据性别分组后对身高进行统计 **unstack() – 索引重排 链接: https://www.cnblogs.com/Yanjy-OnlyOne/p/11217802.html)
cand_nm Obama, Barack 1.358776e+08 Romney, Mitt 8.833591e+07 Paul, Ron 2.100962e+07 Perry, Rick 2.030675e+07 Gingrich, Newt 1.283277e+07 Santorum, Rick 1.104316e+07 Cain, Herman 7.101082e+06 Pawlenty, Timothy 6.004819e+06 Huntsman, Jon 3.330373e+06 Bachmann, Michelle 2.711189e+06 Johnson, Gary Earl 5.669616e+05 Roemer, Charles E. ‘Buddy’ III 3.730099e+05 McCotter, Thaddeus G 3.903000e+04 Name: contb_receipt_amt, dtype: float64
仅关注Obama, Barack,Romney, Mitt
#候选人筛选 data_vs = data[data['cand_nm'].isin(['Obama, Barack','Romney, Mitt'])].copy()利用cut函数根据出资额大小将数据离散化到多个面元中
bins = np.array([0,1,10,100,1000,10000,100000,1000000,10000000]) labels = pd.cut(data_vs['contb_receipt_amt'],bins) labels410 (10, 100] 411 (100, 1000] 412 (100, 1000] 413 (10, 100] 414 (10, 100] 415 (10, 100] 416 (100, 1000] 417 (10, 100] 418 (100, 1000] 419 (10, 100] 420 (10, 100] 421 (100, 1000] … 201383 (1, 10] 201384 (10, 100] 201385 (100, 1000]
通过pivot_table根据党派和职业对数据进行聚合,然后过滤掉总出资不足200万美元的数据
#按照党派、职业对赞助金额进行汇总,类似excel中的透视表操作,聚合函数为sum by_occp = data.pivot_table('contb_receipt_amt',index = 'contbr_occupation',columns = 'party',aggfunc = 'sum') over_2mm = by_occp[by_occp.sum(1)>200] pritny(over_2mm) over_2mm.plot(kind = 'bar')pivot_table 的用法 pivot_table(data, values=None, index=None, columns=None,aggfunc=‘mean’, fill_value=None, margins=False, dropna=True, margins_name=‘All’) 1.Values可以对需要的计算数据进行筛选,即计算的对象 2.index为行索引,可有多个:pd.pivot_table(df,index=[u’对手’,u’主客场’]) 3.columns 为分割数据的可选方式 4.aggfunc 是处理函数,min/max/sum/mean等 参考链接: https://www.cnblogs.com/Yanjy-OnlyOne/p/11195621.html
分析Obama和Romney总出资最高的职业和雇主
def get_top_amount(group,key,n=5): totals = group.groupby(key)['contb_receipt_amt'].sum() return totals.sort_values(ascending=False)[:n] grouped = data_vs.groupby('cand_nm') grouped.apply(get_top_amount,'contbr_occupation',7)等价于
grouped = data_vs.groupby(['cand_nm','contbr_occupation']).sum() grouped.sort_values(by = ['cand_nm','contb_receipt_amt'],ascending = False).groupby('cand_nm').head(7)可以看出,Obama更受精英群体(律师、医生、咨询顾问)的欢迎,Romney则得到更多企业家或企业高管的支持
前面已经利用pd.cut()函数将出资额大小分到不同的区间里,现在对每个区间进行分组分析(数量、金额总和)
group_size = data_vs.groupby(['cand_nm',labels]).size().unstack(0) group_sum = data_vs.groupby(['cand_nm',labels]).sum().unstack(0) group_size.plot(kind = 'bar') group_sum.plot(kind = 'bar')
stack() 函数 - 不堆叠 unstack() 函数 - 堆叠 参数为选择索引重排的列
做百分比堆积图:
#堆积效果图 normed_sums = group_sum.div(group_sum.sum(axis = 1),axis = 0) normed_sums.plot(kind = 'bar',stacked = True)str与datetime格式之间的相互转化 常用时间格式:datetime.datetime, datetime.date, str datetime.now() -> datetime.datetime(2020, 10, 9, 17, 41, 2, 968610) 1.str转时间格式 1.1 str 转datetime.datetime t = ‘2020-02-01’ datetime.strptime(t, ‘%Y-%m-%d’) -> datetime.datetime(2020, 2, 1, 0, 0) 1.2str转datetime.date 先将str转为datetime.datetime,再转换为datetime.date t = ‘2020-02-01’ t2=datetime.strptime(t,’%Y-%m-%d’)->datetime.datetime(2020, 2, 1, 0, 0) datetime.date(t2) -> datetime.date(2020, 2, 1)
2.时间格式转str 2.1 datetime.datetime 转str datetime.now().strftime(’%Y-%m-%d’) -> ‘yyyy-mm-dd’ datetime.now().strftime(’%Y%m%d’) -> ‘yyyymmdd’ 2.2 datetime.date 转 str t = datetime.date(2020, 2, 1) str(t) -> ‘2020-02-01’
3. 字符串时间之间的转换 yyyymmdd -> yyyy-mm-dd先转换为时间格式再转回字符串 t = ‘20201009’ datetime.strptime(a,’%Y%m%d’).strftime(’%Y-%m-%d’) 或者 t.replace("-","")
4. 时间格式之间的转换 4.1 datetime.date 转datetime.datetime 先转为str t = datetime.date(2020,10,9) datetime.strptime(str(t),’%Y-%m-%d’) 4.2 datetime.datetime 转换为datetime.date datetime.now().date()
重采样(Resampling)指的是把时间序列的频度变为另一个频度的过程。把高频度的数据变为低频度叫做降采样(downsampling),resample会对数据进行分组,然后再调用聚合函数。这里我们把频率从每日转换为每月,属于高频转低频的降采样。
import matplotlib.pyplot as plt vs_time = data_vs.groupby('cand_nm').resample('M')['cand_nm'].count() vs_time2 = vs_time.unstack(0) #figsize调整plot的大小 fig1, ax1 = plt.subplots(figsize=(32,8)) #vs_time2.plot(kind = 'area') #ax调整大小,alpha调整颜色深度 vs_time2.plot(kind = 'area',ax = ax1,alpha = 0.6) plt.show()从上图可知,越接近大选,赞助热情越
