文章目录
1. 避免SettingWithCopyWarning报警2. axis参数怎么理解3. 索引相关操作4. 分层索引MultiIndex5. merge:关联多表数据6. concat:实现数据的合并7. 批量拆分或合并excel8. 拆分一列内容变多列9. Pandas的Categorical数据类型可以降低数据存储提升计算速度10. 调整列的顺序11. explode实现一列列表内容变多行12. 复杂多列到多行转换
1. 避免SettingWithCopyWarning报警
condition
= df
["ymd"].str.startswith
("2018-03")
df
[condition
]["wen_cha"] = df
["bWendu"]-df
["yWendu"]
df
.loc
[condition
, "wen_cha"] = df
["bWendu"]-df
["yWendu"]
df2
= df
[condition
].copy
()
df2
["wen_cha"] = df
["bWendu"]-df
["yWendu"]
2. axis参数怎么理解
df
= pd
.DataFrame
(
np
.arange
(12).reshape
(3,4),
columns
=['A', 'B', 'C', 'D']
)
df
.drop
("A", axis
=1)
df
.drop
(1, axis
=0)
df
.mean
(axis
=0)
df
.mean
(axis
=1)
按哪个axis,就是这个axis要动起来
(类似被
for遍历
),另一个axis保持不动,就像梳子
3. 索引相关操作
df
.set_index
("userId", inplace
=True, drop
=False)
df_shuffle
.index
.is_monotonic_increasing
df_shuffle
.index
.is_unique
df_sorted
= df_shuffle
.sort_index
()
4. 分层索引MultiIndex
分层索引:在一个轴向上拥有多个索引层级,可以表达更高维度数据的形式更方便的进行数据筛选,如果有序则性能更好多个key经过groupby等操作的结果是分层索引
Series的分层索引MultiIndex
ser
= stocks
.groupby
(['公司', '日期'])['收盘'].mean
()
ser
.index
ser
.unstack
()
ser
.reset_index
()
ser
.loc
[('BIDU', '2019-10-02')]
DataFrame的多层索引MultiIndex
stocks
.set_index
(['公司', '日期'], inplace
=True)
stocks
.index
stocks
.sort_index
(inplace
=True)
stocks
.loc
[('BIDU', '2019-10-02'), :]
stocks
.loc
[['BIDU', 'JD'], :]
stocks
.loc
[(['BIDU', 'JD'], '2019-10-03'), :]
stocks
.loc
[(slice(None), ['2019-10-02', '2019-10-03']), :]
stocks
.reset_index
()
5. merge:关联多表数据
Pandas的Merge,相当于Sql的Join,将不同的表按key关联到一个表
如果left_on和right_on是列表,则可以按照多列相同而合并
merge的语法:
pd
.merge
(left
, right
, how
='inner', on
=None, left_on
=None, right_on
=None,
left_index
=False, right_index
=False, sort
=True, suffixes
=('_x', '_y'),
copy
=True, indicator
=False, validate
=None)
left,right:要merge的dataframe或者有name的Serieshow:join类型,‘left’, ‘right’, ‘outer’, ‘inner’(默认)on:join的key,left和right都需要有这个key,相当于left_on和right_on赋值同一个keyleft_on:left的df或者series的keyright_on:right的df或者seires的keyleft_index,right_index:使用index而不是普通的column做joinsuffixes:两个元素的后缀,如果需关联的列有重名,自动添加后缀,默认是(’_x’, ‘_y’)
pd
.merge
(left
, right
, on
='key', suffixes
=('_left', '_right'))
pd
.merge
(left
, right
, left_on
='key', right_on
='key')
理解left join、right join、inner join、outer join的区别
6. concat:实现数据的合并
可以批量多个Pandas对象(DataFrame/Series);给DataFrame添加行;给DataFrame添加列
concat语法:
pandas
.concat
(objs
, axis
=0, join
='outer', ignore_index
=False)
pd
.concat
(
[pd
.DataFrame
([i
], columns
=['A']) for i
in range(5)],
ignore_index
=True
)
7. 批量拆分或合并excel
将一个大Excel等份拆成多个Excel
user_names
= ["xiao_shuai", "xiao_wang", "xiao_ming", "xiao_lei"]
split_size
= total_row_count
// len(user_names
)
if total_row_count
% len(user_names
) != 0:
split_size
+= 1
df_subs
= []
for idx
, user_name
in enumerate(user_names
):
begin
= idx
*split_size
end
= begin
+split_size
df_sub
= df_source
.iloc
[begin
:end
]
df_subs
.append
((idx
, user_name
, df_sub
))
for idx
, user_name
, df_sub
in df_subs
:
file_name
= f
"{splits_dir}/crazyant_blog_articles_{idx}_{user_name}.xlsx"
df_sub
.to_excel
(file_name
, index
=False)
合并多个小Excel到一个大Excel
import os
excel_names
= []
for excel_name
in os
.listdir
(splits_dir
):
excel_names
.append
(excel_name
)
excel_names
df_list
= []
for excel_name
in excel_names
:
excel_path
= f
"{splits_dir}/{excel_name}"
df_split
= pd
.read_excel
(excel_path
)
username
= excel_name
.replace
("crazyant_blog_articles_", "").replace
(".xlsx", "")[2:]
print(excel_name
, username
)
df_split
["username"] = username
df_list
.append
(df_split
)
df_merged
= pd
.concat
(df_list
)
df_merged
.to_excel
(f
"{work_dir}/crazyant_blog_articles_merged.xlsx", index
=False)
8. 拆分一列内容变多列
def split_func(line
):
line
["姓名"], line
["性别"], line
["年龄"], line
["城市"] = line
["数据"].split
(":")
return line
df
= df
.apply(split_func
, axis
=1)
9. Pandas的Categorical数据类型可以降低数据存储提升计算速度
df
.info
(memory_usage
="deep")
%timeit df
.groupby
("Gender").size
()
df_cat
["Gender"] = df_cat
["Gender"].astype
("category")
df_cat
.info
(memory_usage
="deep")
%timeit df_cat
.groupby
("Gender").size
()
10. 调整列的顺序
常规方法
df_merge
.columns
new_columns
= df_merge
.columns
.to_list
()
for name
in ["姓名", "性别"][::-1]:
new_columns
.remove
(name
)
new_columns
.insert
(new_columns
.index
("学号")+1, name
)
new_columns
df_merge
= df_merge
.reindex
(columns
=new_columns
)
自己的方法
df_sm2_cols
= list(df_sm2
)
df_sm2_cols
.insert
(0, df_sm2_cols
.pop
(df_sm2_cols
.index
('Site')))
df_sm2
= df_sm2
.loc
[:, df_sm2_cols
]
11. explode实现一列列表内容变多行
语法:pandas.DataFrame.explode(column) 将dataframe的一个list-like的元素按行复制,index索引随之复制
df
["Genre"] = df
["Genres"].map(lambda x
:x
.split
("|"))
print(df
["Genre"][0])
print(type(df
["Genre"][0]))
df_new
= df
.explode
("Genre")
12. 复杂多列到多行转换
分析:
一行变多行,可以用explode实现要使用explode,需要先将多列变成一列注意有的列为空,需要做空值过滤
df
.columns
merge_names
= list(df
.loc
[:, "Supplier":].columns
.values
)
def merge_cols(x
):
"""
x是一个行Series,把它们按分隔符合并
"""
x
= x
[x
.notna
()]
y
= x
.values
result
= []
for idx
in range(0, len(y
), 2):
result
.append
(f
"{y[idx]}|{y[idx+1]}")
return "#".join
(result
)
df
["merge"] = df
.loc
[:, "Supplier":].apply(merge_cols
, axis
=1)
df
.drop
(merge_names
, axis
=1, inplace
=True)
df
df
["merge"] = df
["merge"].str.split
("#")
df_explode
= df
.explode
("merge")
df_explode
.drop
("merge", axis
=1, inplace
=True)
df_explode
df_explode
["Supplier"]=df_explode
["merge"].str.split
("|").str[0]
df_explode
["Supplier PN"]=df_explode
["merge"].str.split
("|").str[1]
df_explode