Using the pandas Module (Part 2)

Technology, 2022-07-17

Merging data with join:

join: by default, join combines rows that share the same row index. The session below assumes numpy and pandas were already imported as np and pd.

    In [6]: t1 = pd.DataFrame(np.zeros((2,5)),index=["A","B"],columns=list("VWXYZ"))

    In [7]: t1
    Out[7]:
         V    W    X    Y    Z
    A  0.0  0.0  0.0  0.0  0.0
    B  0.0  0.0  0.0  0.0  0.0

    In [8]: t2 = pd.DataFrame(np.ones((3,4)),index=list("ABC"),columns=list("0123"))

    In [9]: t2
    Out[9]:
         0    1    2    3
    A  1.0  1.0  1.0  1.0
    B  1.0  1.0  1.0  1.0
    C  1.0  1.0  1.0  1.0

    In [10]: t1.join(t2)
    Out[10]:
         V    W    X    Y    Z    0    1    2    3
    A  0.0  0.0  0.0  0.0  0.0  1.0  1.0  1.0  1.0
    B  0.0  0.0  0.0  0.0  0.0  1.0  1.0  1.0  1.0

    In [11]: t2.join(t1)
    Out[11]:
         0    1    2    3    V    W    X    Y    Z
    A  1.0  1.0  1.0  1.0  0.0  0.0  0.0  0.0  0.0
    B  1.0  1.0  1.0  1.0  0.0  0.0  0.0  0.0  0.0
    C  1.0  1.0  1.0  1.0  NaN  NaN  NaN  NaN  NaN

As the output shows, join merges rows with the same index and uses the left operand as the base: every row of the left frame is kept, and columns from the right frame are filled with NaN where the right frame has no matching index.
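The left-based behaviour described above is only the default. As a minimal sketch, reusing the same two frames, the how parameter of DataFrame.join selects other join types; the variable names left, outer and inner are illustrative only.

```python
import numpy as np
import pandas as pd

# Same frames as in the session above
t1 = pd.DataFrame(np.zeros((2, 5)), index=["A", "B"], columns=list("VWXYZ"))
t2 = pd.DataFrame(np.ones((3, 4)), index=list("ABC"), columns=list("0123"))

left = t1.join(t2)                # default how="left": keeps t1's rows A and B
outer = t1.join(t2, how="outer")  # union of both indexes: A, B and C
inner = t2.join(t1, how="inner")  # intersection of both indexes: A and B

print(left.shape, outer.shape, inner.shape)
```

With how="outer", row C appears even though t1 is the left operand, and its V..Z columns are NaN.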

Merging data with merge:

    In [25]: t1
    Out[25]:
         V    W  X    Y    Z
    A  0.0  0.0  c  0.0  0.0
    B  0.0  0.0  d  0.0  0.0

    In [26]: t2
    Out[26]:
         M    N    P    Q  O
    A  1.0  1.0  1.0  1.0  a
    B  1.0  1.0  1.0  1.0  b
    C  1.0  1.0  1.0  1.0  c

    In [27]: t1.merge(t2,left_on="X",right_on="O")  # default how is "inner": intersection
    Out[27]:
         V    W  X    Y    Z    M    N    P    Q  O
    0  0.0  0.0  c  0.0  0.0  1.0  1.0  1.0  1.0  c

    In [28]: t1.merge(t2,left_on="X",right_on="O",how="inner")  # inner join
    Out[28]:
         V    W  X    Y    Z    M    N    P    Q  O
    0  0.0  0.0  c  0.0  0.0  1.0  1.0  1.0  1.0  c

    In [29]: t1.merge(t2,left_on="X",right_on="O",how="outer")  # outer join: union, gaps filled with NaN
    Out[29]:
         V    W    X    Y    Z    M    N    P    Q    O
    0  0.0  0.0    c  0.0  0.0  1.0  1.0  1.0  1.0    c
    1  0.0  0.0    d  0.0  0.0  NaN  NaN  NaN  NaN  NaN
    2  NaN  NaN  NaN  NaN  NaN  1.0  1.0  1.0  1.0    a
    3  NaN  NaN  NaN  NaN  NaN  1.0  1.0  1.0  1.0    b

    In [30]: t1.merge(t2,left_on="X",right_on="O",how="left")  # left join: left frame is the base, gaps filled with NaN
    Out[30]:
         V    W  X    Y    Z    M    N    P    Q    O
    0  0.0  0.0  c  0.0  0.0  1.0  1.0  1.0  1.0    c
    1  0.0  0.0  d  0.0  0.0  NaN  NaN  NaN  NaN  NaN

    In [31]: t1.merge(t2,left_on="X",right_on="O",how="right")  # right join: right frame is the base, gaps filled with NaN
    Out[31]:
         V    W    X    Y    Z    M    N    P    Q  O
    0  NaN  NaN  NaN  NaN  NaN  1.0  1.0  1.0  1.0  a
    1  NaN  NaN  NaN  NaN  NaN  1.0  1.0  1.0  1.0  b
    2  0.0  0.0    c  0.0  0.0  1.0  1.0  1.0  1.0  c

As the output shows, merge matches rows on the columns named by left_on and right_on: wherever the two columns hold the same value, the corresponding rows are joined into a single row. Unlike join, it ignores the row index here, and the result gets a new integer index.
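When reading an outer merge it can be hard to tell which side each row came from. The sketch below uses cut-down versions of t1 and t2 (only the columns V, X, M and O, an assumption made to keep it short) and the indicator=True option of merge, which adds a _merge column marking each row as left_only, right_only or both.

```python
import pandas as pd

# Reduced versions of the frames above: one value column and one key column each
t1 = pd.DataFrame({"V": [0.0, 0.0], "X": ["c", "d"]}, index=["A", "B"])
t2 = pd.DataFrame({"M": [1.0, 1.0, 1.0], "O": ["a", "b", "c"]}, index=list("ABC"))

# Outer merge on X == O, with a column recording the origin of each row
merged = t1.merge(t2, left_on="X", right_on="O", how="outer", indicator=True)
print(merged)

# Rows that matched on both sides are exactly what how="inner" would return
both = merged[merged["_merge"] == "both"]
print(both)
```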

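Exercise:

We have a dataset of Starbucks stores worldwide. Suppose we want to know whether the United States or China has more Starbucks stores, or how many stores each Chinese province has. How do we find out?

First, how do we count the stores in the United States and in China?

Data source: https://www.kaggle.com/starbucks/store-locations/data

Data format: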

    Brand | Store Number | Store Name | Ownership Type | Street Address | City | State/Province | Country | Postcode | Phone Number | Timezone | Longitude | Latitude
    Starbucks | 47370-257954 | Meritxell, 96 | Licensed | Av. Meritxell, 96 | Andorra la Vella | 7 | AD | AD500 | 376818720 | GMT+1:00 Europe/Andorra | 1.53 | 42.51
    Starbucks | 22331-212325 | Ajman Drive Thru | Licensed | 1 Street 69, Al Jarf | Ajman | AJ | AE | | | GMT+04:00 Asia/Dubai | 55.47 | 25.42
    Starbucks | 47089-256771 | Dana Mall | Licensed | Sheikh Khalifa Bin Zayed St. | Ajman | AJ | AE | | | GMT+04:00 Asia/Dubai | 55.47 | 25.39

Code:

    import pandas as pd

    df = pd.read_csv("./starbucks_store_worldwide.csv")
    # print(df.info())
    # print(df.head(1))

    # Group (aggregate) by country
    country_info = df.groupby(by="Country")

    # Iterate over the groups and print each one
    for i,j in country_info:
        print("-"*50)
        print(i)    # the group key (country code)
        print("*"*50)
        print(j)    # the DataFrame for that country

    # Count the Brand entries for each country
    country_num = country_info["Brand"].count()
    print(country_num)
    df[df["Country"]=="US"]    # boolean-mask selection of the US rows (result unused here)

    # Print the number of Starbucks stores in the US and in China
    print(country_num["US"])
    print(country_num["CN"])

    # Count the stores in each Chinese province
    china_data = df[df["Country"] == "CN"]
    # Group by province
    grouped = china_data.groupby(by="State/Province").count()["Brand"]
    print(grouped)

    # Group by several keys at once
    grouped = df["Brand"].groupby(by=[df["Country"],df["State/Province"]]).count()
    print(grouped)
    print(type(grouped))

    # Group by several keys and get a DataFrame back
    grouped1 = df[["Brand"]].groupby(by=[df["Country"],df["State/Province"]]).count()
    grouped2 = df.groupby(by=[df["Country"],df["State/Province"]]).count()[["Brand"]]
    grouped3 = df.groupby(by=[df["Country"],df["State/Province"]])[["Brand"]].count()

    print(grouped1,type(grouped1))    # <class 'pandas.core.frame.DataFrame'>
    print("*"*50)
    print(grouped2,type(grouped2))    # <class 'pandas.core.frame.DataFrame'>
    print("*"*50)
    print(grouped3,type(grouped3))    # <class 'pandas.core.frame.DataFrame'>

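Grouping and aggregation:

pandas offers a very simple way to do this kind of grouping:

    df.groupby(by="columns_name")
    grouped = df.groupby(by="columns_name")

grouped is a DataFrameGroupBy object and is iterable. Each element of grouped is a tuple of (the group key, i.e. the value grouped on, the DataFrame for that group).

What if we need to group by both country and province?

    grouped = df.groupby(by=[df["Country"],df["State/Province"]])

Often we only want part of the data after grouping, or only want to group a few columns. What then?

Select part of the data after grouping:

    df.groupby(by=["Country","State/Province"])["Country"].count()

Group only certain columns:

    df["Country"].groupby(by=[df["Country"],df["State/Province"]]).count()

Because only a single column was selected, the result is a Series.

    t1 = df[["Country"]].groupby(by=[df["Country"],df["State/Province"]]).count()
    t2 = df.groupby(by=["Country","State/Province"])[["Country"]].count()

These two commands produce the same result. The difference from the earlier result is that, because the column was selected with double brackets, they return a DataFrame.

DataFrameGroupBy objects provide many optimized methods:

    Function    Description
    count       Number of non-NA values in the group
    sum         Sum of non-NA values
    mean        Mean of non-NA values
    median      Arithmetic median of non-NA values
    std, var    Unbiased (denominator n-1) standard deviation and variance
    min, max    Minimum and maximum of non-NA values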
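The grouping steps above can be tried without downloading the dataset. The sketch below uses a small made-up frame (the rows and values are invented, only the column names follow the Starbucks data) to show iteration over the groups, a per-group count, and several aggregations at once via agg.

```python
import pandas as pd

# Toy stand-in for the Starbucks data; the values are made up for illustration
df = pd.DataFrame({
    "Country": ["US", "US", "CN", "CN", "CN"],
    "State/Province": ["CA", "NY", "31", "31", "11"],
    "Brand": ["Starbucks"] * 5,
    "Latitude": [34.0, 40.7, 31.2, 31.3, 39.9],
})

grouped = df.groupby(by="Country")

# Each element is (group key, DataFrame holding that group's rows)
for country, frame in grouped:
    print(country, len(frame))

# Single brackets give a Series, indexed by the group key
counts = grouped["Brand"].count()

# agg applies several of the aggregation functions in one call
stats = grouped["Latitude"].agg(["mean", "max"])
print(counts)
print(stats)
```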

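Indexes and MultiIndexes:

Simple index operations:

Get the index:

    df.index

Assign a new index:

    df.index = ['x','y']

Reindex:

    df.reindex(list("abcedf"))    # returns a new object; labels not in the original index get NaN rows

Use a column as the index:

    df.set_index("Country",drop=False)    # drop=False keeps the original column in the data

Get the unique values of the index:

    df.set_index("Country").index.unique()

Suppose a is a DataFrame. What does the result look like when we set two index levels with a.set_index(["c","d"])?

    a = pd.DataFrame({'a': range(7),'b': range(7, 0, -1),'c': ['one','one','one','two','two','two', 'two'],'d': list("hjklmno")})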
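Two details of these operations are easy to miss: reindex only produces NaN for labels that were not in the original index, and neither reindex nor set_index changes the frame in place. A minimal sketch with a made-up three-row frame (the column n and the labels are invented for illustration):

```python
import pandas as pd

# Made-up frame for illustration
df = pd.DataFrame({"Country": ["US", "CN", "US"], "n": [1, 2, 3]}, index=list("abc"))

# "a" and "b" exist and keep their values; "x" is new and becomes a NaN row
reindexed = df.reindex(list("abx"))
print(reindexed)

# drop=False keeps Country as a column as well as using it for the index
by_country = df.set_index("Country", drop=False)
print(by_country)

# Unique index values, in order of first appearance
unique_countries = df.set_index("Country").index.unique()
print(list(unique_countries))

# The original frame is untouched by all of the calls above
print(list(df.index))
```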

MultiIndex on a Series:

    In [52]: a
    Out[52]:
       a  b    c  d
    0  0  7  one  h
    1  1  6  one  j
    2  2  5  one  k
    3  3  4  two  l
    4  4  3  two  m
    5  5  2  two  n
    6  6  1  two  o

    In [53]: X = a.set_index(["c","d"])["a"]

    In [54]: X
    Out[54]:
    c    d
    one  h    0
         j    1
         k    2
    two  l    3
         m    4
         n    5
         o    6
    Name: a, dtype: int64

    In [55]: X["one","h"]  # indexing a Series with a MultiIndex: write the labels of each level inside the brackets
    Out[55]: 0

    In [10]: type(X)
    Out[10]: pandas.core.series.Series

    In [11]: X.swaplevel()  # swap the inner and outer index levels
    Out[11]:
    d  c
    h  one    0
    j  one    1
    k  one    2
    l  two    3
    m  two    4
    n  two    5
    o  two    6
    Name: a, dtype: int64

    In [12]: X.swaplevel()["h"]  # "h" is now on the outer level, so it can be selected directly
    Out[12]:
    c
    one    0
    Name: a, dtype: int64

    In [13]: X.index.levels
    Out[13]: FrozenList([['one', 'two'], ['h', 'j', 'k', 'l', 'm', 'n', 'o']])

    In [14]: X.swaplevel().index.levels
    Out[14]: FrozenList([['h', 'j', 'k', 'l', 'm', 'n', 'o'], ['one', 'two']])

MultiIndex on a DataFrame:

    In [18]: a
    Out[18]:
       a  b    c  d
    0  0  7  one  h
    1  1  6  one  j
    2  2  5  one  k
    3  3  4  two  l
    4  4  3  two  m
    5  5  2  two  n
    6  6  1  two  o

    In [19]: x = a.set_index(["c","d"])[["a"]]  # pandas.core.frame.DataFrame

    In [20]: x
    Out[20]:
           a
    c   d
    one h  0
        j  1
        k  2
    two l  3
        m  4
        n  5
        o  6

    In [21]: x.loc["one"]
    Out[21]:
       a
    d
    h  0
    j  1
    k  2

    In [22]: x.loc["one"].loc["h"]
    Out[22]:
    a    0
    Name: h, dtype: int64
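The chained x.loc["one"].loc["h"] and the swaplevel trick both work, but pandas also allows the same selections in a single step. As a sketch built on the same frame a, a tuple passed to .loc addresses both levels at once, and xs selects on an inner level by name; the variable names cell and inner are illustrative.

```python
import pandas as pd

# Same frame as in the session above
a = pd.DataFrame({
    "a": range(7),
    "b": range(7, 0, -1),
    "c": ["one", "one", "one", "two", "two", "two", "two"],
    "d": list("hjklmno"),
})
x = a.set_index(["c", "d"])[["a"]]

# A tuple addresses both index levels in one .loc call
cell = x.loc[("one", "h"), "a"]

# xs selects on the inner level "d" without swapping levels first
inner = x.xs("h", level="d")

print(cell)
print(inner)
```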

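Using the data from the previous exercise:

Use matplotlib to plot the 10 countries with the most stores.
Use matplotlib to plot the number of stores in each Chinese city.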

Code 1:

    import pandas as pd
    from matplotlib import pyplot as plt

    # Load the data
    df = pd.read_csv("./starbucks_store_worldwide.csv")

    # Count stores per country and keep the 10 largest
    country_data = df.groupby(by="Country")["Brand"].count().sort_values(ascending=False)[:10]

    # Set the figure size
    plt.figure(figsize=(20,8),dpi=80)

    # Draw the bar chart
    plt.bar(range(len(country_data)),country_data,width=0.4,color="pink")

    # Set the x-axis ticks to the country codes
    plt.xticks(range(len(country_data)),country_data.index)

    # Show the figure
    plt.show()

Result:

Code 2:

    import pandas as pd
    from matplotlib import pyplot as plt
    import matplotlib

    # Use a font that can render Chinese characters
    font = {'family' : 'WenQuanYi Micro Hei',
            'weight' : 'bold',
            'size' : '10'}
    matplotlib.rc("font",**font)

    # Load the data
    df = pd.read_csv("./starbucks_store_worldwide.csv")
    print(df.info())

    # Keep the Chinese stores, count per city, and take the 25 largest
    df = df[df["Country"]=="CN"]
    china_data = df.groupby(by="City")["Brand"].count().sort_values(ascending=False)[:25]
    print(china_data)

    # Set the figure size
    plt.figure(figsize=(20,8),dpi=80)

    # Draw the bar chart (len(china_data) instead of a hard-coded 25, in case there are fewer cities)
    plt.bar(range(len(china_data)),china_data.values,width=0.4,color="green")

    # Set the x-axis ticks to the city names
    plt.xticks(range(len(china_data)),china_data.index)

    # Show the figure
    plt.show()

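Result:

Exercise:

We now have data on 10,000 of the top-ranked books worldwide. Compute the following:

The number of books for each publication year.
The average rating of books for each publication year.

Data source: https://www.kaggle.com/zygmunt/goodbooks-10k

Data format: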

Code:

    import pandas as pd
    from matplotlib import pyplot as plt

    # Load the data
    df = pd.read_csv("./books.csv")
    # print(df.info())
    # print(df.head(1))

    # Drop rows where the publication year is missing
    # df = df[pd.notnull(df["original_publication_year"])]

    # Extract the data
    # data_book_count = df.groupby(by="original_publication_year").count()["title"]
    data_book_avg = df["average_rating"].groupby(by=df["original_publication_year"]).mean()

    _x = data_book_avg.index
    _y = data_book_avg.values

    # Set the figure size
    plt.figure(figsize=(20,8),dpi=80)

    # Draw the line chart
    plt.plot(range(len(_x)),_y)

    # Label every 10th year on the x-axis
    plt.xticks(list(range(len(_x)))[::10],_x[::10],rotation=45)

    # Show the figure
    plt.show()
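The script above plots only the average rating; the count per year is left commented out. Both questions can be answered from one groupby, as in the sketch below. It uses a four-row made-up frame (the titles, years and ratings are invented; only the column names come from the script) and also shows that groupby drops rows whose key is NaN by default, which is why the commented-out notnull filter is optional here.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; one book has no publication year
books = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "original_publication_year": [2001.0, 2001.0, 2005.0, np.nan],
    "average_rating": [4.0, 3.0, 4.5, 5.0],
})

by_year = books.groupby(by="original_publication_year")

# Number of books per year
book_count = by_year["title"].count()

# Average rating per year
book_avg = by_year["average_rating"].mean()

print(book_count)
print(book_avg)
```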

Result:
