Data Analysis 3: pandas

    Technology  2024-11-15

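    Contents

    Why pandas
    pandas.Series
    Creating a Series
    Changing the data type (dtype)
    Slicing and indexing a Series
    Series attributes
    Reading external data with pandas
    pandas.DataFrame
    Creating a DataFrame (two-dimensional array)
    Basic DataFrame attributes
    Indexing a DataFrame
    Sorting, boolean indexing, and string methods
    Handling missing data in pandas
    Checking for NaN (pd.isnull(t))
    Dropping rows or columns that contain NaN
    Filling NaN (t.fillna())
    Common statistical methods in pandas
    Example 1
    Example 2
    Example 3 (counting movies per genre)
    Grouping and aggregation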

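    Why pandas

    NumPy is designed mainly for numeric data. pandas, which is built on NumPy, handles numeric data and also other types of data, such as strings. The two most commonly used pandas data types are: (1) Series, a one-dimensional labeled array; (2) DataFrame, a two-dimensional container of Series.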

    pandas.Series

    Creating a Series

    Create a Series from a list or from range/np.arange:

    import string
    import pandas as pd

    a = pd.Series([1, 2, 3])
    print(a)
    # 0    1
    # 1    2
    # 2    3
    # dtype: int64
    # The first column is the label/index; the second column is the data.
    print(type(a))
    # <class 'pandas.core.series.Series'>

    b = pd.Series([1, 2, 3], index=["a", "b", "c"])
    print(b)
    # a    1
    # b    2
    # c    3
    # dtype: int64

    Create a Series from a dictionary (the dictionary's keys become the index of the Series):

    dic = {"name": "xiaoming", "age": 18, "sex": "male"}
    a = pd.Series(dic)
    print(a)
    # name    xiaoming
    # age           18
    # sex         male
    # dtype: object

    dic = {string.ascii_uppercase[i]: i for i in range(3)}
    a = pd.Series(dic)
    print(a)
    # A    0
    # B    1
    # C    2
    # dtype: int64

    b = pd.Series(dic, index=list(string.ascii_uppercase[1:4]))
    print(b)
    # B    1.0
    # C    2.0
    # D    NaN
    # dtype: float64
    # When a different index is specified, values are taken where the labels match;
    # labels with no match get NaN.

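    Changing the data type (dtype)

    Changing the dtype in pandas works the same way as in NumPy, with astype.

    b = a.astype("float")
    print(b)
    # astype returns a new Series; the dtype of a itself is not changed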

    Slicing and indexing a Series
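
    A minimal sketch of common ways to slice and index a Series. The Series t and its labels are made up for illustration.

```python
import pandas as pd

# Hypothetical Series used only for illustration
t = pd.Series([10, 20, 30, 40, 50], index=list("abcde"))

by_label = t["b"]           # single value by label
by_position = t.iloc[1]     # single value by position
pos_slice = t.iloc[1:3]     # position slice: the end is excluded
label_slice = t["b":"d"]    # label slice: the end label is included
several = t[["a", "e"]]     # a list of labels
filtered = t[t > 25]        # a boolean mask

print(by_label, by_position)
print(pos_slice.tolist())
print(label_slice.tolist())
print(several.tolist())
print(filtered.tolist())
```

    Note the asymmetry: slicing by position excludes the end, while slicing by label includes it.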

    Series attributes

    t.index

    dic = {string.ascii_uppercase[i]: i for i in range(3)}
    a = pd.Series(dic)
    print(a)
    print(a.index)
    # Index(['A', 'B', 'C'], dtype='object')
    print(type(a.index))
    # <class 'pandas.core.indexes.base.Index'>

    # a.index is iterable and can be converted with list()
    for ind in a.index:
        print(ind)
    # A
    # B
    # C
    b = list(a.index)
    print(b)
    # ['A', 'B', 'C']

    t.values

    dic = {string.ascii_uppercase[i]: i for i in range(3)}
    a = pd.Series(dic)
    print(a.values)
    # [0 1 2]
    print(type(a.values))
    # <class 'numpy.ndarray'>
    print(a.max())
    # 2

    Reading external data with pandas

    Use pd.read_csv() directly; the data it returns is a DataFrame.
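
    A small sketch of pd.read_csv(). To keep it self-contained, the CSV content is a made-up string wrapped in io.StringIO rather than a real file; passing a file path works the same way.

```python
import io
import pandas as pd

# Made-up CSV content standing in for a file on disk
csv_text = "name,age\nxiaoming,18\nxiaohong,19\n"

df = pd.read_csv(io.StringIO(csv_text))
print(type(df))             # the result is a DataFrame
print(df.shape)             # (rows, columns)
print(df.columns.tolist())  # the header row becomes the column index
```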

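    pandas.DataFrame

    Creating a DataFrame (two-dimensional array)

    Create a DataFrame from an np.array or a nested list [[], []]:

    import numpy as np
    import pandas as pd

    a = pd.DataFrame(np.arange(12).reshape(3, 4))
    print(a)
    #    0  1   2   3
    # 0  0  1   2   3
    # 1  4  5   6   7
    # 2  8  9  10  11

    A DataFrame has both a row index and a column index. The row index identifies the rows and is called index (the same as index in pd.Series); it runs along axis 0 (axis=0). The column index identifies the columns and is called columns; it runs along axis 1 (axis=1).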
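
    To make the two axes concrete, the sketch below sums along each of them. Using sum as the operation is an illustrative choice, not something taken from the text above.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list("ABCD"))

col_sums = a.sum(axis=0)  # collapse the rows: one result per column
row_sums = a.sum(axis=1)  # collapse the columns: one result per row

print(col_sums.tolist())
print(row_sums.tolist())
```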

    Create a DataFrame from a dictionary:

    dic = {"name": ["xiaoming", "xiaohong"], "age": [18, 19], "sex": ["male", "female"]}
    a = pd.DataFrame(dic)
    print(a)
    #        name  age     sex
    # 0  xiaoming   18    male
    # 1  xiaohong   19  female

    Note that the dictionary keys become the column index (each row is one record). Alternatively, pass a list of dictionaries:

    s = [{"name": "xiaoming", "age": 18, "sex": "male"}, {"name": "xiaohong"}, {"name": "xiaozhang", "sex": "female"}]
    a = pd.DataFrame(s)
    print(a)
    #         name   age     sex
    # 0   xiaoming  18.0    male
    # 1   xiaohong   NaN     NaN
    # 2  xiaozhang   NaN  female
    # Keys missing from a dictionary become NaN in that row.

    Basic DataFrame attributes

    dic = {"name": ["xiaoming", "xiaohong"], "age": [18, 19], "sex": ["male", "female"]}
    a = pd.DataFrame(dic)
    print(a)
    #        name  age     sex
    # 0  xiaoming   18    male
    # 1  xiaohong   19  female
    print(a.shape)
    # (2, 3)
    print(a.dtypes)  # data type of each column
    # name    object
    # age      int64
    # sex     object
    # dtype: object
    print(a.ndim)
    # 2
    print(a.index)
    # RangeIndex(start=0, stop=2, step=1)
    print(a.columns)
    # Index(['name', 'age', 'sex'], dtype='object')
    print(a.values)
    # [['xiaoming' 18 'male']
    #  ['xiaohong' 19 'female']]
    print(type(a.values))
    # <class 'numpy.ndarray'>

    Indexing a DataFrame

    a = pd.DataFrame(np.arange(12).reshape(3, 4), index=list("abc"), columns=list("ABCD"))
    print(a)
    #    A  B   C   D
    # a  0  1   2   3
    # b  4  5   6   7
    # c  8  9  10  11

    1. t.loc: indexing by label

    print(a.loc["a"])
    # A    0
    # B    1
    # C    2
    # D    3
    print(a.loc[["a", "c"]])
    #    A  B   C   D
    # a  0  1   2   3
    # c  8  9  10  11
    print(a.loc[["a", "c"], "B"])
    # a    1
    # c    9
    # Name: B, dtype: int32
    # (the dtype is int32 or int64 depending on the platform)

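    With the default integer index, the row labels are the integers themselves:

    a = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list("ABCD"))
    print(a)
    #    A  B   C   D
    # 0  0  1   2   3
    # 1  4  5   6   7
    # 2  8  9  10  11
    print(a.loc[0])
    # A    0
    # B    1
    # C    2
    # D    3
    # Name: 0, dtype: int32
    print(a.loc[[0, 2], "B"])
    # 0    1
    # 2    9
    # Name: B, dtype: int32

    Back on the DataFrame with the row labels a, b, c:

    a = pd.DataFrame(np.arange(12).reshape(3, 4), index=list("abc"), columns=list("ABCD"))
    print(a.loc["a", "A":"C"])  # C is included: slices in loc include the end label
    # A    0
    # B    1
    # C    2
    # Name: a, dtype: int32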

    2. t.iloc: indexing by position

    print(a.iloc[0])
    # A    0
    # B    1
    # C    2
    # D    3
    # Name: a, dtype: int32
    print(a.iloc[[0, 2], [0, 2]])
    #    A   C
    # a  0   2
    # c  8  10
    print(a.iloc[[0, 2], 0:2])  # position 2 is excluded: slices in iloc exclude the end
    #    A  B
    # a  0  1
    # c  8  9

    Both loc and iloc can also be used for assignment.
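
    A minimal sketch of assignment through loc and iloc, reusing the 3x4 DataFrame from this section. The values being assigned are arbitrary.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(np.arange(12).reshape(3, 4), index=list("abc"), columns=list("ABCD"))

a.loc["a", "B"] = 100         # one cell, selected by label
a.iloc[1:, :2] = 0            # a block, selected by position (rows b-c, columns A-B)
a.loc[a["D"] > 10, "C"] = -1  # rows selected by a boolean mask

print(a)
```

    Assigning through a single loc or iloc call modifies the DataFrame in place.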

    Sorting, boolean indexing, and string methods

    1. Sorting (t.sort_values)

    t = t.sort_values(by="money")  # ascending by default; see the source for details
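
    A runnable sketch of sort_values. The DataFrame is hypothetical; only the "money" column name comes from the line above.

```python
import pandas as pd

# Hypothetical data for illustration
t = pd.DataFrame({"name": ["a", "b", "c"], "money": [30, 10, 20]})

ascending = t.sort_values(by="money")                    # default order
descending = t.sort_values(by="money", ascending=False)  # reversed order

print(ascending)
print(descending)
```

    sort_values returns a new DataFrame and leaves t unchanged, which is why the original line assigns the result back to t.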

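    2. Boolean indexing

    Combining several conditions without parentheses raises an error. Each condition should be wrapped in parentheses, and the conditions joined with &.

    3. String methods

    String methods (accessed through .str) are applied to every string in a Series. For example, str.split splits each string and returns a Series of lists, which can then be converted to a list with the tolist() method.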
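
    The sketch below illustrates both points on made-up data: conditions in parentheses joined with &, and a .str method applied to every element. The column names and values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    "name": ["Max", "Bella Rose", "Charlie", "Lucy Mae Ann"],
    "count": [900, 1200, 300, 1500],
})

# Each condition sits in its own parentheses; & means "and", | means "or"
popular_long = df[(df["count"] > 800) & (df["name"].str.len() > 4)]
print(popular_long["name"].tolist())

# .str methods act on every element of the Series
parts = df["name"].str.split(" ")  # a Series whose elements are lists
print(parts.tolist())
```

    The parentheses matter because & binds more tightly than the comparison operators, so without them the expression is grouped differently from what was intended.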

    Handling missing data in pandas

    Checking for NaN (pd.isnull(t))
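
    A minimal sketch of checking for NaN with pd.isnull and its counterpart pd.notnull. The DataFrame t is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with some missing values
t = pd.DataFrame({"age": [18, np.nan, np.nan], "sex": ["male", np.nan, "female"]})

mask = pd.isnull(t)                   # True wherever a value is missing
has_nan = mask.values.any()           # is there any NaN at all?
nan_per_column = mask.sum()           # number of NaN in each column
age_known = t[pd.notnull(t["age"])]   # keep only rows where age is present

print(mask)
print(has_nan)
print(nan_per_column)
print(age_known)
```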

    Dropping rows or columns that contain NaN

    import pandas as pd

    s = [{"name": "xiaoming", "age": 18, "sex": "male"}, {"name": "xiaohong"}, {"name": "xiaozhang", "sex": "female"}]
    a = pd.DataFrame(s)
    print(a)
    #         name   age     sex
    # 0   xiaoming  18.0    male
    # 1   xiaohong   NaN     NaN
    # 2  xiaozhang   NaN  female
    print(a.dropna(axis=0))  # how='any' by default: drop a row if it contains any NaN
    #        name   age   sex
    # 0  xiaoming  18.0  male
    print(a.dropna(axis=0, how='all', inplace=False))
    # how='all': drop a row only if every value in it is NaN
    # inplace defaults to False, so a itself is not modified
    #         name   age     sex
    # 0   xiaoming  18.0    male
    # 1   xiaohong   NaN     NaN
    # 2  xiaozhang   NaN  female

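    Filling NaN (t.fillna())

    s = [{"name": "xiaoming", "age": 18, "sex": "male"}, {"name": "xiaohong"}, {"name": "xiaozhang", "sex": "female"}]
    a = pd.DataFrame(s)
    print(a)
    #         name   age     sex
    # 0   xiaoming  18.0    male
    # 1   xiaohong   NaN     NaN
    # 2  xiaozhang   NaN  female
    print(a.fillna(10))
    #         name   age     sex
    # 0   xiaoming  18.0    male
    # 1   xiaohong  10.0      10
    # 2  xiaozhang  10.0  female
    # numeric_only=True restricts the mean to numeric columns;
    # recent pandas versions raise an error on the string columns otherwise
    print(a.mean(numeric_only=True))
    # age    18.0
    # dtype: float64
    print(a.fillna(a.mean(numeric_only=True)))
    #         name   age     sex
    # 0   xiaoming  18.0    male
    # 1   xiaohong  18.0     NaN
    # 2  xiaozhang  18.0  female
    print(a["age"].fillna(a["age"].mean()))
    # 0    18.0
    # 1    18.0
    # 2    18.0
    # Name: age, dtype: float64

    In pandas, aggregations such as the mean do not include NaN values in the calculation.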
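
    A short sketch contrasting the pandas behavior with a plain NumPy array. The values are made up; skipna is the pandas parameter that controls whether NaN is skipped.

```python
import numpy as np
import pandas as pd

values = [18.0, np.nan, 22.0]

pandas_mean = pd.Series(values).mean()              # NaN skipped: (18 + 22) / 2
numpy_mean = np.array(values).mean()                # NaN propagates to the result
strict_mean = pd.Series(values).mean(skipna=False)  # opt out of skipping

print(pandas_mean, numpy_mean, strict_mean)
```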

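    Common statistical methods in pandas

    Example 1:

    Given data on the 1000 most popular movies from 2006 to 2016, we want to find information such as the average rating and the number of directors. Data source: https://www.kaggle.com/damianpanek/sunday-eda/data

    import pandas as pd

    df = pd.read_csv('/content/drive/My Drive/IMDB-Movie-Data.csv')
    df.info()
    df.head(1)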

    # Average movie rating
    df['Rating'].mean()

    # Number of entries in the Director column
    len(df["Director"])  # 1000 (may contain duplicates)

    # Number of distinct directors
    len(set(df['Director']))  # 644
    len(df["Director"].unique())  # 644

    # Number of actors
    type(df['Actors'])  # pandas.core.series.Series
    df["Actors"][0]  # 'Chris Pratt, Vin Diesel, Bradley Cooper, Zoe Saldana'
    df["Actors"].str.split(',')  # pandas.core.series.Series, i.e. a Series whose elements are lists

    df['Actors'].str.split(",").tolist()  # a list of lists

    pre_actors_list = df["Actors"].str.split(",").tolist()
    # Flatten the nested list; strip() removes the space left after each comma,
    # so that the same name is not counted twice
    actor_list = [i.strip() for j in pre_actors_list for i in j]
    actor_num = len(set(actor_list))
    # actor_num = len(np.unique(actor_list))  # np.unique() works on a 1-D array or a list
    actor_num

    Example 2:

    Example 3 (counting movies per genre):

    import numpy as np
    import pandas as pd
    from matplotlib import pyplot as plt

    # Count the number of movies in each genre (Genre)
    df = pd.read_csv('IMDB-Movie-Data.csv')
    ge = df["Genre"].str.split(',').tolist()  # a list of lists
    # Distinct genres as a sorted list (a set cannot be used as columns)
    ge2 = sorted(set(i for j in ge for i in j))
    ge_num = len(ge2)

    # Create a DataFrame filled with zeros
    df_zeros = pd.DataFrame(np.zeros((df.shape[0], ge_num)), columns=ge2)
    print(df_zeros)
    for i in range(df.shape[0]):
        df_zeros.loc[i, ge[i]] = 1  # mark every genre of movie i with 1
    # print(df_zeros.head(3))
    genre_count = df_zeros.sum(axis=0)  # pandas.Series
    genre_count = genre_count.sort_values()  # ascending order
    print(genre_count)

    # Draw a bar chart
    plt.figure(figsize=(20, 8), dpi=80)
    plt.bar(range(len(genre_count)), genre_count.values)
    x_ = genre_count.index
    plt.xticks(range(len(genre_count)), x_)
    plt.show()

    Grouping and aggregation

    See Python for Data Analysis (《利用Python进行数据分析》), p. 274.
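
    For a quick feel of the topic, here is a minimal sketch of grouping and aggregation with groupby. The data and column names are made up and are not taken from the book.

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    "genre": ["Action", "Drama", "Action", "Drama", "Comedy"],
    "rating": [8.0, 7.0, 6.0, 9.0, 5.0],
})

grouped = df.groupby("genre")["rating"]  # group the rows by genre
mean_rating = grouped.mean()             # one aggregate per group
summary = grouped.agg(["count", "max"])  # several aggregates at once

print(mean_rating)
print(summary)
```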
