pandas: From Basics to Advanced Usage (Part 1)

Technology | 2025-08-02

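Table of Contents

1. Reading data
2. Viewing basic information about a dataset
3. pandas data structures: DataFrame and Series
4. Querying data in pandas (.loc is recommended: it can both query and overwrite)
5. Adding new data columns
6. Statistical functions
7. Handling missing values
8. Sorting data
9. String processing
10. Handling date and time data
11. Grouping with groupby
12. Processing other columns after groupby
13. Three ways to iterate over a DataFrame by row
14. DataFrame assignment methods
15. Data transformation functions: map, apply, applymap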

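1. Reading data

    import pandas as pd

    # Read a CSV file with pd.read_csv
    fpath = "./datas/ml-latest-small/ratings.csv"
    df = pd.read_csv(fpath)

    # Read a txt file, specifying the separator and the column names yourself
    fpath = "./datas/crazyant/access_pvuv.txt"
    df = pd.read_csv(
        fpath,
        sep="\t",
        header=None,
        names=['pdate', 'pv', 'uv']
    )
    # If the file has no header row (by default the first line is taken as the column names),
    # pass header=None and supply your own list of column names via names

    # Read an Excel file
    fpath = "./datas/crazyant/access_pvuv.xlsx"
    df = pd.read_excel(fpath, skiprows=2)  # skiprows skips the 2 blank rows at the top

    # Read from a MySQL database
    # (recent pandas versions warn when given a raw DBAPI connection other than sqlite3;
    # a SQLAlchemy engine is the officially supported alternative)
    import pymysql
    conn = pymysql.connect(
        host='127.0.0.1',
        user='root',
        password='test',
        database='test',
        charset='utf8'
    )
    mysql_df = pd.read_sql("select * from table_name", con=conn)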
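The header=None / names combination can be tried without any file on disk. The sketch below is a minimal illustration that assumes a made-up two-line, tab-separated sample (the values are hypothetical and not taken from access_pvuv.txt) and reads it through io.StringIO:

```python
import io

import pandas as pd

# Hypothetical sample standing in for a tab-separated file without a header row
raw = "2019-09-10\t139\t92\n2019-09-09\t185\t153\n"

df_demo = pd.read_csv(
    io.StringIO(raw),
    sep="\t",
    header=None,                   # the first line is data, not column names
    names=["pdate", "pv", "uv"],   # supply the column names ourselves
)
print(df_demo)
```

Without header=None, the first data line would be consumed as the column names and one record would be lost.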

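2. Viewing basic information about a dataset

    # Check the object type
    type(df)
    # View the first rows (5 by default)
    df.head()
    # View the last rows
    df.tail()
    # Shape of the data, returned as (rows, columns)
    df.shape
    # Number of non-null values in each column
    df.count()
    # List of column names
    df.columns
    # The index
    df.index
    # Data type of each column
    df.dtypes
    # The underlying values
    df.values
    # Summary statistics for all numeric columns
    df.describe()
    # Basic information about the data
    df.info()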

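3. pandas data structures: DataFrame and Series

A DataFrame is a tabular data structure.

1. Each column can hold a different value type (numeric, string, boolean, etc.)
2. It has both a row index (index) and a column index (columns)
3. It can be viewed as a dictionary of Series
4. A single row or a single column is a Series; multiple rows and columns form a DataFrame

    # Create a Series
    ser_data = [1, 'a', 5.2, 7]
    ser = pd.Series(ser_data, index=['d', 'b', 'a', 'c'])

    # Create a Series from a dict
    ser_dict_data = {'Ohio': 35, 'Texas': 72, 'Oregon': 16, 'Utah': 50}
    ser2 = pd.Series(ser_dict_data)

    # Set the name of the Series
    ser.name = 'new_name'

    # Create a DataFrame from a dict
    data = {
        'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]
    }
    df = pd.DataFrame(data)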
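The relationship between the two structures can be checked directly. This is a small sketch on a made-up three-row frame (df_demo and its values are illustrative only) showing that one column or one row comes back as a Series, while a list of columns comes back as a DataFrame:

```python
import pandas as pd

df_demo = pd.DataFrame({
    "state": ["Ohio", "Ohio", "Nevada"],
    "year": [2000, 2001, 2001],
    "pop": [1.5, 1.7, 2.4],
})

col = df_demo["year"]             # one column -> Series
row = df_demo.loc[0]              # one row -> Series indexed by the column names
sub = df_demo[["state", "pop"]]   # a list of columns -> DataFrame

print(type(col).__name__, type(row).__name__, type(sub).__name__)
```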

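4. Querying data in pandas (.loc is recommended: it can both query and overwrite)

1. df[condition or slice]: a DataFrame can be filtered by a condition or sliced directly
2. df.loc: query by row and column labels
3. df.iloc: query by row and column integer positions (integers only)
4. df.where
5. df.query

    # Returns a Series
    df['bWendu']
    df.loc[:, 'bWendu']
    # Returns a DataFrame
    df[['bWendu']]
    df.loc[:, ['bWendu']]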

1) Query by a single label value

    # Returns a single value
    df.loc['2018-01-03', 'bWendu']
    # Returns a Series
    df.loc['2018-01-03', ['bWendu', 'yWendu']]

2) Batch query with a list of values

    # Returns a Series
    df.loc[['2018-01-03', '2018-01-04', '2018-01-05'], 'bWendu']
    # Returns a DataFrame
    df.loc[['2018-01-03', '2018-01-04', '2018-01-05'], ['bWendu', 'yWendu']]

3) Range query with a label interval

    # Note: label slices in loc include both endpoints
    # Interval on the row index
    df.loc['2018-01-03':'2018-01-05', 'bWendu']
    # Interval on the column index
    df.loc['2018-01-03', 'bWendu':'fengxiang']
    # Intervals on both rows and columns
    df.loc['2018-01-03':'2018-01-05', 'bWendu':'fengxiang']

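4) Query with a conditional expression

    # Rows where the high temperature is at most 30 degrees, the low temperature is
    # at least 15 degrees, the weather is sunny ('晴') and the air quality level is 1 (best)
    df.loc[(df["bWendu"]<=30) & (df["yWendu"]>=15) & (df["tianqi"]=='晴') & (df["aqiLevel"]==1), :]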

5) Query with a function

    # Pass a lambda expression directly
    df.loc[lambda df: (df["bWendu"]<=30) & (df["yWendu"]>=15), :]

    # Write your own function: rows from September with good air quality
    def query_my_data(df):
        return df.index.str.startswith("2018-09") & (df["aqiLevel"]==1)

    df.loc[query_my_data, :]
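The query styles above, plus the "overwrite" use of .loc mentioned in the section title, can be tried on a tiny frame. The following sketch assumes a hypothetical four-day table with a string date index; the column names follow the article, while the values and the English weather labels are invented:

```python
import pandas as pd

weather = pd.DataFrame(
    {
        "bWendu": [3, 2, 0, -1],
        "yWendu": [-6, -5, -5, -8],
        "tianqi": ["sunny", "cloudy", "sunny", "cloudy"],
    },
    index=["2018-01-01", "2018-01-02", "2018-01-03", "2018-01-04"],
)

# Single label pair -> scalar
single = weather.loc["2018-01-03", "bWendu"]

# Label slices on rows and columns (both ends are included)
span = weather.loc["2018-01-02":"2018-01-03", "bWendu":"yWendu"]

# Boolean condition; each comparison needs its own parentheses
sunny = weather.loc[(weather["bWendu"] <= 3) & (weather["tianqi"] == "sunny"), :]

# The same indexer can overwrite the selected cells
weather.loc[weather["yWendu"] < -7, "bWendu"] = -2
```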

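iloc, at, iat, and ix

    # First two columns
    df.iloc[:, :2]
    # First two rows
    df.iloc[:2, ]
    df.iloc[0, 2]
    df.iloc[1:4, [0, 2]]
    df.iloc[[1, 3, 5], [0, 2]]
    df.iloc[[1, 3, 5], 0:2]

    # ix accepted both positions and labels, like loc and iloc combined,
    # but it was deprecated in pandas 0.20 and removed in pandas 1.0.
    # Use loc (labels) or iloc (positions) instead; with an integer-labelled
    # index, these loc calls select the same rows the ix calls did:
    df.loc[1:3, ['a', 'b']]         # formerly df.ix[1:3, ['a', 'b']]
    df.loc[[1, 3, 5], ['a', 'b']]   # formerly df.ix[[1, 3, 5], ['a', 'b']]

    # at: fast access to a single element by row index label and column label
    df.at[3, 'a']
    # iat: same purpose as at, but accepts integer positions only
    df.iat[3, 0]
    # df.iat[3, 'a'] raises an error because iat does not accept labels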
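To make the label versus position distinction concrete, here is a small sketch on a hypothetical frame with a non-numeric index (df_demo and its labels are made up for illustration). Note the different slice semantics: iloc excludes the end position, loc includes the end label.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"a": [10, 20, 30, 40], "b": [1.5, 2.5, 3.5, 4.5]},
    index=["w", "x", "y", "z"],
)

by_pos = df_demo.iloc[1:3, [0]]          # positions 1 and 2 (end excluded)
by_label = df_demo.loc["x":"y", ["a"]]   # labels "x" through "y" (end included)

cell_by_label = df_demo.at["y", "a"]     # scalar access by labels
cell_by_pos = df_demo.iat[2, 0]          # scalar access by integer positions
```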

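5. Adding new data columns

Direct assignment

    df.loc[:, "wencha"] = df["bWendu"] - df["yWendu"]

The df.apply method

Apply a function along an axis of the DataFrame.

    def get_wendu_type(x):
        if x["bWendu"] > 33:
            return '高温'   # high temperature
        if x["yWendu"] < -10:
            return '低温'   # low temperature
        return '常温'       # normal temperature

    # Note: set axis=1; each x is then a row, i.e. a Series whose index is the column names
    df.loc[:, "wendu_type"] = df.apply(get_wendu_type, axis=1)

The df.assign method

Assign new columns to a DataFrame. Returns a new object with all original columns in addition to new ones.

    # Several new columns can be added at once; df itself is not modified
    df.assign(
        yWendu_huashi = lambda x: x["yWendu"] * 9 / 5 + 32,  # Celsius to Fahrenheit
        bWendu_huashi = lambda x: x["bWendu"] * 9 / 5 + 32
    )

Conditional selection with a separate assignment per group

    # Create an empty column first (this uses the first method, direct assignment)
    df['wencha_type'] = ''
    df.loc[df["bWendu"]-df["yWendu"]>10, "wencha_type"] = "温差大"     # large temperature difference
    df.loc[df["bWendu"]-df["yWendu"]<=10, "wencha_type"] = "温差正常"  # normal temperature difference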
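A compact way to compare these approaches is to run them on a few rows. The sketch below assumes a hypothetical three-row frame of already-numeric temperatures (the values and the English labels are illustrative) and also shows that assign leaves the original frame untouched:

```python
import pandas as pd

temps = pd.DataFrame({"bWendu": [35, 20, 5], "yWendu": [22, 12, -12]})

# Direct assignment of a derived column
temps["wencha"] = temps["bWendu"] - temps["yWendu"]

# Row-wise classification with apply(axis=1)
def get_wendu_type(row):
    if row["bWendu"] > 33:
        return "hot"
    if row["yWendu"] < -10:
        return "cold"
    return "normal"

temps["wendu_type"] = temps.apply(get_wendu_type, axis=1)

# assign returns a new DataFrame; temps keeps its original columns
with_f = temps.assign(bWendu_huashi=lambda x: x["bWendu"] * 9 / 5 + 32)
```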

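6. Statistical functions

Covariance measures how far two variables move in the same or in opposite directions. A positive covariance means X and Y change in the same direction, and the larger it is, the stronger that co-movement. A negative covariance means X and Y move in opposite directions, and the more negative it is, the stronger the opposite movement.

The correlation coefficient measures how similar the two variables' changes are. A value of 1 means the positive similarity is at its maximum; a value of -1 means the inverse similarity is at its maximum.

    # Summary statistics for all numeric columns at once
    df.describe()

    # Statistics for a single Series
    df["bWendu"].mean()   # mean
    df["bWendu"].max()    # maximum
    df["bWendu"].min()    # minimum
    df["fengxiang"].unique()         # distinct values
    df["fengxiang"].value_counts()   # count per value

    # Relationship between all numeric columns
    # (numeric_only is available from pandas 1.5; since pandas 2.0 it must be set
    # explicitly when the DataFrame also contains non-numeric columns)
    df.cov(numeric_only=True)    # covariance matrix
    df.corr(numeric_only=True)   # correlation matrix

    # Correlation between air quality and high temperature only
    df["aqi"].corr(df["bWendu"])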
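The difference between the two measures shows up clearly on toy data. In this sketch (all numbers are invented), y moves exactly with x and z moves exactly against it; rescaling y changes the covariance but leaves the correlation coefficient at 1:

```python
import pandas as pd

df_demo = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": [2, 4, 6, 8],    # moves with x
    "z": [8, 6, 4, 2],    # moves against x
})

cov_xy = df_demo["x"].cov(df_demo["y"])     # positive: same direction
corr_xy = df_demo["x"].corr(df_demo["y"])   # perfectly linear, same direction
corr_xz = df_demo["x"].corr(df_demo["z"])   # perfectly linear, opposite direction

# Covariance depends on the scale of the data, correlation does not
cov_scaled = df_demo["x"].cov(df_demo["y"] * 10)
corr_scaled = df_demo["x"].corr(df_demo["y"] * 10)
```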

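7. Handling missing values

Step 1: when reading the Excel file, skip the leading blank rows (dropna can also be used to remove them)

    studf = pd.read_excel("./datas/student_excel/student_excel.xlsx", skiprows=2)

Step 2: detect missing values

    studf.isnull()
    # Keep only the rows whose score ("分数") is not null
    studf.loc[studf["分数"].notnull(), :]

Step 3: drop columns that are entirely empty

    studf.dropna(axis="columns", how='all', inplace=True)
    # axis: drop rows or columns, {0 or 'index', 1 or 'columns'}, default 0
    # how: 'any' drops if any value is missing, 'all' drops only if every value is missing
    # inplace: True modifies the current df, False returns a new df

Step 4: drop rows that are entirely empty

    studf.dropna(axis="index", how='all', inplace=True)

Step 5: fill the missing entries that should have a value

    studf.loc[:, '分数'] = studf['分数'].fillna(value=0)   # "分数" is the score column
    studf.loc[:, '姓名'] = studf['姓名'].ffill()           # "姓名" is the name column
    # value: the fill value, either a single value or a dict (key is the column name, value is the fill value)
    # method: 'ffill' fills with the previous non-null value (forward fill),
    #         'bfill' fills with the next non-null value (backward fill);
    #         fillna(method=...) is deprecated since pandas 2.1, so .ffill() / .bfill() are used instead
    # axis: fill along rows or columns, {0 or 'index', 1 or 'columns'}
    # inplace: True modifies the current df, False returns a new df

Step 6: save the cleaned data

    studf.to_excel("./datas/student_excel/student_excel_clean.xlsx", index=False)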
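The cleaning steps can be rehearsed without the Excel file. The sketch below builds a hypothetical in-memory table (English column names and all values are assumptions for illustration) with one empty column, one empty row, and gaps in the name and score columns, then applies steps 3 to 5:

```python
import numpy as np
import pandas as pd

studf_demo = pd.DataFrame({
    "name": ["Ann", np.nan, np.nan, np.nan, "Bob", np.nan],
    "subject": ["math", "english", "art", np.nan, "math", "english"],
    "score": [85.0, np.nan, 90.0, np.nan, 70.0, 88.0],
    "empty": [np.nan] * 6,
})

# Steps 3 and 4: drop all-empty columns, then all-empty rows
cleaned = (
    studf_demo
    .dropna(axis="columns", how="all")
    .dropna(axis="index", how="all")
)

# Step 5: a missing score becomes 0, a missing name takes the previous name
cleaned = cleaned.assign(
    score=cleaned["score"].fillna(0),
    name=cleaned["name"].ffill(),
)
print(cleaned)
```

The all-empty row is dropped before the forward fill so that it does not pick up a name and survive as a bogus record.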

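8. Sorting data

Sorting a Series: Series.sort_values(ascending=True, inplace=False)

    df["aqi"].sort_values(ascending=False)
    # ascending: True (default) sorts in ascending order, False in descending order
    # inplace: whether to modify the original Series

Sorting a DataFrame: DataFrame.sort_values(by, ascending=True, inplace=False)

    # Specify ascending and descending order per column
    df.sort_values(by=["aqiLevel", "bWendu"], ascending=[True, False])
    # by: a string or a list of strings, for single-column or multi-column sorting
    # ascending: a bool or a list of bools; a list must match the columns in by
    # inplace: whether to modify the original DataFrame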
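A multi-column sort with mixed directions is easiest to read from its result. This sketch uses a hypothetical four-row frame (values invented) and sorts by aqiLevel ascending, then by bWendu descending within each level:

```python
import pandas as pd

df_demo = pd.DataFrame({
    "aqiLevel": [2, 1, 2, 1],
    "bWendu": [10, 30, 25, 5],
})

# Returns a new, sorted DataFrame; df_demo keeps its original row order
sorted_df = df_demo.sort_values(by=["aqiLevel", "bWendu"], ascending=[True, False])
print(sorted_df)
```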

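9. String processing

1. Usage: get the str accessor of the Series first, then call functions on it
2. It works only on string columns, not on numeric columns
3. A DataFrame has no str accessor or string methods; only a Series does
4. Series.str is not the native Python string type but its own set of methods, although most of them closely resemble the native str methods

    # Get the str accessor of a Series and use the string functions
    # Replace a substring
    df["bWendu"].str.replace("℃", "")
    # Check whether each value is numeric
    df["bWendu"].str.isnumeric()
    # Get the length
    df["bWendu"].str.len()

    # startswith, contains, etc. return a boolean Series usable as a query condition
    condition = df["ymd"].str.startswith("2018-03")
    condition = df["ymd"].str.contains("2018")
    df[condition].head()

    # Chaining several str operations
    df["ymd"].str.replace("-", "").str.slice(0, 6)
    # Equivalent: slice is just slicing syntax, which can be used directly
    df["ymd"].str.replace("-", "").str[0:6]

    # Processing with regular expressions (2018年12月31日 --> 20181231)
    # Method 1: chained replace
    df["中文日期"].str.replace("年", "").str.replace("月", "").str.replace("日", "")
    # Method 2: regular-expression replace
    # (regex=True is required since pandas 2.0, where the default became regex=False)
    df["中文日期"].str.replace("[年月日]", "", regex=True)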
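The sketch below runs the main str operations on a hypothetical three-row frame (values invented; column names follow the article). It passes regex explicitly in every replace call, because the default for that parameter differs between pandas versions:

```python
import pandas as pd

df_demo = pd.DataFrame({
    "bWendu": ["3℃", "-2℃", "12℃"],
    "ymd": ["2018-01-01", "2018-03-02", "2018-03-15"],
})

# Strip the unit as a literal string, then convert the text to integers
temps = df_demo["bWendu"].str.replace("℃", "", regex=False).astype("int32")

# A boolean Series from startswith works as a row filter
march = df_demo[df_demo["ymd"].str.startswith("2018-03")]

# Chained str calls: drop the dashes, keep the first six characters (yyyymm)
year_month = df_demo["ymd"].str.replace("-", "", regex=False).str[0:6]

# Character-class pattern, so regex=True is needed
compact = pd.Series(["2018年12月31日"]).str.replace("[年月日]", "", regex=True)
```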

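10. Handling date and time data

    # Convert the date column to pandas datetimes and use it as the index
    df.set_index(pd.to_datetime(df["ymd"]), inplace=True)
    # A DatetimeIndex is essentially a list of Timestamp objects
    df.index
    df.index[0]

    # Select a specific day
    df.loc['2018-01-05']
    # Select a date range
    df.loc['2018-01-05':'2018-01-10']
    # Select by month prefix
    df.loc['2018-03']
    # Select a range of months
    df.loc["2018-07":"2018-09"]
    # Select by year prefix
    df.loc["2018"]

    # Week, month and quarter numbers
    # (DatetimeIndex.week was removed in pandas 2.0; use isocalendar().week)
    df.index.isocalendar().week
    df.index.month
    df.index.quarter

    # Statistics per week, month and quarter
    df.groupby(df.index.isocalendar().week)["bWendu"].max().plot()
    df.groupby(df.index.month)["bWendu"].max().plot()
    df.groupby(df.index.quarter)["bWendu"].max().plot()

General approaches to filling gaps in a date index

    # Method 1: DataFrame.reindex
    ## 1. Set the date column as the index, then convert the index to a datetime index
    df_date = df.set_index("pdate")
    df_date = df_date.set_index(pd.to_datetime(df_date.index))
    ## 2. Fill the missing index entries with reindex; date_range also accepts a periods parameter
    pdates = pd.date_range(start="2019-12-01", end="2019-12-05")  # the complete date sequence
    df_date_new = df_date.reindex(pdates, fill_value=0)

    # Method 2: DataFrame.resample
    ## 1. Convert the index to a datetime index first
    df_new2 = df.set_index(pd.to_datetime(df["pdate"])).drop("pdate", axis=1)
    ## 2. Resample by day with the DataFrame's resample method
    df_new2 = df_new2.resample("D").mean().fillna(0)

Note: resample changes the time frequency of the data.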
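Both gap-filling approaches can be compared on a few rows. The sketch assumes a hypothetical pv/uv table in which 2019-12-03 is missing (all values invented) and fills the gap once with reindex and once with resample:

```python
import pandas as pd

df_demo = pd.DataFrame({
    "pdate": ["2019-12-01", "2019-12-02", "2019-12-04", "2019-12-05"],
    "pv": [100, 200, 400, 500],
    "uv": [10, 20, 40, 50],
})

# Use the parsed dates as the index and drop the original text column
df_date = df_demo.set_index(pd.to_datetime(df_demo["pdate"])).drop("pdate", axis=1)

# Approach 1: reindex against the complete date range
full_range = pd.date_range(start="2019-12-01", end="2019-12-05")
by_reindex = df_date.reindex(full_range, fill_value=0)

# Approach 2: resample to daily frequency; days without rows become NaN, then 0
by_resample = df_date.resample("D").mean().fillna(0)
```

reindex needs the full range spelled out and keeps the integer dtype, while resample infers the range from the data and returns floats because of the intermediate mean.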

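11. Grouping with groupby

All aggregate statistics are computed on DataFrame and Series objects.

1. Common usage

    # Since pandas 2.0, non-numeric columns are no longer silently dropped,
    # so ask for numeric columns explicitly
    df.groupby('A').sum(numeric_only=True)
    df.groupby(['A', 'B']).mean()   # the ('A', 'B') pairs become a two-level index
    df.groupby(['A', 'B'], as_index=False).mean()
    # String names such as 'sum' are used instead of np.sum, which newer pandas warns about
    df.groupby('A')[['C', 'D']].agg(['sum', 'mean', 'std'])
    df.groupby('A')['C'].agg(['sum', 'mean', 'std'])   # statistics for a single column
    df.groupby('A').agg({"C": "sum", "D": "mean"})     # a different aggregation per column

2. Understanding the process

    # 1) Iterate over the groups of a single-column groupby
    g = df.groupby('A')
    g
    for name, group in g:
        print(name)
        print(group)
        print()

    # Output:
    # bar
    #      A      B         C         D
    # 1  bar    one -0.375789 -0.345869
    # 3  bar  three -1.564748  0.081163
    # 5  bar    two -0.202403  0.701301
    #
    # foo
    #      A    B         C         D
    # 0  foo  one  0.542903  0.788896
    # 2  foo  two -0.903407  0.428031

    g.get_group('bar')

    # 2) Iterate over the groups of a multi-column groupby
    g = df.groupby(['A', 'B'])
    # name is a tuple with 2 elements
    for name, group in g:
        print(name)
        print(group)
        print()

    g.get_group(('foo', 'one'))

    g['C']
    for name, group in g['C']:
        print(name)
        print(group)
        print(type(group))
        print()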
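A deterministic toy frame makes the grouping mechanics easy to follow. In this sketch (the data is invented; the column names A to D follow the article) the groups are aggregated, looked up by name, and enumerated:

```python
import pandas as pd

df_demo = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "B": ["one", "one", "two", "two"],
    "C": [1, 2, 3, 4],
    "D": [10.0, 20.0, 30.0, 40.0],
})

# Aggregate the numeric columns per value of A
sums = df_demo.groupby("A")[["C", "D"]].sum()

# A different aggregation for each column
per_col = df_demo.groupby("A").agg({"C": "sum", "D": "mean"})

# Iterating yields (name, sub-DataFrame) pairs, sorted by group key by default
group_names = [name for name, group in df_demo.groupby("A")]
bar_rows = df_demo.groupby("A").get_group("bar")

# With two grouping columns each name is a 2-tuple
pair_names = [name for name, group in df_demo.groupby(["A", "B"])]
```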

12. Processing other columns after groupby

    # 1. Single column, single statistic
    df.groupby("MovieID")["Rating"].mean()

    # 2. Single column, multiple statistics (1)
    df.groupby("MovieID")["Rating"].agg(
        mean="mean", max="max", min="min"
    )

    # 2. Single column, multiple statistics (2)
    df.groupby("MovieID").agg(
        {"Rating": ['mean', 'max', 'min']}
    )

    # 3. Multiple columns, multiple statistics (1)
    df.groupby("MovieID").agg(
        rating_max=("Rating", "max"),
        user_count=("UserID", lambda x: x.nunique()),
        LTB=("LTB", lambda x: ",".join(x.unique()))  # especially useful for string columns
    )

    # 3. Multiple columns, multiple statistics (2)
    (
        df.groupby("MovieID").agg(
            {
                "Rating": ['mean', 'min', 'max'],
                "UserID": lambda x: x.nunique()
            }
        )
        .reset_index()
        .rename(columns={"ymd": "月份"})  # "月份" means "month"
    )
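Named aggregation, where each output column is given as new_name=(source_column, function), produces flat and readable column names. The sketch below assumes a hypothetical five-row ratings table (values invented; column names follow the article):

```python
import pandas as pd

ratings = pd.DataFrame({
    "MovieID": [1, 1, 1, 2, 2],
    "UserID": [10, 11, 10, 10, 12],
    "Rating": [4.0, 5.0, 3.0, 2.0, 4.0],
})

summary = (
    ratings.groupby("MovieID")
    .agg(
        rating_mean=("Rating", "mean"),
        rating_max=("Rating", "max"),
        user_count=("UserID", "nunique"),   # distinct users per movie
    )
    .reset_index()                          # turn MovieID back into a column
)
print(summary)
```

By contrast, the dict-of-lists form returns a two-level column index, which usually has to be flattened before further use.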

13. Three ways to iterate over a DataFrame by row

    # 1. df.iterrows(): slow
    for idx, row in df.iterrows():
        print(idx, row)
        print(idx, row["A"], row["B"], row["C"], row["D"])

    # 2. df.itertuples(): fast
    for row in df.itertuples():
        print(row)
        print(row.Index, row.A, row.B, row.C, row.D)

    # 3. for + zip: fast
    for A, B in zip(df["A"], df["B"]):
        print(A, B)
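The three loops produce the same values; they differ only in how much work is done per row. This sketch on a hypothetical two-column frame computes the same row sums each way, and adds a vectorized version (not one of the three methods listed above) as a reference point, since column arithmetic avoids the Python-level loop entirely:

```python
import pandas as pd

df_demo = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# iterrows builds a Series for every row
via_iterrows = [row["A"] + row["B"] for _, row in df_demo.iterrows()]

# itertuples yields lightweight namedtuples
via_itertuples = [row.A + row.B for row in df_demo.itertuples()]

# zip iterates over the columns directly
via_zip = [a + b for a, b in zip(df_demo["A"], df_demo["B"])]

# Reference: vectorized column arithmetic, no explicit loop
vectorized = (df_demo["A"] + df_demo["B"]).tolist()
```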

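14. DataFrame assignment methods

    import numpy as np

    # 1. Modify a value by integer position or by label
    df.iloc[2, 2] = 1111
    df.loc['20130101', 'B'] = 2222

    # 2. Conditional assignment
    # (use a single .loc call; chained forms such as df.B[df.A > 4] = 0 trigger
    # SettingWithCopyWarning and may fail to modify df)
    df.loc[df.A > 4, 'B'] = 0
    df.loc[df.col1 == 'a', 'col1'] = 'm'

    # 3. Assign a whole row or column
    df['F'] = np.nan

    # 4. Add a Series (aligned on the index)
    df['E'] = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20130101', periods=6))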
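The assignment forms can be verified on a small date-indexed frame. The sketch below assumes a hypothetical six-day table with columns A and B (values invented) and performs the conditional assignment through one .loc call, which writes to the original frame rather than to a temporary copy:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df_demo = pd.DataFrame(
    {"A": [1, 3, 5, 7, 9, 11], "B": [10, 20, 30, 40, 50, 60]},
    index=dates,
)

df_demo.iloc[2, 1] = 1111               # by integer position (row 2, column 1)
df_demo.loc["20130101", "B"] = 2222     # by row label and column label
df_demo.loc[df_demo["A"] > 6, "B"] = 0  # conditional assignment in one .loc call

df_demo["F"] = np.nan                   # a whole new column with one value
df_demo["E"] = pd.Series([1, 2, 3, 4, 5, 6], index=dates)  # aligned on the index
```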

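15. Data transformation functions: map, apply, applymap

1. map: Series only; maps each value to another value
2. apply: on a Series, processes each value; on a DataFrame, processes each Series along an axis
3. applymap: DataFrame only; processes every element of the DataFrame

map: transforming the values of a Series

    # Convert the English stock tickers to Chinese company names
    dict_company_names = {
        "bidu": "百度",
        "baba": "阿里巴巴",
        "iq": "爱奇艺",
        "jd": "京东"
    }
    stocks["公司中文1"] = stocks["公司"].str.lower().map(dict_company_names)
    # Or
    stocks["公司中文2"] = stocks["公司"].map(lambda x: dict_company_names[x.lower()])

apply: transforming a Series or a DataFrame

    # Series.apply(function): the function receives each value
    stocks["公司中文3"] = stocks["公司"].apply(
        lambda x: dict_company_names[x.lower()])
    # DataFrame.apply(function): the function receives a Series
    stocks["公司中文4"] = stocks.apply(
        lambda x: dict_company_names[x["公司"].lower()], axis=1)

applymap: transforming every value of a DataFrame

    # Convert the numbers to integers, applied to every element
    # (applymap is deprecated since pandas 2.1 in favor of DataFrame.map)
    sub_df.applymap(lambda x: int(x))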
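The three functions can be run side by side on a toy frame. The sketch assumes a hypothetical stocks table with English column names and made-up prices; because the element-wise method is named DataFrame.map from pandas 2.1 and applymap before that, the example picks whichever one the installed version provides:

```python
import pandas as pd

stocks = pd.DataFrame({
    "company": ["BIDU", "BABA", "IQ"],
    "close": [104.32, 176.92, 16.06],
})
names = {"bidu": "Baidu", "baba": "Alibaba", "iq": "iQIYI"}

# Series.map: value -> value lookup through a dict
stocks["name_map"] = stocks["company"].str.lower().map(names)

# Series.apply: the function receives each value
stocks["name_apply"] = stocks["company"].apply(lambda x: names[x.lower()])

# DataFrame.apply with axis=1: the function receives each row as a Series
stocks["name_row"] = stocks.apply(lambda row: names[row["company"].lower()], axis=1)

# Element-wise transform: DataFrame.map where available, else applymap
sub_df = stocks[["close"]]
elementwise = getattr(sub_df, "map", None) or sub_df.applymap
as_int = elementwise(lambda x: int(x))
```

One practical difference: mapping with a dict yields NaN for keys that are missing from it, whereas the lambda-based lookups raise KeyError.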
