7 Ways to Manipulate Pandas Dataframes
Pandas is a very powerful and versatile Python data analysis library that expedites the data analysis and exploration process. One of the advantages of Pandas is that it provides a variety of functions and methods for data manipulation.
A dataframe is the core data structure of Pandas. In order to master Pandas, you should be able to play around with dataframes easily and smoothly. In this post, we will go over different ways to manipulate or edit them.
Let’s start with importing NumPy and Pandas and creating a sample dataframe.
    import numpy as np
    import pandas as pd

    values = np.random.randint(10, size=(3,7))
    df = pd.DataFrame(values, columns=list('ABCDEFG'))
    df.insert(0, 'category', ['cat1','cat2','cat3'])
    df

The first manipulation we will cover is the melt function, which converts wide dataframes (many columns) into narrow ones. Some dataframes are structured so that consecutive measurements or variables are represented as columns. In some cases, representing these columns as rows fits the task better.
    #1 melt
    df_melted = pd.melt(df, id_vars='category')
    df_melted.head()

The column specified with the id_vars parameter remains the same, and the other columns are represented under the variable and value columns.
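To make the reshaping concrete, here is a small self-contained sketch with fixed values instead of random ones. The frame `wide` and its numbers are made up for illustration. `var_name` and `value_name` are optional melt parameters that rename the default variable and value columns.

```python
import pandas as pd

# Hypothetical wide frame: one row per category, one column per measurement.
wide = pd.DataFrame({
    'category': ['cat1', 'cat2'],
    'A': [1, 2],
    'B': [3, 4],
})

# Every non-id column is turned into (variable, value) rows.
narrow = pd.melt(wide, id_vars='category')

# Optional: give the two generated columns more descriptive names.
named = pd.melt(wide, id_vars='category',
                var_name='measurement', value_name='reading')

print(narrow)
print(named)
```

Two rows and two measurement columns become four rows, one per category-measurement pair.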
The second way is the stack function which increases the index level.
If a dataframe has a simple column index, stack returns a Series whose index consists of the row-column pairs of the original dataframe. If a dataframe has a multi-level index, stack increases the index level.

Consider the following dataframe:
    #2 stack
    df_stacked = df_measurements.stack().to_frame()
    df_stacked[:6]

The stack function, in this case, returns a Series object, but we converted it to a dataframe using the to_frame function.
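The contents of df_measurements are not shown in this post, so the sketch below uses a hypothetical stand-in with a simple column index. The column names and values are assumptions chosen only to show the shape of the result.

```python
import pandas as pd

# Hypothetical stand-in for df_measurements (its real contents are not shown).
df_measurements = pd.DataFrame(
    {'temp': [20, 22], 'humidity': [55, 60]},
    index=['day1', 'day2'],
)

# With a simple column index, stack returns a Series
# indexed by (row label, column label) pairs.
stacked = df_measurements.stack()

# to_frame turns that Series into a one-column dataframe.
df_stacked = stacked.to_frame(name='value')

print(type(stacked).__name__)   # Series
print(stacked.index.nlevels)    # 2
print(df_stacked)
```

Each cell of the original frame ends up as one row, addressed by its former row and column labels.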
The unstack function, as the name indicates, reverses the operation of the stack function.
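As a quick check of that claim, the sketch below stacks a small frame and unstacks it again. The frame is made up for illustration and is not the one used elsewhere in this post.

```python
import pandas as pd

# Hypothetical frame, used only to show the round trip.
original = pd.DataFrame(
    {'humidity': [55, 60], 'temp': [20, 22]},
    index=['day1', 'day2'],
)

stacked = original.stack()     # column labels move into the inner index level
restored = stacked.unstack()   # inner index level moves back to the columns

print(restored)
```

By default, unstack pivots the innermost index level; its level parameter can select a different one.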
    #3 unstack
    df_stacked.unstack()

Adding or dropping columns is probably the manipulation we do most often. Let’s add a new column and drop some of the existing ones.
    #4 add or drop columns
    df['city'] = ['Rome','Madrid','Houston']
    df.drop(['E','F','G'], axis=1, inplace=True)
    df

We created a new column from a list. A Pandas Series or a NumPy array can also be used to create a column.
To drop columns, pass the column names and set the axis parameter to 1. Setting inplace to True modifies the dataframe in place; without it, drop returns a new dataframe and leaves the original unchanged.
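The sketch below shows the same ideas on a small made-up frame: columns built from a NumPy array and a Series, and a drop call without inplace, which returns a new frame. The `columns=` keyword is an alternative spelling of axis=1. All names and values here are illustrative.

```python
import numpy as np
import pandas as pd

# Small made-up frame for illustration.
df = pd.DataFrame({'A': [1, 2, 3], 'E': [4, 5, 6], 'F': [7, 8, 9]})

# New columns from a NumPy array and from a Series (aligned on the index).
df['from_array'] = np.array([10, 20, 30])
df['from_series'] = pd.Series(['x', 'y', 'z'], index=df.index)

# Without inplace=True, drop returns a new frame and leaves df unchanged.
trimmed = df.drop(['E', 'F'], axis=1)

# columns=... is an equivalent way to say axis=1.
also_trimmed = df.drop(columns=['E', 'F'])

print(list(df.columns))
print(list(trimmed.columns))
```

Assigning the result to a new name keeps the original frame available, which can make a chain of steps easier to debug.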
New columns are added at the end of the dataframe by default. If you want a new column to be placed at a specific location, use the insert function.
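As a minimal sketch of how the position argument works, the example below places a column between two existing ones. The frame and the column name `middle` are made up for illustration.

```python
import pandas as pd

# Made-up frame for illustration.
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# The first argument is the zero-based position of the new column,
# so 1 places it between 'A' and 'B'.
# insert modifies df in place and returns None.
df.insert(1, 'middle', [7, 8, 9])

print(list(df.columns))  # ['A', 'middle', 'B']
```

Because insert works in place, its return value should not be assigned back to the dataframe variable.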
    #5 insert
    df.insert(0, 'first_column', [4,2,5])
    df

We may also want to add or drop rows.
The append function can be used to add new rows.
    #6 add or drop rows
    new_row = {'A':4, 'B':2, 'C':5, 'D':4, 'city':'Berlin'}
    # Note: DataFrame.append was removed in pandas 2.0, so this line needs an older version.
    df = df.append(new_row, ignore_index=True)
    df

We can drop a row just like dropping columns. The only change is the axis parameter value.
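DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0. On current versions, one way to get the same result is pd.concat with a one-row dataframe. The sketch below uses a small made-up frame rather than the one built earlier in this post.

```python
import pandas as pd

# Made-up frame for illustration.
df = pd.DataFrame({'A': [4, 2, 5], 'city': ['Rome', 'Madrid', 'Houston']})

new_row = {'A': 4, 'city': 'Berlin'}

# Wrap the dict in a one-row frame and concatenate.
# ignore_index=True renumbers the rows 0..3.
with_row = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)

# Drop the row labelled 3 (the one just added); axis=0 refers to rows.
without_row = with_row.drop([3], axis=0)

print(with_row)
print(without_row)
```

Note that drop takes index labels, not positions; the two coincide here only because ignore_index produced a default 0..n-1 index.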
    df.drop([3], axis=0, inplace=True)
    df

Another modification of dataframes can be achieved with the pivot_table function. Consider the following dataframe with 30 rows:
    import random

    A = np.random.randint(10, size=30)
    B = np.random.randint(10, size=30)
    city = random.sample(['Rome', 'Houston', 'Berlin']*10, 30)
    cat = random.sample(['cat1', 'cat2', 'cat3']*10, 30)
    df = pd.DataFrame({'A':A, 'B':B, 'city':city, 'cat':cat})
    df.head()

The pivot_table function can also be considered a way to look at the dataframe from a different perspective. It is used to explore the relationships among variables by presenting the data in different formats.
    #7 pivot_table
    df.pivot_table(index='cat', columns='city', aggfunc='mean')

The returned dataframe contains the mean values for each city-cat pair.
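Since the dataframe above is random, the means differ on every run. The sketch below uses a tiny made-up frame whose means can be checked by hand. The `values` parameter, not used in the call above, restricts the table to a single column.

```python
import pandas as pd

# Made-up data, small enough to verify the means by hand.
df = pd.DataFrame({
    'A':    [1, 3, 5, 7],
    'B':    [2, 2, 4, 4],
    'city': ['Rome', 'Rome', 'Berlin', 'Rome'],
    'cat':  ['cat1', 'cat1', 'cat1', 'cat2'],
})

# values='A' limits the table to column A; without it,
# every remaining numeric column (A and B) is aggregated.
table = df.pivot_table(index='cat', columns='city', values='A', aggfunc='mean')

print(table)
```

The cat1-Rome cell is the mean of 1 and 3. The cat2-Berlin pair has no rows, so it shows up as NaN; the fill_value parameter can replace such gaps.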
We have covered 7 ways to edit or manipulate a dataframe. Some of them are so common that you are probably using them almost every day. There will also be cases in which you need to use the rare ones.
I think the success and prevalence of Pandas come from the versatile, powerful, and easy-to-use functions to manipulate and analyze data. There are almost always multiple ways to do a task with Pandas. Since a big portion of the time spent on a data science project is spent during data cleaning and preprocessing steps, it is highly encouraged to learn Pandas.
Thank you for reading. Please let me know if you have any feedback.
Translated from: https://towardsdatascience.com/7-ways-to-manipulate-pandas-dataframes-f5ec03fe944c