A few days ago, I was working on a risk management project when I came across a simple problem that I had never seen before.
I needed to group data by the different IDs of a transaction dataset. This dataset contains 8,250 transactions for 1,125 customers of a bank.
Here are the first five rows:
As you can see, these first rows all share the same ID, and I needed to group all the transactions according to the ID that created them.
Without a solution in my head, I started looking through my data science books, above all the ones about data analysis and feature engineering, but I didn't find anything.
For this reason I started searching Stack Overflow, until I found something that looked like my situation, although I had to adapt it to the characteristics of the project.
You can check the project here. To get around this problem you need the well-known `groupby` from pandas.
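For anyone new to it, `groupby` implements the split-apply-combine pattern: split the rows by a key, apply an aggregation to each group, and combine the results. A minimal sketch with made-up data:

```python
import pandas as pd

# Toy frame: three rows, two group keys
df = pd.DataFrame({'key': ['a', 'a', 'b'], 'value': [1, 2, 5]})

# Split rows by key, sum each group, and combine into a new frame
totals = df.groupby('key')['value'].sum().reset_index()
print(totals)
```

`reset_index()` turns the group key back into an ordinary column, which is what lets us merge the grouped frames later.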
First, we are going to group the prod_code values according to their ID (if you want to know more, you can visit the GitHub repository through the link above). We are going to collect the various prod_code values into a list with the following code:
```python
import numpy as np
import pandas as pd

df_0 = payment_data.groupby(['id'])['prod_code'].apply(list).reset_index()
df_0.head(10)
```

This is the output:
As we can see, we now have all the values of prod_code (one for every transaction made) grouped by ID.
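The same list aggregation can be checked on a small made-up frame (`payment_data` itself isn't public, so the values here are invented):

```python
import pandas as pd

# Invented stand-in for the real payment_data
toy = pd.DataFrame({
    'id': [1, 1, 1, 2, 2],
    'prod_code': [10, 17, 10, 5, 22],
})

# Collect every prod_code of each id into a list (duplicates are kept)
grouped = toy.groupby(['id'])['prod_code'].apply(list).reset_index()
print(grouped)
```

Note that `apply(list)` keeps repeated codes; if you only wanted the distinct products per customer, you would use `apply(set)` or drop duplicates first.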
The next feature that I needed to group was new_balance, where I decided to take the mean of all the new balances for every ID:
```python
df_1 = payment_data.groupby(['id'])['new_balance'].apply(np.mean, axis=0).reset_index()
df_1.head(10)
```

Now, I had to group all the different types of overall payment: OVD_t1, OVD_t2, and OVD_t3. First, I grouped the sum of every type with its corresponding ID, and then merged all the types together into one dataframe.
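A quick sanity check on invented numbers shows that `apply(np.mean)` gives the same per-group result as pandas' built-in `.mean()` aggregation, which is the more idiomatic spelling:

```python
import numpy as np
import pandas as pd

# Made-up balances for two customers
toy = pd.DataFrame({
    'id': [1, 1, 2, 2],
    'new_balance': [100.0, 300.0, 50.0, 150.0],
})

via_apply = toy.groupby(['id'])['new_balance'].apply(np.mean).reset_index()
via_mean = toy.groupby(['id'])['new_balance'].mean().reset_index()
print(via_apply)
```

Both give one mean balance per ID; `.mean()` (and likewise `.sum()`) skips the Python-level `apply` machinery, so it is also faster on large frames.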
For this I had to use the function reduce:
```python
from functools import reduce

df_OVD_t1 = payment_data.groupby(['id'])['OVD_t1'].apply(np.sum, axis=0).reset_index()
df_OVD_t2 = payment_data.groupby(['id'])['OVD_t2'].apply(np.sum, axis=0).reset_index()
df_OVD_t3 = payment_data.groupby(['id'])['OVD_t3'].apply(np.sum, axis=0).reset_index()

OVD_dfs = [df_OVD_t1, df_OVD_t2, df_OVD_t3]
df_2 = reduce(lambda left, right: pd.merge(left, right, on='id'), OVD_dfs)
df_2.head(10)
```

The final feature that I needed to group was pay_normal. Here, the only operation I applied was np.sum, because I needed the total number of times the payment was normal for every client.
```python
df_3 = payment_data.groupby(['id'])['pay_normal'].apply(np.sum, axis=0).reset_index()
df_3.head(10)
```

The final step for this task was to merge all the dataframes I had created into a single one. For this, I used the reduce function again, in the same form as in the grouping of the OVD types before.
```python
dfs = [df_0, df_1, df_2, df_3]
df_final = reduce(lambda left, right: pd.merge(left, right, on='id'), dfs)
df_final.head(10)
```

And that's it. I hope this tutorial helps you, and thanks for reading this far.
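For readers who want to run the steps without the original dataset, here is a self-contained sketch of the whole pipeline on invented data. The column names follow the article, the numbers are made up, and the three OVD groupbys are collapsed into one multi-column `.sum()` as a shortcut for the separate frames merged with reduce above:

```python
from functools import reduce

import pandas as pd

# Invented stand-in for payment_data
payment_data = pd.DataFrame({
    'id':          [1, 1, 2, 2, 2],
    'prod_code':   [10, 17, 5, 22, 5],
    'new_balance': [100.0, 300.0, 60.0, 90.0, 150.0],
    'OVD_t1':      [0, 1, 0, 0, 2],
    'OVD_t2':      [1, 0, 0, 1, 0],
    'OVD_t3':      [0, 0, 1, 0, 0],
    'pay_normal':  [1, 0, 1, 1, 0],
})

# One grouped frame per feature, all indexed by id
df_0 = payment_data.groupby(['id'])['prod_code'].apply(list).reset_index()
df_1 = payment_data.groupby(['id'])['new_balance'].mean().reset_index()
df_2 = payment_data.groupby(['id'])[['OVD_t1', 'OVD_t2', 'OVD_t3']].sum().reset_index()
df_3 = payment_data.groupby(['id'])['pay_normal'].sum().reset_index()

# Merge everything on id, exactly as in the article
dfs = [df_0, df_1, df_2, df_3]
df_final = reduce(lambda left, right: pd.merge(left, right, on='id'), dfs)
print(df_final)
```

The result is one row per customer, with a product list, a mean balance, the three OVD totals, and the pay_normal count.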
If you liked the article, I invite you to follow me on dev.to, GitHub and LinkedIn!
Source: https://medium.com/swlh/group-your-data-by-index-f5760f0f3e3f