A few days ago, I was working on a risk management project when I came across a simple problem that I had never seen before.
I needed to group data by the different IDs of a transaction dataset. This dataset contains 8,250 transactions for 1,125 customers of a bank.
Here are the first five rows:
As you can see, these first rows all share the same ID, and I needed to group all the transactions according to the ID that created them.
Without a solution in my head, I started looking through my data science books, above all the ones about data analysis and feature engineering, but I didn't find anything.
For this reason I started searching Stack Overflow, until I found something that looked like my situation, although I had to adapt it to the characteristics of the project.
You can check the project here. To get around this problem you need the well-known `groupby` from pandas.
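For anyone new to it, `groupby` implements the split-apply-combine pattern: split the rows by a key, apply an aggregation to each group, and combine the results. A minimal sketch with made-up data:

```python
import pandas as pd

# Toy frame: three rows, two group keys
df = pd.DataFrame({'key': ['a', 'a', 'b'], 'value': [1, 2, 5]})

# Split rows by key, sum each group, and combine into a new frame
totals = df.groupby('key')['value'].sum().reset_index()
print(totals)
```

`reset_index()` turns the group key back into an ordinary column, which is what lets us merge the grouped frames later.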
First, we are going to group the prod_code values according to their ID (if you want to know more, you can visit the GitHub repository through the link above). We are going to collect the various prod_code values into a list with the following code:
```python
import numpy as np
import pandas as pd

df_0 = payment_data.groupby(['id'])['prod_code'].apply(list).reset_index()
df_0.head(10)
```

This is the output:
As we can see, we now have all the values of prod_code (one for every transaction made) grouped by ID.
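The same list aggregation can be checked on a small made-up frame (`payment_data` itself isn't public, so the values here are invented):

```python
import pandas as pd

# Invented stand-in for the real payment_data
toy = pd.DataFrame({
    'id': [1, 1, 1, 2, 2],
    'prod_code': [10, 17, 10, 5, 22],
})

# Collect every prod_code of each id into a list (duplicates are kept)
grouped = toy.groupby(['id'])['prod_code'].apply(list).reset_index()
print(grouped)
```

Note that `apply(list)` keeps repeated codes; if you only wanted the distinct products per customer, you would use `apply(set)` or drop duplicates first.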
The next feature that I needed to group was new_balance, where I decided to take the mean of all the new balances for every ID:
```python
df_1 = payment_data.groupby(['id'])['new_balance'].apply(np.mean, axis=0).reset_index()
df_1.head(10)
```

Now, I had to group all the different types of overall payment: OVD_t1, OVD_t2, and OVD_t3. First, I grouped the sum of every type with its corresponding ID, and then merged all the types together into one dataframe.
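A quick sanity check on invented numbers shows that `apply(np.mean)` gives the same per-group result as pandas' built-in `.mean()` aggregation, which is the more idiomatic spelling:

```python
import numpy as np
import pandas as pd

# Made-up balances for two customers
toy = pd.DataFrame({
    'id': [1, 1, 2, 2],
    'new_balance': [100.0, 300.0, 50.0, 150.0],
})

via_apply = toy.groupby(['id'])['new_balance'].apply(np.mean).reset_index()
via_mean = toy.groupby(['id'])['new_balance'].mean().reset_index()
print(via_apply)
```

Both give one mean balance per ID; `.mean()` (and likewise `.sum()`) skips the Python-level `apply` machinery, so it is also faster on large frames.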
For this I had to use the function reduce:
```python
from functools import reduce

df_OVD_t1 = payment_data.groupby(['id'])['OVD_t1'].apply(np.sum, axis=0).reset_index()
df_OVD_t2 = payment_data.groupby(['id'])['OVD_t2'].apply(np.sum, axis=0).reset_index()
df_OVD_t3 = payment_data.groupby(['id'])['OVD_t3'].apply(np.sum, axis=0).reset_index()

OVD_dfs = [df_OVD_t1, df_OVD_t2, df_OVD_t3]
df_2 = reduce(lambda left, right: pd.merge(left, right, on='id'), OVD_dfs)
df_2.head(10)
```

The final feature that I needed to group was pay_normal. Here, the only operation I applied was np.sum, because I needed the total number of times the payment was normal for every client.
```python
df_3 = payment_data.groupby(['id'])['pay_normal'].apply(np.sum, axis=0).reset_index()
df_3.head(10)
```

The final step for this task was to merge all the dataframes I had created into a single one. For this, I used the reduce function again, in the same form as in the grouping of the OVD types before.
```python
dfs = [df_0, df_1, df_2, df_3]
df_final = reduce(lambda left, right: pd.merge(left, right, on='id'), dfs)
df_final.head(10)
```

And that's it. I hope this tutorial helps you, and thanks for reading this far.
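For readers who want to run the steps without the original dataset, here is a self-contained sketch of the whole pipeline on invented data. The column names follow the article, the numbers are made up, and the three OVD groupbys are collapsed into one multi-column `.sum()` as a shortcut for the separate frames merged with reduce above:

```python
from functools import reduce

import pandas as pd

# Invented stand-in for payment_data
payment_data = pd.DataFrame({
    'id':          [1, 1, 2, 2, 2],
    'prod_code':   [10, 17, 5, 22, 5],
    'new_balance': [100.0, 300.0, 60.0, 90.0, 150.0],
    'OVD_t1':      [0, 1, 0, 0, 2],
    'OVD_t2':      [1, 0, 0, 1, 0],
    'OVD_t3':      [0, 0, 1, 0, 0],
    'pay_normal':  [1, 0, 1, 1, 0],
})

# One grouped frame per feature, all indexed by id
df_0 = payment_data.groupby(['id'])['prod_code'].apply(list).reset_index()
df_1 = payment_data.groupby(['id'])['new_balance'].mean().reset_index()
df_2 = payment_data.groupby(['id'])[['OVD_t1', 'OVD_t2', 'OVD_t3']].sum().reset_index()
df_3 = payment_data.groupby(['id'])['pay_normal'].sum().reset_index()

# Merge everything on id, exactly as in the article
dfs = [df_0, df_1, df_2, df_3]
df_final = reduce(lambda left, right: pd.merge(left, right, on='id'), dfs)
print(df_final)
```

The result is one row per customer, with a product list, a mean balance, the three OVD totals, and the pay_normal count.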
If you liked the article, I invite you to follow me on dev.to, GitHub and LinkedIn!
Source: https://medium.com/swlh/group-your-data-by-index-f5760f0f3e3f