Python machine learning prediction
Why create quantiles
The data used
Create quantiles on the train set
Apply the quantiles on new data
Conclusion
When we build a project involving a machine learning component, we use metrics (e.g., AUC, RMSE) to choose which model fits our data best. Then we use the sklearn predict() function to get the regression value, or the sklearn predict_proba() function to get the probability that the observation belongs to a class (class 1 in a binary classification). In both cases the output is continuous, and it has to be turned into a business strategy.
Building a business strategy is generally more complex than “do an action if probability ≥ x, else do nothing” and less complex than “take a different action for each possible value of the continuous output”. This is why creating quantiles makes sense. To make this concrete, here are two examples, one for a regression task and one for a binary classification task.
Example 1: Imagine we are working in the marketing department of a retail clothing company, where the objective is to predict how many dollars each customer in the database will spend during the year. Based on that, the commercial strategy would be to give access to private sales to the 15% of clients who will spend the most, a discount to the next 15% to 40%, free shipping to the 40% to 60%, and a newsletter to the remaining clients.
Example 2: Imagine we are working in the collection department of a house rental company, where we have a database of clients who are late on their payments and where the objective is to predict which clients will still be late at the end of the month. Moreover, we know that if we do nothing, 50% of these clients will still be late. Given capacity and costs, the agents can call 10% of the population, send an email to 30%, send an SMS to 20%, and do nothing for the remaining 40%.
As a result, creating quantiles makes it possible to apply these strategies to each part of the distribution and to leverage the added value of the machine learning model.
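To make Example 1 more tangible, here is a minimal sketch of how the four tiers could be assigned from a ranked prediction. The data is synthetic, and the column names (predicted_spend, top_share, action) and action labels are hypothetical; only the 15% / 40% / 60% cut points come from the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in for the model's predicted yearly spend (not real data).
customers = pd.DataFrame(
    {"predicted_spend": rng.gamma(shape=2.0, scale=150.0, size=1000)}
)

# Share of customers ranked at or above each one: close to 0 = top spender.
customers["top_share"] = customers["predicted_spend"].rank(
    pct=True, ascending=False, method="first"
)

# Cut points from Example 1: top 15%, 15-40%, 40-60%, remaining 40%.
customers["action"] = pd.cut(
    customers["top_share"],
    bins=[0.0, 0.15, 0.40, 0.60, 1.0],
    labels=["private_sale", "discount", "free_shipping", "newsletter"],
)

print(customers["action"].value_counts(normalize=True).sort_index())
```

The same pattern would cover Example 2 by ranking on the predicted probability and swapping in the 10% / 30% / 20% / 40% split.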
For this article I will use a dataset related to sports gambling, where the objective is to classify “win vs not win” given 45 features. I divided the dataset into train, validation and out-of-time test sets, where the out-of-time test set is the 2019/2020 season, with 732 observations. Train and validation are a 70%/30% split, with 3262 and 1398 observations respectively. The share of target = 1 (win) is 44.3%, 44.28% and 43.44%, so it is well balanced across the three sets. I trained a gradient boosting algorithm using the CatBoost library and plotted the distribution of predict_proba for each dataset.
Visually, the distributions of probabilities look the same, and the two-sample Kolmogorov-Smirnov test confirmed it. As a result, we can create our quantiles.
Be careful! Having the same distribution across datasets is a mandatory assumption. If it does not hold, the quantiles will not generalize and the business strategies will fail.
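One way to run this check is scipy's two-sample Kolmogorov-Smirnov test. The sketch below uses synthetic probabilities drawn from beta distributions as a stand-in for the real predict_proba outputs (only the sample sizes are taken from the article), plus a deliberately drifted sample to show what a failure looks like.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
# Synthetic probabilities standing in for predict_proba outputs.
proba_train = rng.beta(2.0, 2.5, size=3262)
proba_valid = rng.beta(2.0, 2.5, size=1398)    # same distribution as train
proba_shifted = rng.beta(4.0, 2.0, size=732)   # deliberately drifted sample

results = {}
for name, sample in [("validation", proba_valid), ("shifted", proba_shifted)]:
    res = ks_2samp(proba_train, sample)
    results[name] = (res.statistic, res.pvalue)
    print(f"{name}: KS statistic={res.statistic:.3f}, p-value={res.pvalue:.4f}")
```

A small p-value means the hypothesis of identical distributions is rejected, in which case the train boundaries should not be reused on that sample.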
In this use case, I chose to divide my distribution into 7 quantiles (100/7 = 14.29%). A first intuition would be to create a bin every 14.29% of the probability range, but when the distribution is not uniform, we would have very few observations at the extremities and almost all of them in the middle.
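The difference between the two approaches can be seen on a small experiment. This is only an illustration on synthetic, bell-shaped scores: pandas.cut with an integer splits the observed range into equal-width bins, while pandas.qcut aims at equal counts.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Synthetic bell-shaped scores, not the article's data.
y_proba = pd.Series(rng.beta(5.0, 5.0, size=3262), name="y_proba")

equal_width = pd.cut(y_proba, bins=7, labels=False)  # 7 bins of equal width
equal_freq = pd.qcut(y_proba, q=7, labels=False)     # 7 bins of equal count

counts = pd.DataFrame(
    {
        "equal_width": equal_width.value_counts().sort_index(),
        "equal_frequency": equal_freq.value_counts().sort_index(),
    }
)
print(counts)
```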
So the idea is that, given a distribution, we want an equal number of observations in each quantile. The pandas function for this task is pandas.qcut(x, q, labels=None, retbins=False, precision=3, duplicates='raise'), where x is a 1-D array or a Series; q is the number of quantiles; labels sets a name for each quantile (e.g. Low, Medium, High if q=3), and if labels=False the integer index of the quantile is returned; retbins=True also returns the array of boundaries of the quantiles.
In this step, we create the feature ‘quantile’, and ‘edges’ is the array obtained with the argument retbins=True. The values of ‘edges’ span the interval [0.13653858 ; 0.88398169], which is a huge limitation. Indeed, a probability lies in [0 ; 1], so it is perfectly possible for the validation and test datasets to contain an observation with y_proba = 0.094 or y_proba = 0.925. This is why I modified the array by replacing the boundary 0.13653858 with -0.001 (so that 0 is included) and the boundary 0.88398169 with 1, while keeping the other quantile boundaries as they are in the array. Without this change, the quantile of y_proba = 0.094 or 0.925 would be NaN.
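A minimal sketch of this step, on synthetic scores restricted to roughly the same range as in the article; the actual boundaries of the real dataset are not reproduced, and the variable name ed for the widened copy is an assumption.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
# Synthetic train scores that never reach 0 or 1.
train = pd.DataFrame({"y_proba": rng.uniform(0.14, 0.88, size=3262)})

# labels=False returns the integer bin; retbins=True also returns the boundaries.
train["quantile"], edges = pd.qcut(
    train["y_proba"], q=7, labels=False, retbins=True
)

# Widen the outer boundaries on a copy so every probability in [0, 1] has a bin.
ed = edges.copy()
ed[0] = -0.001  # just below 0 so that 0 itself is included
ed[-1] = 1.0

new_scores = pd.Series([0.094, 0.5, 0.925])
out_of_range = pd.cut(new_scores, bins=edges, labels=False)  # original edges
widened = pd.cut(new_scores, bins=ed, labels=False)          # widened edges
print(out_of_range.tolist())
print(widened.tolist())
```

With the original edges, the first and last scores fall outside every bin and come back as NaN; with the widened edges they land in the lowest and highest bins.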
Moreover, qcut assigns the value 0 to the lowest quantile of x, in ascending order, but in some industries (like credit scoring) the order is descending. That is why I re-ordered the quantiles so that quantile 0 holds the highest probabilities.
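One simple way to flip the order, shown here as a sketch on synthetic scores rather than the exact code used for the article, is to subtract the qcut label from q - 1.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
y_proba = pd.Series(rng.uniform(0.0, 1.0, size=700))  # synthetic scores

q = 7
ascending = pd.qcut(y_proba, q=q, labels=False)  # 0 = lowest probabilities
descending = (q - 1) - ascending                 # 0 = highest probabilities

# Mean probability per re-ordered quantile: it should decrease from 0 to 6.
mean_by_quantile = y_proba.groupby(descending).mean()
print(mean_by_quantile)
```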
Now that we have an array of boundaries, let’s transform it into an IntervalIndex.
“The key feature of IntervalIndex is that looking up an indexer should return all intervals in which the indexer's values fall. FloatIndex is a poor substitute, because of floating point precision issues, and because I don't want to label values by a single point.” (Stephan Hoyer)
I used the pandas function pandas.IntervalIndex.from_breaks(breaks, closed='right', name=None, copy=False, dtype=None), where breaks is the 1-D array ‘ed’ previously defined, closed='right' indicates which side of each interval is closed (right means that the right boundary is included in the interval), and dtype is the data type to use (in our case I left it as None and the inferred type was float64).
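As an illustration, the call could look like the sketch below. The inner boundaries are made-up values; only the widened outer boundaries (-0.001 and 1) follow the article.

```python
import numpy as np
import pandas as pd

# Hypothetical boundaries: 8 break points define 7 intervals.
ed = np.array([-0.001, 0.30, 0.38, 0.44, 0.50, 0.57, 0.66, 1.0])

interval_index = pd.IntervalIndex.from_breaks(ed, closed="right")
print(interval_index)

# Position of the interval that contains a given probability.
print(interval_index.get_loc(0.0))   # 0 is covered thanks to the -0.001 boundary
print(interval_index.get_loc(0.30))  # a right boundary belongs to its own interval
```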
Now, the IntervalIndex object can be passed to the pandas function pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise', ordered=True). Here, x is the 1-D array or Series to bin, and bins is the binning criterion, which can be an integer, a sequence of scalars or an IntervalIndex; right, labels, include_lowest, precision and ordered are ignored if bins is an IntervalIndex. Here, I create a new feature ‘quantile_interval’, which applies the cut of y_proba based on the IntervalIndex. The resulting train dataset is shown in Figure 1.
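Putting the previous steps together on the train set might look like the following sketch. The scores are synthetic and the variable names (train, ed, interval_index) are assumptions, not necessarily those of the original code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
train = pd.DataFrame({"y_proba": rng.beta(2.0, 2.5, size=3262)})  # synthetic

# Equal-frequency bins on train, then widen the outer boundaries on a copy.
train["quantile"], edges = pd.qcut(
    train["y_proba"], q=7, labels=False, retbins=True
)
ed = edges.copy()
ed[0], ed[-1] = -0.001, 1.0

interval_index = pd.IntervalIndex.from_breaks(ed, closed="right")

# Each probability is labelled with the interval it falls in.
train["quantile_interval"] = pd.cut(train["y_proba"], bins=interval_index)
print(train.head())
```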
To finish the creation of quantiles, I store the quantile values and their associated intervals from the IntervalIndex in a DataFrame named dict_inter_quantile (see Figure 2).
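A lookup table of this kind can be built directly from the IntervalIndex. In this sketch the boundaries are hypothetical, and storing the intervals as an ordered categorical is my own choice, so that the column has the same dtype as the output of pandas.cut; the descending numbering follows the re-ordering described earlier.

```python
import numpy as np
import pandas as pd

# Hypothetical boundaries (outer values widened to -0.001 and 1).
ed = np.array([-0.001, 0.30, 0.38, 0.44, 0.50, 0.57, 0.66, 1.0])
interval_index = pd.IntervalIndex.from_breaks(ed, closed="right")
n_bins = len(interval_index)

dict_inter_quantile = pd.DataFrame(
    {
        # One row per interval, stored with the same dtype that pd.cut returns.
        "quantile_interval": pd.Categorical.from_codes(
            np.arange(n_bins), categories=interval_index, ordered=True
        ),
        # Descending numbering: the interval with the highest probabilities is 0.
        "quantile": np.arange(n_bins)[::-1],
    }
)
print(dict_inter_quantile)
```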
Figure 1. Train dataset once it is cut with bins=Interval_Index
Figure 2. dict_inter_quantile DataFrame
Now that we have seen how to create quantiles on the train set, let’s see how to apply them to new data such as the validation set, the test set, and any new data.
This part is very simple because all of the work has already been done. We just have to use the cut function to get the interval of each probability from the IntervalIndex, and then join the dataset to dict_inter_quantile to get the value of the quantile.
Here is an example with the test set (the process is the same for validation or any new data); its output is shown in Figure 3.
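The two steps, cut then join, can be sketched as follows. The boundaries and the six test scores are made up for illustration, and the lookup table is rebuilt here so that the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical boundaries learned on train (outer values widened).
ed = np.array([-0.001, 0.30, 0.38, 0.44, 0.50, 0.57, 0.66, 1.0])
interval_index = pd.IntervalIndex.from_breaks(ed, closed="right")
dict_inter_quantile = pd.DataFrame(
    {
        "quantile_interval": pd.Categorical.from_codes(
            np.arange(len(interval_index)), categories=interval_index, ordered=True
        ),
        "quantile": np.arange(len(interval_index))[::-1],  # 0 = highest interval
    }
)

# Made-up test scores, including values outside the original train range.
test = pd.DataFrame({"y_proba": [0.0, 0.094, 0.41, 0.62, 0.925, 1.0]})

# Step 1: find the train interval each new probability falls in.
test["quantile_interval"] = pd.cut(test["y_proba"], bins=interval_index)
# Step 2: join on the interval to retrieve the quantile value.
test = test.merge(dict_inter_quantile, on="quantile_interval", how="left")
print(test)
```

Because the outer boundaries were widened, the scores 0.0, 0.094, 0.925 and 1.0 all receive a quantile instead of NaN.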
Figure 3. Test dataset with the associated quantile and quantile_interval, based on the IntervalIndex of the train set.
Now that we have the quantile of the predicted probability for each new observation, we can verify whether the observations are spread across the quantiles in the same proportions on validation and test as on train.
Figure 4. value_counts on the train, validation and test datasets
So what? Figure 4 shows that the percentage of observations in each quantile is very similar across train, validation and test, which means that the data didn’t shift and, as a result, we can apply the boundaries found on the train dataset to the test dataset. Moreover, we can apply an operational strategy as in the examples at the beginning, check the percentage of y = 1 in each quantile to see whether the quantiles are ranked in the same order and whether the percentage is similar across datasets, try to improve some part of the distribution, and so on.
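Both checks, the share of observations per quantile and the percentage of y = 1 per quantile, can be computed in a few lines. The sketch below relies on synthetic data where the target is drawn from the score itself, so the numbers only illustrate the mechanics; the helper make_scores is hypothetical, and bins are numbered in ascending order here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

def make_scores(n):
    """Synthetic scores with a target drawn from them (illustration only)."""
    y_proba = rng.beta(2.0, 2.5, size=n)
    return pd.DataFrame({"y_proba": y_proba, "y": rng.binomial(1, y_proba)})

train, valid, test = make_scores(3262), make_scores(1398), make_scores(732)

# Boundaries come from train only, with widened outer values.
_, edges = pd.qcut(train["y_proba"], q=7, retbins=True)
ed = edges.copy()
ed[0], ed[-1] = -0.001, 1.0
bins = pd.IntervalIndex.from_breaks(ed, closed="right")

share, win_rate = {}, {}
for name, df in [("train", train), ("validation", valid), ("test", test)]:
    codes = pd.cut(df["y_proba"], bins=bins).cat.codes  # 0 = lowest interval
    share[name] = codes.value_counts(normalize=True).sort_index()
    win_rate[name] = df["y"].groupby(codes).mean()

share_df = pd.DataFrame(share)
win_rate_df = pd.DataFrame(win_rate)
print(share_df.round(3))     # share of observations per quantile
print(win_rate_df.round(3))  # percentage of y = 1 per quantile
```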
We saw why it can be relevant to create quantiles from a prediction in order to apply a business strategy, how to create them from a distribution, and how to use them on new data. This only makes sense if your distributions are the same. You can check this with the Kolmogorov-Smirnov test or with a more exotic method.
Translated from: https://towardsdatascience.com/why-how-to-create-quantiles-from-a-machine-learning-prediction-python-code-da9071622db2