    Outlier Detection


    Outlier detection is a natural preliminary step in exploratory data analysis and has a major impact on the results of most data science projects. The process consists of removing exceptional, out-of-range values in order to obtain better predictions. Easy-to-use outlier detection methods include min-max analysis, Z-score extreme value analysis, the interquartile range, the Extended Isolation Forest, and the Local Outlier Factor. For an aspiring "unicorn" data scientist, mastering the most effective outlier detection methods is a must-have skill. In this article, we will review outlier detection methods used by Kaggle winners, each of which can be implemented in a few lines of Python.

    We will analyze our sweet chocolate bar rating dataset, which you can find here.

    Photo by Massimo Adami on Unsplash

    It is a challenging dataset that contains more than 2,800 features after categorical encoding.
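
    The isolation forest and local outlier factor sections further below work on an encoded frame called df_enc, which is prepared outside this article. A minimal sketch of how such a frame might be produced with pandas one-hot encoding (the column selection here is an assumption, not the dataset's actual preprocessing) is:

    import pandas as pd

    # Load the preprocessed chocolate bar ratings (same path as in the code below)
    df = pd.read_csv('../input/preprocess-choc/dfn.csv')

    # One-hot encode every object-typed (categorical) column; this is what blows the
    # feature count up to well over 2,800 columns
    categorical_cols = df.select_dtypes(include='object').columns
    df_enc = pd.get_dummies(df, columns=list(categorical_cols))

    print(df_enc.shape)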

    1. Min-Max Value Analysis

    The challenge is to find the method that extracts outliers most efficiently, ultimately producing a model that fits the data points and reduces the overall prediction error. Here is an example of a min-max analysis of the chocolate rating dataset, a complex multidimensional dataset.

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from scipy.stats import norm

    df = pd.read_csv('../input/preprocess-choc/dfn.csv')

    # The upper limit is the mean plus 3 sigma
    upper_limit = df.rating.mean() + 3 * df.rating.std()
    # The lower limit is the mean minus 3 sigma
    lower_limit = df.rating.mean() - 3 * df.rating.std()

    Now let's check for outliers:

    df[(df.rating > upper_limit) | (df.rating < lower_limit)]

    # Now we will visualise the data without outliers
    new_data = df[(df.rating < upper_limit) & (df.rating > lower_limit)]

    # Number of outliers removed
    df.shape[0] - new_data.shape[0]

    Output


    Outlier detection using the min-max method

    This method detects 6 outliers. Let's plot it:
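
    The plot itself is not reproduced here; a minimal sketch of how the before/after distributions could be drawn (assuming the df, new_data, upper_limit and lower_limit variables defined above) is:

    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(1, 2, figsize=(14, 4))

    # Rating distribution with the 3-sigma boundaries marked
    axes[0].hist(df.rating, bins=30)
    axes[0].axvline(lower_limit, color='r', linestyle='--')
    axes[0].axvline(upper_limit, color='r', linestyle='--')
    axes[0].set_title('Before outlier removal')

    # The same distribution once the out-of-limits rows are dropped
    axes[1].hist(new_data.rating, bins=30)
    axes[1].set_title('After outlier removal')

    plt.show()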

    Data distribution before and after outlier handling

    Photo by Maciej Gerszewski on Unsplash

    2. Extreme Value Analysis using Z-Score


    The Z-score tells us how many standard deviations away from the mean a data point lies. In our case the mean is 3.198 and the standard deviation is 0.434, so the Z-score for the data point 4.5 is:

    (4.5 − 3.198 (mean)) / 0.434 (std) ≈ 3.0
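
    The same single-value check in Python, using the mean and standard deviation quoted above:

    mean, std = df.rating.mean(), df.rating.std()   # roughly 3.198 and 0.434
    z_score = (4.5 - mean) / std
    print(round(z_score, 3))                        # about 3.0 with the values above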

    Now we will calculate the Z-score for every data point and display the outliers:

    df['zscore'] = (df.rating - df.rating.mean()) / df.rating.std()
    df

    # Displaying the outliers with respect to the Z-score
    df[(df.zscore < -3) | (df.zscore > 3)]

    This method detects 7 outliers.

    Photo by Yulia Shinova on Unsplash

    3. Interquartile Range (IQR)

    The interquartile range IQR (or 50% midspread) measures the spread between the upper quartile (Q3, 75th percentile) and the lower quartile (Q1, 25th percentile): IQR = Q3 − Q1. From it, lower and upper boundaries can be defined, and any data point falling outside those boundaries is treated as an outlier.

    Let's code it:

    Q1 = df.rating.quantile(0.25)
    Q3 = df.rating.quantile(0.75)
    Q1, Q3

    Output


    3.0, 3.5

    # Now we will calculate the IQR
    IQR = Q3 - Q1
    IQR

    Output


    0.5


    Q1 tells us that 25% of the chocolate ratings fall below 3.0, and Q3 tells us that 75% of the ratings fall below 3.5.
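
    A quick way to verify this interpretation directly on the data (ties in the discrete ratings mean the fractions are only approximately 0.25 and 0.75):

    # Fraction of ratings at or below each quartile
    print((df.rating <= Q1).mean())
    print((df.rating <= Q3).mean())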

    Now we will define the upper and lower limits of the chocolate rating distribution:

    LOWER_LIMIT = Q1 - 1.5 * IQR
    UPPER_LIMIT = Q3 + 1.5 * IQR
    LOWER_LIMIT, UPPER_LIMIT

    Output


    2.25, 4.25


    Now we will display the outlier ratings in a DataFrame:

    df[(df.rating < LOWER_LIMIT) | (df.rating > UPPER_LIMIT)]

    The algorithm found 36 outliers. Let’s plot the results:

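    The figure is not reproduced here; a boxplot is a natural way to visualize the IQR fences, and a minimal sketch (assuming the df, LOWER_LIMIT and UPPER_LIMIT variables defined above) would be:

    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(8, 3))

    # The box spans Q1 to Q3 and the whiskers extend 1.5 * IQR beyond it by default,
    # so the points drawn outside the whiskers are exactly the flagged ratings
    ax.boxplot(df.rating, vert=False)
    ax.axvline(LOWER_LIMIT, color='r', linestyle='--')
    ax.axvline(UPPER_LIMIT, color='r', linestyle='--')
    ax.set_xlabel('Chocolate rating')
    plt.show()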

    Data distribution before and after outlier handling

    Photo by Helena Yankovska on Unsplash

    4. Extended Isolation Forest Method

    The standard Isolation Forest isolates points with vertical or horizontal (axis-parallel) branch cuts; the Extended Isolation Forest instead selects a random slope and intercept for each cut over the training data. The drawback of axis-parallel branch cuts is that some data points are not properly detected, which the extended variant is designed to address.
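
    The code below relies on scikit-learn's IsolationForest (the eif package is imported but left unused). For the extended variant itself, a minimal sketch, assuming the eif 2.x API (iso.iForest and compute_paths) and the feature matrix X built in the code below, might look like this:

    import numpy as np
    import eif as iso

    # eif expects a plain float ndarray
    X_arr = np.asarray(X, dtype=np.float64)

    # ExtensionLevel = n_features - 1 gives fully extended (non-axis-parallel) cuts
    ext_forest = iso.iForest(X_arr, ntrees=200, sample_size=256,
                             ExtensionLevel=X_arr.shape[1] - 1)

    # Higher anomaly scores indicate more isolated (more anomalous) points
    anomaly_scores = ext_forest.compute_paths(X_in=X_arr)

    # Flag the top 2% most anomalous points, mirroring contamination=0.02 below
    threshold = np.quantile(anomaly_scores, 0.98)
    print((anomaly_scores >= threshold).sum(), 'points flagged as outliers')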

    Let's code it:

    # Import the libraries
    from scipy import stats
    from sklearn.ensemble import IsolationForest
    from sklearn.neighbors import LocalOutlierFactor
    import matplotlib.dates as md
    from scipy.stats import norm
    %matplotlib inline
    import seaborn as sns
    from sklearn.model_selection import train_test_split
    sns.set_style("whitegrid")  # possible choices: white, dark, whitegrid, darkgrid, ticks
    import eif as iso
    import matplotlib.pyplot as plt
    import plotly.express as px
    import plotly.graph_objs as go
    import plotly.figure_factory as ff
    from plotly import tools
    from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
    pd.set_option('float_format', '{:f}'.format)
    pd.set_option('max_columns', 250)
    pd.set_option('max_rows', 150)

    # Define the features and the target
    a = df_enc.loc[:, ~df_enc.columns.duplicated()]
    b = a.drop('rating', axis=1)
    X = b.iloc[:, 0:11]
    y = a.iloc[:, 2]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

    # Define the isolation forest
    clf = IsolationForest(max_samples='auto', random_state=1, contamination=0.02)

    # Fit the model and predict the outlier labels
    preds = clf.fit_predict(X)

    # Store the outlier labels and anomaly scores
    df['isolationForest_outliers'] = preds
    df['isolationForest_outliers'] = df['isolationForest_outliers'].astype(str)
    df['isolationForest_scores'] = clf.decision_function(X)

    # Count the chocolate rating outliers
    print(df['isolationForest_outliers'].value_counts())

    Output


     1    2179
    -1      45
    Name: isolationForest_outliers, dtype: int64

    The algorithm found 45 outliers. Let’s plot the results:


    fig, ax = plt.subplots(figsize=(30, 7))
    ax.set_title('Extended Outlier Factor Scores Outlier Detection', fontsize=15, loc='center')
    plt.scatter(X.iloc[:, 0], X.iloc[:, 1], color='g', s=3., label='Data points')

    # Scale the anomaly scores into circle radii
    radius = (df['isolationForest_scores'].max() - df['isolationForest_scores']) / (df['isolationForest_scores'].max() - df['isolationForest_scores'].min())
    plt.scatter(X.iloc[:, 0], X.iloc[:, 1], s=2000 * radius, edgecolors='r', facecolors='none', label='Outlier scores')

    plt.axis('tight')
    legend = plt.legend(loc='upper left')
    legend.legendHandles[0]._sizes = [10]
    legend.legendHandles[1]._sizes = [20]
    plt.show()

    Parametric circle outlier detection using extended outlier factor scores

    This plot indicates outliers through the size of the circles: the larger circles mark the points detected as outliers.

    Photo by Kristiana Pinne on Unsplash

    5. Local Outlier Factor Method

    The Local Outlier Factor is the only method in this list that compares the densities of data points. If a point's local density is significantly lower than the density around its neighbors, it is labeled -1 and can therefore be considered an outlier.

    # Import the libraries
    from scipy import stats
    from sklearn.neighbors import LocalOutlierFactor
    import matplotlib.dates as md
    from scipy.stats import norm
    %matplotlib inline
    import seaborn as sns
    from sklearn.model_selection import train_test_split
    sns.set_style("whitegrid")  # possible choices: white, dark, whitegrid, darkgrid, ticks
    import eif as iso
    import matplotlib.pyplot as plt
    import plotly.express as px
    import plotly.graph_objs as go
    import plotly.figure_factory as ff
    from plotly import tools
    from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
    pd.set_option('float_format', '{:f}'.format)
    pd.set_option('max_columns', 250)
    pd.set_option('max_rows', 150)

    # Define the features and the target
    a = df_enc.loc[:, ~df_enc.columns.duplicated()]
    b = a.drop('rating', axis=1)
    X = b.iloc[:, 0:11]
    y = a.iloc[:, 2]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

    # Define the Local Outlier Factor model
    clf = LocalOutlierFactor(n_neighbors=11)

    # Fit the model and predict the outlier labels
    y_pred = clf.fit_predict(X)

    # Store the outlier labels and scores
    df['localOutlierFactor_outliers'] = y_pred.astype(str)
    df['localOutlierFactor_scores'] = clf.negative_outlier_factor_

    # Count the chocolate rating outliers
    print(df['localOutlierFactor_outliers'].value_counts())

    Output


     1    2161
    -1      63
    Name: localOutlierFactor_outliers, dtype: int64

    The algorithm found 63 outliers. Let’s plot the results:


    fig, ax = plt.subplots(figsize=(30, 7))
    ax.set_title('Local Outlier Factor Scores Outlier Detection', fontsize=15, loc='center')
    plt.scatter(X.iloc[:, 0], X.iloc[:, 1], color='g', s=3., label='Data points')

    # Scale the (negative) outlier factor scores into circle radii
    radius = (df['localOutlierFactor_scores'].max() - df['localOutlierFactor_scores']) / (df['localOutlierFactor_scores'].max() - df['localOutlierFactor_scores'].min())
    plt.scatter(X.iloc[:, 0], X.iloc[:, 1], s=2000 * radius, edgecolors='r', facecolors='none', label='Outlier scores')

    plt.axis('tight')
    legend = plt.legend(loc='upper left')
    legend.legendHandles[0]._sizes = [10]
    legend.legendHandles[1]._sizes = [20]
    plt.show()

    Parametric circle outlier detection using local outlier factor scores

    As before, the size of the circle encodes the outlier score: the larger circles mark the points detected as outliers.

    Photo by Antonio Castellano on Unsplash

    If you have some spare time, I'd recommend reading this:

    https://towardsdatascience.com/5-ways-to-detect-outliers-that-every-data-scientist-should-know-python-code-70a54335a623


    Sum Up


    Refer to this link:

    https://jovian.ml/yeonathan/5-best-outliers-detection-methods-2020


    for the complete outlier detection of the chocolate bar dataset using these methods.

    Using several outlier detection methods is essential in data science. This brief overview is a reminder of the importance of combining multiple outlier elimination methods in order to extract the best data values, and of sharing useful documentation.
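
    As a closing illustration of that idea, here is a minimal sketch (assuming the label columns created in the sections above) that counts how many of the two model-based methods flag each chocolate rating:

    # '-1' marks an outlier for both the isolation forest and the local outlier factor
    flags = (df[['isolationForest_outliers', 'localOutlierFactor_outliers']] == '-1').sum(axis=1)

    # Points flagged by both methods are the strongest candidates for elimination
    consensus_outliers = df[flags == 2]
    print(len(consensus_outliers), 'ratings flagged by both methods')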

    I hope you enjoy it. Keep exploring!

    Photo by Arsenie Posuponko on Unsplash

    Translated from: https://towardsdatascience.com/what-you-always-wanted-to-know-about-outliers-detection-but-never-dared-to-ask-f040d9ca64d9
