Z-score for anomaly detection

    Technology 2022-08-01

    Most of the time I write longer articles on data science topics but recently I’ve been thinking about writing small, bite-sized pieces around specific concepts, algorithms and applications. This is my first attempt in that direction, hoping people will like these pieces.

    In today’s “small-bite” I’m writing about Z-score in the context of anomaly detection.

    Anomaly detection is the process of identifying unexpected data, events or behavior that require examination. It is a well-established field within data science, and there are many algorithms for detecting anomalies in a dataset, depending on the data type and business context. Z-score is probably the simplest of them: it can rapidly screen candidates for further examination to determine whether they are suspicious or not.

    What is Z-score

    Simply speaking, Z-score is a statistical measure that tells you how far a data point is from the rest of the dataset. In more technical terms, Z-score tells you how many standard deviations a given observation is away from the mean.

    For example, a Z-score of 2.5 means that the data point lies 2.5 standard deviations above the mean. Because it is that far from the center, it may be flagged as an outlier/anomaly, depending on the threshold you choose.
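
    As a quick worked illustration with made-up numbers (the mean of 50, the standard deviation of 10 and the observation of 75 are assumptions chosen only to keep the arithmetic easy; they are not taken from the example dataset used later), writing x for the observation, \mu for the mean and \sigma for the standard deviation:

```latex
z = \frac{x - \mu}{\sigma} = \frac{75 - 50}{10} = 2.5
```

    By the same logic, an observation below the mean gets a negative Z-score.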

    How it works

    Z-score is a parametric measure that takes two parameters: the mean and the standard deviation.

    Once you calculate these two parameters, finding the Z-score of a data point x is easy: subtract the mean from x and divide the result by the standard deviation, i.e. z = (x - mean) / standard deviation.

    Note that the mean and standard deviation are calculated over the whole dataset, whereas x represents a single data point. That means every data point has its own Z-score, while the mean and standard deviation stay the same for all of them.
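
    To make that last point concrete, here is a minimal sketch that computes one mean and one standard deviation and then a separate Z-score for each point. It uses Python's built-in statistics module purely to stay dependency-free; the function name z_scores and the sample values are illustrative assumptions, and pstdev is the population standard deviation.

```python
from statistics import fmean, pstdev


def z_scores(values):
    """Return the Z-score of every value, using one shared mean and sd."""
    mean = fmean(values)   # computed once for the whole dataset
    sd = pstdev(values)    # population standard deviation, also computed once
    return [(x - mean) / sd for x in values]


# Illustrative values only (an assumption, not data from the article)
sample = [2, 4, 4, 4, 5, 5, 7, 9]
for x, z in zip(sample, z_scores(sample)):
    print(f"x = {x}  ->  z = {z:+.2f}")
```

    If every value in the dataset is identical, the standard deviation is 0 and the division fails, so real code would need to guard against that case.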

    Example

    Below is a Python implementation of Z-score with a few sample data points. I've added a comment to each step of the code to explain what's going on.

    # import numpy
    import numpy as np

    # random data points to calculate z-score
    data = [5, 5, 5, -99, 5, 5, 5, 5, 5, 5, 88, 5, 5, 5]

    # calculate mean
    mean = np.mean(data)

    # calculate standard deviation
    sd = np.std(data)

    # determine a threshold
    threshold = 2

    # create empty list to store outliers
    outliers = []

    # detect outliers
    for i in data:
        z = (i - mean) / sd          # calculate z-score
        if abs(z) > threshold:       # identify outliers
            outliers.append(i)       # add to the empty list

    # print outliers
    print("The detected outliers are: ", outliers)
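
    One detail worth knowing about the listing above: np.std defaults to ddof=0, i.e. the population standard deviation. The sketch below uses the standard library's pstdev (population) and stdev (sample, dividing by n - 1) as stand-ins to show how that choice shifts the Z-scores of the two extreme points. The helper name z_of is an assumption made for illustration.

```python
from statistics import fmean, pstdev, stdev

data = [5, 5, 5, -99, 5, 5, 5, 5, 5, 5, 88, 5, 5, 5]
mean = fmean(data)


def z_of(x, sd):
    """Z-score of a single point x for a given standard deviation."""
    return (x - mean) / sd


# Compare the two definitions of the standard deviation side by side
for label, sd in [("population (n)", pstdev(data)), ("sample (n - 1)", stdev(data))]:
    print(f"{label:>15}: sd = {sd:.2f}, "
          f"z(-99) = {z_of(-99, sd):.2f}, z(88) = {z_of(88, sd):.2f}")
```

    The sample standard deviation is slightly larger, so the Z-scores shrink a little; with these 14 points both extreme values still exceed a threshold of 2 either way.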

    Caution and conclusion

    If you play with these data you will notice a few things:

    There are 14 data points and Z-score correctly detected 2 outliers [-99 and 88]. However, if you remove five of the regular data points (the 5s) from the list, it detects only 1 outlier [-99]. That means you need a sufficiently large dataset for Z-score to work.

    In large production datasets, Z-score works best if the data are normally distributed (i.e. follow a Gaussian distribution).

    I used an arbitrary threshold of 2, beyond which all data points are flagged as outliers. As a rule of thumb, thresholds of 2, 2.5, 3 or 3.5 are commonly used.

    Finally, Z-score is sensitive to extreme values, because the mean and the standard deviation it is built on are themselves sensitive to extreme values.
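
    The observations above can be reproduced with a small helper. This is a sketch under stated assumptions: zscore_outliers is a hypothetical name, the standard library's fmean and pstdev stand in for np.mean and np.std, and "removing five data points" is taken to mean dropping five of the 5s.

```python
from statistics import fmean, pstdev


def zscore_outliers(values, threshold=2.0):
    """Return the values whose absolute Z-score exceeds the threshold."""
    mean = fmean(values)
    sd = pstdev(values)  # population standard deviation, like np.std's default
    return [x for x in values if abs((x - mean) / sd) > threshold]


full = [5, 5, 5, -99, 5, 5, 5, 5, 5, 5, 88, 5, 5, 5]   # the 14 original points
reduced = [5, 5, 5, -99, 5, 5, 88, 5, 5]                # five of the 5s removed

print(zscore_outliers(full))                   # [-99, 88]
print(zscore_outliers(reduced))                # [-99]
print(zscore_outliers(full, threshold=2.5))    # [-99]
print(zscore_outliers(full, threshold=3))      # []

# The two extreme values drag the mean away from the typical value of 5
typical = [x for x in full if x == 5]
print(fmean(full), fmean(typical))             # 3.5 5.0
```

    The last calls also show why the threshold matters: at 2.5 the point 88 is no longer flagged, and at 3 nothing is, because the two extreme values inflate the very standard deviation they are measured against.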

    Hope this was useful, feel free to get in touch via Twitter.

    Translated from: https://towardsdatascience.com/z-score-for-anomaly-detection-d98b0006f510
