Skewness z-score

Technology, 2022-08-01

Most of the time I write longer articles on data science topics, but recently I’ve been thinking about writing small, bite-sized pieces around specific concepts, algorithms and applications. This is my first attempt in that direction, and I hope people will like these pieces.

In today’s “small-bite” I’m writing about Z-score in the context of anomaly detection.

Anomaly detection is the process of identifying unexpected data, events or behavior that require examination. It is a well-established field within data science, and there are many algorithms for detecting anomalies in a dataset, depending on the data type and business context. Z-score is probably the simplest of them: it can rapidly screen candidates for further examination to determine whether they are suspicious or not.

What is Z-score?

Simply put, Z-score is a statistical measure that tells you how far a data point is from the rest of the dataset. In more technical terms, Z-score tells you how many standard deviations a given observation is from the mean.

For example, a Z-score of 2.5 means that the data point is 2.5 standard deviations away from the mean. Since that is far from the center, the point is flagged as an outlier/anomaly.

How does it work?

Z-score is a parametric measure that takes two parameters: mean and standard deviation.

Once you calculate these two parameters, finding the Z-score of a data point x is easy: z = (x - mean) / standard deviation.

Note that the mean and standard deviation are calculated for the whole dataset, whereas x represents a single data point. That means every data point has its own Z-score, while the mean and standard deviation stay the same for all of them.
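
As a minimal sketch of that point, the snippet below computes all Z-scores at once with NumPy. The sample values and the name `z_scores` are illustrative assumptions, not taken from the article. Note that `np.std` returns the population standard deviation (`ddof=0`) by default.

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = x.mean()  # one mean for the whole dataset
sd = x.std()     # one (population) standard deviation for the whole dataset

# Broadcasting gives every data point its own Z-score
z_scores = (x - mean) / sd

print("mean:", mean, "sd:", sd)
print("z-scores:", z_scores)
```

For these values the mean is 5 and the standard deviation is 2, so the value 9 gets a Z-score of 2 and the value 2 gets a Z-score of -1.5.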

Example

Below is a Python implementation of Z-score with a few sample data points. I’m adding notes to each line of code to explain what’s going on.

# import numpy
import numpy as np

# random data points to calculate z-score
data = [5, 5, 5, -99, 5, 5, 5, 5, 5, 5, 88, 5, 5, 5]

# calculate mean
mean = np.mean(data)

# calculate standard deviation
sd = np.std(data)

# determine a threshold
threshold = 2

# create empty list to store outliers
outliers = []

# detect outliers
for i in data:
    z = (i - mean) / sd  # calculate z-score
    if abs(z) > threshold:  # identify outliers
        outliers.append(i)  # add to the empty list

# print outliers
print("The detected outliers are: ", outliers)

Caution and conclusion

If you play with these data you will notice a few things:

There are 14 data points, and Z-score correctly detected 2 outliers [-99 and 88]. However, if you remove five data points from the list, it detects only 1 outlier [-99]. That means you need a large enough dataset for Z-score to work.
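
The effect of dataset size can be reproduced with a short sketch. The helper name `detect_outliers` is hypothetical, and the snippet assumes that the five removed points are five of the ordinary values (the 5s); the article does not say which points were dropped.

```python
import numpy as np

def detect_outliers(data, threshold=2):
    """Return the values whose absolute z-score exceeds the threshold."""
    values = np.asarray(data, dtype=float)
    z = (values - values.mean()) / values.std()
    return values[np.abs(z) > threshold].tolist()

# The 14 original data points
full = [5, 5, 5, -99, 5, 5, 5, 5, 5, 5, 88, 5, 5, 5]

# Assumption: five of the ordinary 5s are removed, leaving 9 points
reduced = [5, 5, 5, -99, 5, 5, 88, 5, 5]

print(detect_outliers(full))     # [-99.0, 88.0]
print(detect_outliers(reduced))  # [-99.0]
```

With fewer ordinary points, the two extreme values carry more weight in the standard deviation (it grows from about 35.5 to about 44.3), so the Z-score of 88 drops from about 2.38 to about 1.93 and falls below the threshold.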

In large production datasets, Z-score works best if the data are normally distributed (that is, they follow a Gaussian distribution).
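
One way to see why normality matters: under a normal distribution, the share of points beyond a given Z-score is known in advance. The sketch below computes that share with the standard library; the function name `two_sided_tail` is an assumption made for illustration.

```python
import math

def two_sided_tail(k):
    """P(|Z| > k) for a standard normal variable Z."""
    return math.erfc(k / math.sqrt(2))

# Share of normally distributed points expected beyond each cutoff
for k in (2, 2.5, 3, 3.5):
    print(f"|z| > {k}: {two_sided_tail(k):.4%}")
```

For normally distributed data, roughly 4.55% of points lie beyond a Z-score of 2 and roughly 0.27% beyond 3. If the data are far from normal, these percentages no longer hold and a fixed cutoff becomes harder to interpret.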

I used an arbitrary threshold of 2, beyond which all data points are flagged as outliers. The rule of thumb is to use 2, 2.5, 3 or 3.5 as the threshold.
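
To get a feel for how much the threshold matters, here is a small sketch that applies each rule-of-thumb value to the sample data from the example. The dictionary name `flagged` is an assumption made for illustration.

```python
import numpy as np

# Sample data from the example above
data = np.array([5, 5, 5, -99, 5, 5, 5, 5, 5, 5, 88, 5, 5, 5], dtype=float)
z = (data - data.mean()) / data.std()

# Collect the flagged values for each rule-of-thumb threshold
flagged = {}
for threshold in (2, 2.5, 3, 3.5):
    flagged[threshold] = data[np.abs(z) > threshold].tolist()
    print(threshold, flagged[threshold])
```

On this small sample, only a threshold of 2 flags both extreme values. At 2.5 only -99 is flagged, and at 3 or 3.5 nothing is flagged at all.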

Finally, Z-score is sensitive to extreme values, because the mean and standard deviation it is built on are themselves sensitive to extreme values.
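
This sensitivity is easy to check on the sample data. The sketch below compares the mean and standard deviation with and without the two extreme values; selecting the ordinary points with `data == 5` only works for this toy dataset and is an assumption made for illustration.

```python
import numpy as np

# Sample data from the example above
data = np.array([5, 5, 5, -99, 5, 5, 5, 5, 5, 5, 88, 5, 5, 5], dtype=float)

# Toy-data shortcut: the ordinary points are exactly the 5s
typical = data[data == 5]

print("with extremes:   ", data.mean(), data.std())
print("without extremes:", typical.mean(), typical.std())
```

Just two extreme values pull the mean from 5 down to 3.5 and inflate the standard deviation from 0 to about 35.5, and those distorted values are what every Z-score is measured against.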

Hope this was useful. Feel free to get in touch via Twitter.

Translated from: https://towardsdatascience.com/z-score-for-anomaly-detection-d98b0006f510
