skewness z-score
Most of the time I write longer articles on data science topics but recently I’ve been thinking about writing small, bite-sized pieces around specific concepts, algorithms and applications. This is my first attempt in that direction, hoping people will like these pieces.
In today’s “small-bite” I’m writing about Z-score in the context of anomaly detection.
Anomaly detection is the process of identifying unexpected data, events or behavior that require some examination. It is a well-established field within data science, and there are many algorithms for detecting anomalies in a dataset, depending on data type and business context. Z-score is probably the simplest of them: it can rapidly screen candidates for further examination to determine whether they are suspicious or not.
What is a Z-score?
Simply speaking, Z-score is a statistical measure that tells you how far a data point is from the rest of the dataset. In more technical terms, Z-score tells you how many standard deviations a given observation is away from the mean.
For example, a Z-score of 2.5 means that the data point is 2.5 standard deviations away from the mean. Because it is that far from the center, it is flagged as a possible outlier/anomaly.
How does it work?
Z-score is a parametric measure that takes two parameters: the mean and the standard deviation.
Once you have calculated these two parameters, finding the Z-score of a data point x is easy:

z = (x - mean) / standard deviation
Note that the mean and standard deviation are calculated for the whole dataset, whereas x represents a single data point. That means every data point has its own Z-score, whereas the mean and standard deviation remain the same everywhere.
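To make this concrete, here is a minimal sketch that computes one Z-score per data point from a single shared mean and standard deviation. The sample values are hypothetical, picked only so that the mean (5) and the population standard deviation (2) come out as round numbers.

```python
import numpy as np

# Hypothetical sample, chosen so the mean and standard deviation are round numbers.
values = np.array([2, 4, 4, 4, 5, 5, 7, 9])

mean = values.mean()  # one mean for the whole dataset
sd = values.std()     # one (population) standard deviation for the whole dataset

# Broadcasting applies z = (x - mean) / sd to every data point at once.
z_scores = (values - mean) / sd

for x, z in zip(values, z_scores):
    print(f"x = {x}: z = {z:+.2f}")
```

With these values, the point 9 sits two standard deviations above the mean (z = 2.0), while the two 5s sit exactly on the mean (z = 0.0).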
Example
Below is a Python implementation of Z-score with a few sample data points. I've added a comment to each step of the code to explain what's going on.
```python
# import numpy
import numpy as np

# random data points to calculate z-score
data = [5, 5, 5, -99, 5, 5, 5, 5, 5, 5, 88, 5, 5, 5]

# calculate mean
mean = np.mean(data)

# calculate standard deviation
sd = np.std(data)

# determine a threshold
threshold = 2

# create empty list to store outliers
outliers = []

# detect outliers
for i in data:
    z = (i - mean) / sd         # calculate z-score
    if abs(z) > threshold:      # identify outliers
        outliers.append(i)      # add to the empty list

# print outliers
print("The detected outliers are: ", outliers)
```

Caution and conclusion
If you play with these data you will notice a few things:
- There are 14 data points, and Z-score correctly detected 2 outliers [-99 and 88]. However, if you remove five data points from the list (five of the 5s), it detects only 1 outlier [-99]. That means Z-score needs a sufficiently large dataset to work reliably.
- In large production datasets, Z-score works best if the data are normally distributed (i.e. follow a Gaussian distribution).
- I used an arbitrary threshold of 2, beyond which all data points are flagged as outliers. The rule of thumb is to use 2, 2.5, 3 or 3.5 as the threshold.
- Finally, Z-score is sensitive to extreme values, because the mean itself is sensitive to extreme values.
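The first and third observations are easy to check for yourself. The sketch below wraps the same calculation in a small helper (the name `zscore_outliers` is mine, not from any library) and runs it on the 14-point sample, on a copy with five of the 5s removed, and with stricter thresholds. It assumes the population standard deviation, which is what `np.std` computes by default.

```python
import numpy as np

def zscore_outliers(data, threshold=2.0):
    """Return the points whose absolute Z-score exceeds the threshold."""
    values = np.asarray(data, dtype=float)
    sd = values.std()
    if sd == 0:
        return []  # all points identical: Z-score is undefined, nothing to flag
    z = (values - values.mean()) / sd
    return values[np.abs(z) > threshold].tolist()

full = [5, 5, 5, -99, 5, 5, 5, 5, 5, 5, 88, 5, 5, 5]
reduced = [5, -99, 5, 5, 5, 5, 88, 5, 5]  # same data with five of the 5s removed

print(zscore_outliers(full))       # [-99.0, 88.0]
print(zscore_outliers(reduced))    # [-99.0]
print(zscore_outliers(full, 2.5))  # [-99.0]
print(zscore_outliers(full, 3.0))  # []
```

In the reduced sample the two extreme values make up a larger share of the data, so they inflate the standard deviation enough that 88 falls just under the threshold of 2. Raising the threshold on the full sample has a similar effect: at 2.5 only -99 is flagged, and at 3 nothing is.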
I hope this was useful. Feel free to get in touch via Twitter.
Translated from: https://towardsdatascience.com/z-score-for-anomaly-detection-d98b0006f510