K-Means Clustering for Big Data
K-means is one of the most widely used unsupervised clustering methods.
The K-means algorithm clusters the data at hand by trying to separate samples into K groups of equal variance, minimizing a criterion known as the inertia or within-cluster sum-of-squares. This algorithm requires the number of clusters to be specified. It scales well to a large number of samples and has been used across a large range of application areas in many different fields.
The k-means algorithm divides a set of N samples (stored in a data matrix X) into K disjoint clusters C, each described by the mean μj of the samples in the cluster. The means are commonly called the cluster “centroids”.
The K-means algorithm falls into the family of unsupervised machine learning algorithms/methods. For this family of models, the researcher only needs a dataset of observations at hand, without also needing the labels/classes of those observations. Unsupervised learning studies how systems can infer a function that describes a hidden structure from unlabeled data.
Now let's explore the mathematical foundations of the algorithm.
Given a set of observations (x1, x2, …, xn), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k (≤ n) sets S = {S1, S2, …, Sk} so as to minimize the within-cluster sum of squares (WCSS) (i.e. variance). Formally, the objective is defined as follows:
$$\underset{S}{\operatorname{arg\,min}} \sum_{i=1}^{k} \sum_{x \in S_i} \left\lVert x - \mu_i \right\rVert^2$$

where μi is the mean of the points in group/cluster Si.
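As a concrete illustration, here is a minimal NumPy sketch that evaluates this objective for a given assignment of samples to centroids (the function name wcss is our own, not from any library):

```python
import numpy as np

def wcss(X, labels, centroids):
    """Within-cluster sum of squares (inertia) for a given assignment."""
    # For each cluster i, sum the squared distances of its members
    # to the cluster mean, then add the per-cluster totals together.
    return sum(np.sum((X[labels == i] - c) ** 2)
               for i, c in enumerate(centroids))
```

K-means searches over assignments (and the induced means) to drive this quantity down.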
The algorithm can also be understood through the concept of Voronoi diagrams.
First, the Voronoi diagram of the points is calculated using the current centroids. Initially, the centroids are usually chosen randomly but this depends on the underlying package / library / software used.
Each segment in the Voronoi diagram becomes a separate cluster.
Secondly, the centroids are updated to the mean of each segment. The algorithm then repeats this until a stopping criterion is fulfilled.
The algorithm stops when the relative decrease in the objective function between iterations is less than the given tolerance (ε) value or when centroids move (in space) less than the tolerance.
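Putting the assignment and update steps together, a minimal sketch of the iteration in plain NumPy might look as follows (the function lloyd_kmeans, its random initialization, and the tolerance handling are illustrative assumptions, not code from any library):

```python
import numpy as np

def lloyd_kmeans(X, k, tol=1e-4, max_iter=300, seed=0):
    """Minimal k-means (Lloyd's algorithm) sketch."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct samples at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: each sample joins its nearest centroid
        # (this is exactly the Voronoi partition induced by the centroids)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its segment
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                new_centroids[j] = members.mean(axis=0)
        # Stopping criterion: centroids moved less than the tolerance
        if np.linalg.norm(new_centroids - centroids) < tol:
            return new_centroids, labels
        centroids = new_centroids
    return centroids, labels
```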
As stated in the introduction, the K-Means algorithm requires the number of clusters to be specified beforehand. In case that there is no valid assumption about the underlying number of clusters in the data, this task can be difficult to tackle.
Fortunately, there are methods for estimating the optimum number of clusters in our data, such as the Silhouette Coefficient or the Elbow method. If the ground truth labels are not known, evaluation must be performed using the model itself. In this article we will use only the Silhouette Coefficient, not the Elbow method, which is much simpler. Briefly, the Elbow method looks at the percentage of variance explained as a function of the number of clusters: one should choose a number of clusters such that adding another cluster does not give much better modeling of the data.
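For intuition, here is a quick sketch of the Elbow method (not used further in this article); it assumes a data matrix X such as the one we generate below, and uses scikit-learn's inertia_ attribute, which stores the WCSS of a fitted model:

```python
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Fit K-means for a range of K values and record the inertia (WCSS)
ks = range(1, 11)
inertias = [KMeans(n_clusters=k, random_state=10).fit(X).inertia_ for k in ks]

# The "elbow" is the K after which the curve flattens out
plt.plot(ks, inertias, marker='o')
plt.xlabel("Number of clusters K")
plt.ylabel("Inertia (WCSS)")
plt.show()
```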
A higher Silhouette Coefficient score relates to a model with better defined clusters. The Silhouette Coefficient is defined for each sample and is composed of two scores:
a: The mean distance between a sample and all other points in the same class.
b: The mean distance between a sample and all other points in the next nearest cluster.
The Silhouette Coefficient s for a single sample is then defined as:
$$s = \frac{b - a}{\max(a, b)}$$

Finally, the Total Silhouette Coefficient for a set of samples is given as the mean of the Silhouette Coefficient for each sample.
The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters. Negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar.
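To make the formula concrete, here is a tiny worked example with made-up distances (the numbers are purely illustrative):

```python
# Hypothetical values for one sample:
a = 0.8  # mean distance to points in its own cluster
b = 2.4  # mean distance to points in the next nearest cluster
s = (b - a) / max(a, b)
print(s)  # 0.666..., i.e. the sample sits comfortably in its cluster
```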
For this example we will create artificial data, i.e., artificial clusters. This way we will know in advance the ground truth, i.e., the exact number of clusters in our dataset.
Let's start by importing the required Python libraries:
```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np
```

Next, let's create some artificial data containing 500 samples, 2 features/variables, and K=4 clusters.
```python
# Generating the data
# This particular setting has one distinct cluster and 3 clusters placed close together.
X, y = make_blobs(n_samples=500,
                  n_features=2,
                  centers=4,
                  cluster_std=1,
                  center_box=(-10.0, 10.0),
                  shuffle=True,
                  random_state=1)
```

We know that we have K=4 clusters in the data; however, in order to understand how the Silhouette Score works, we will fit the model using a range of different numbers of clusters.
Each time, we will estimate the Silhouette Score and also plot the data with the final (converged) centroids. All of this is done by the following code:
```python
range_n_clusters = [3, 4, 5]

for n_clusters in range_n_clusters:
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)

    # The 1st subplot is the silhouette plot
    # The silhouette coefficient can range from -1, 1 but in this example all
    # lie within [-0.1, 1]
    ax1.set_xlim([-0.1, 1])
    # The (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])

    # Initialize the clusterer with n_clusters value and a random generator
    # seed of 10 for reproducibility.
    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(X)

    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(X, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)

    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(X, cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]

        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    # The vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

    # 2nd Plot showing the actual clusters formed
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(X[:, 0], X[:, 1], marker='.', s=30, lw=0, alpha=0.7,
                c=colors, edgecolor='k')

    # Labeling the clusters
    centers = clusterer.cluster_centers_
    # Draw white circles at cluster centers
    ax2.scatter(centers[:, 0], centers[:, 1], marker='o',
                c="white", alpha=1, s=200, edgecolor='k')

    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1,
                    s=50, edgecolor='k')

    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")

    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')

plt.show()
```

[Figure: code output in the console (image created by the author)]

We observe that the average/mean Silhouette Score is the highest in the case of K=4 clusters.
This verifies that the Silhouette Score is a good measure of K-means fitting goodness.
We also produced these Figures:
[Figures: silhouette plots and the clustered data with the final (converged) centroids, for K = 3, 4, and 5 (images created by the author)]

The vertical line is the average silhouette score of all the values.
Again, we can also visually verify that the Silhouette Score is a good measure of K-means fitting goodness for this specific example.
K-means is one of the most widely used unsupervised clustering methods. The algorithm clusters the data at hand by trying to separate samples into K groups of equal variance, minimizing a criterion known as the inertia or within-cluster sum-of-squares. This algorithm requires the number of clusters to be specified.
However, if the ground truth labels are not known, the evaluation must be performed using the model itself.
In this article we used the Silhouette Coefficient. From the working example, we can verify that the Silhouette Score is a good measure of K-means fitting goodness.
That’s all! Hope you liked this new article — more to come soon!
If you liked and found this article useful, follow me to be able to see all my new posts.
Questions? Post them as a comment and I will reply as soon as possible.
LinkedIn: https://www.linkedin.com/in/serafeim-loukas/
ResearchGate: https://www.researchgate.net/profile/Serafeim_Loukas
Translated from: https://towardsdatascience.com/k-means-clustering-how-it-works-finding-the-optimum-number-of-clusters-in-the-data-13d18739255c