Make kNN 300 times faster than Scikit-learn's in 20 lines

    Introduction

    k Nearest Neighbors (kNN) is a simple ML algorithm for classification and regression. Scikit-learn offers both with a very simple API, making it popular in machine learning courses. There is one issue with it: it's quite slow! But don't worry, we can make it work for bigger datasets with the Facebook faiss library.

    The kNN algorithm has to find the nearest neighbors in the training set for the sample being classified. As the dimensionality (number of features) of the data increases, the time needed to find the nearest neighbors grows very quickly. To speed up prediction, in the training phase (the .fit() method) kNN classifiers build data structures that keep the training dataset organized and help with nearest neighbor searches.

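    To illustrate the API in question, a minimal Scikit-learn baseline might look like this (the toy data below is only an illustration, not from the original article):

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(1000, 64))       # 1,000 samples with 64 features
    y_train = rng.integers(0, 3, size=1000)     # 3 classes
    X_test = rng.normal(size=(100, 64))

    clf = KNeighborsClassifier(n_neighbors=5)   # algorithm="auto" by default
    clf.fit(X_train, y_train)                   # builds the nearest neighbor search structure
    y_pred = clf.predict(X_test)                # the actual neighbor search happens here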

    Scikit-learn vs faiss

    In Scikit-learn, the default “auto” mode automatically chooses the algorithm based on the training data size and structure. It is either a brute-force search (for very small datasets) or one of the popular data structures for nearest neighbor lookups, the k-d tree or the ball tree. They are simple and often taught in computational geometry courses, but the efficiency of their implementations in Scikit-learn is questionable at best. For example, you may have seen kNN tutorials use only a small part of the MNIST dataset, about 10k images; the reason is that for the entire dataset of 60k images it would be far too slow. And today this doesn’t even come close to “big data”!

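    As a sketch of what this looks like in code, the underlying structure can also be requested explicitly through the algorithm parameter; these are the options that “auto” chooses between:

    from sklearn.neighbors import KNeighborsClassifier

    knn_brute = KNeighborsClassifier(n_neighbors=5, algorithm="brute")      # exhaustive search
    knn_kdtree = KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree")   # k-d tree
    knn_ball = KNeighborsClassifier(n_neighbors=5, algorithm="ball_tree")   # ball tree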

    Fortunately, Facebook AI Research (FAIR) came up with excellent implementations of nearest neighbor search algorithms, which are available in the faiss library. faiss offers CPU and GPU support and many different metrics, makes use of multiple cores, GPUs and machines, and much more. With it, we can implement a k nearest neighbors classifier that is not just a few times faster than Scikit-learn's, but orders of magnitude faster!

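    For example, a small sketch of those knobs (assuming faiss is installed, e.g. via the faiss-cpu package; the numbers are illustrative):

    import faiss

    faiss.omp_set_num_threads(4)        # faiss uses multiple cores by default; the thread count can be capped
    index_ip = faiss.IndexFlatIP(128)   # besides L2 (Euclidean), other metrics exist, e.g. maximum inner product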

    Implementing kNN classifier with faiss

    If you have trouble with the GitHub Gist below, the code is also available on my GitHub (link).

    A great feature of faiss is that it has both installation and build instructions (installation docs) and excellent documentation with examples (getting started docs). After the installation, we can write the actual classifier. The code is quite simple, since we just mimic the Scikit-learn API.

    import numpy as np
    import faiss

    class FaissKNeighbors:
        def __init__(self, k=5):
            self.index = None
            self.y = None
            self.k = k

        def fit(self, X, y):
            self.index = faiss.IndexFlatL2(X.shape[1])
            self.index.add(X.astype(np.float32))
            self.y = y

        def predict(self, X):
            distances, indices = self.index.search(X.astype(np.float32), k=self.k)
            votes = self.y[indices]
            predictions = np.array([np.argmax(np.bincount(x)) for x in votes])
            return predictions
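
    For reference, a usage sketch (the data here is synthetic, just to show the expected inputs; faiss works on NumPy float32 arrays, and the class casts the features internally):

    import numpy as np

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(10000, 64)).astype(np.float32)   # 10k samples, 64 features
    y_train = rng.integers(0, 10, size=10000)                   # 10 classes
    X_test = rng.normal(size=(1000, 64)).astype(np.float32)

    knn = FaissKNeighbors(k=5)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)    # one predicted class label per test sample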

    There are a few interesting elements here:

    the index attribute holds the data structure created by faiss to speed up nearest neighbor search

    the data that we put in the index has to be of the NumPy float32 type

    we use IndexFlatL2 here, which is the simplest exact nearest neighbor search using Euclidean distance (L2 norm), very similar to the default Scikit-learn KNeighborsClassifier; you can also use other metrics (metrics docs) and types of indices (indices docs), e.g. for approximate nearest neighbor search

    the .search() method returns the distances to the k nearest neighbors and their indices; we only care about the indices here, but you could, e.g., implement distance-weighted nearest neighbors with the additional information (a sketch of such a variant follows after this list)

    the indices returned by .search() form a 2D matrix, where the n-th row contains the indices of the k nearest neighbors of the n-th sample; with self.y[indices], we turn those indices into the classes of the nearest neighbors, so we know the votes for each sample

    np.argmax(np.bincount(x)) returns the most common number in the x array, which is the predicted class; we do this for every row, i.e. for every sample that we have to classify

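    As mentioned in the list above, the distances can also be used. A hypothetical predict_weighted() method that could be added to FaissKNeighbors (a sketch for illustration, not part of the original classifier) might look roughly like this:

    def predict_weighted(self, X):
        # illustrative distance-weighted variant; IndexFlatL2 returns squared L2 distances
        distances, indices = self.index.search(X.astype(np.float32), k=self.k)
        votes = self.y[indices]                 # classes of the k nearest neighbors, shape (n_samples, k)
        weights = 1.0 / (distances + 1e-8)      # closer neighbors get larger weights; epsilon avoids division by zero
        n_classes = int(self.y.max()) + 1
        predictions = np.array([
            np.argmax(np.bincount(v, weights=w, minlength=n_classes))
            for v, w in zip(votes, weights)
        ])
        return predictions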

    Time comparison

    I’ve chosen a few popular datasets available in Scikit-learn for comparison. The train and predict times are compared. For easier reading, I’ve explicitly written how many times faster the faiss-based classifier is than Scikit-learn’s.

    All of those times have been measured with the time.process_time() function, which measures process time instead of wall clock time for more accurate results. Results are averages of 5 runs.

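    A minimal sketch of this kind of measurement (classifier, X_train, y_train and X_test are placeholders here):

    import time

    start = time.process_time()
    classifier.fit(X_train, y_train)
    fit_time = time.process_time() - start       # CPU time spent in training

    start = time.process_time()
    y_pred = classifier.predict(X_test)
    predict_time = time.process_time() - start   # CPU time spent in prediction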

    Train times (image by author)

    Predict times (image by author)

    On average, training is almost 300 times faster, while prediction is about 7.5 times faster. Also note that for the MNIST dataset, which has a size realistic for modern datasets, we get a 17x speedup, which is huge. I was ready to give up on Scikit-learn after 10 minutes, and it took almost 15 minutes of CPU time (the wall clock time was even longer)!

    Summary

    With 20 lines of code, we get a huge speed boost for the kNN classifier thanks to the faiss library. If you need to, you can do even better with a GPU, multiple GPUs, approximate nearest neighbor search and much more, which is nicely explained in the faiss docs.

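    As a rough sketch of what those options look like (the parameters are illustrative, X_train is a float32 training matrix as above, and the GPU part assumes the GPU build of faiss):

    import faiss

    d = X_train.shape[1]

    # approximate nearest neighbor search with an inverted file (IVF) index
    quantizer = faiss.IndexFlatL2(d)
    index = faiss.IndexIVFFlat(quantizer, d, 100)   # 100 Voronoi cells
    index.train(X_train)                            # IVF indices require a training pass
    index.add(X_train)
    index.nprobe = 10                               # number of cells to visit per query

    # moving an index to the first GPU
    res = faiss.StandardGpuResources()
    gpu_index = faiss.index_cpu_to_gpu(res, 0, index)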

    Translated from: https://towardsdatascience.com/make-knn-300-times-faster-than-scikit-learns-in-20-lines-5e29d74e76bb
