LightGBM



    LightGBM, created by researchers at Microsoft, is an implementation of gradient boosted decision trees (GBDT), an ensemble method that combines decision trees (as weak learners) in a serial fashion (boosting).


    Gradient boosted decision trees

    Decision trees are combined in such a way that each new learner fits the residuals from the previous tree, so that the model improves. The final model aggregates the results from each step and a strong learner is achieved.

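    To make the residual-fitting idea concrete, here is a minimal sketch (my own illustration, not LightGBM's internals) that boosts scikit-learn regression trees on the squared error; the learning rate, tree count and depth are arbitrary choices.

    ```python
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def boosted_fit(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
        """Toy GBDT for squared error: each new tree fits the residuals of the current model."""
        base = y.mean()                          # start from a constant prediction
        prediction = np.full(len(y), base)
        trees = []
        for _ in range(n_trees):
            residuals = y - prediction           # negative gradient of the squared error
            tree = DecisionTreeRegressor(max_depth=max_depth)
            tree.fit(X, residuals)               # the weak learner models what is still unexplained
            prediction += learning_rate * tree.predict(X)
            trees.append(tree)
        return base, trees

    def boosted_predict(base, trees, X, learning_rate=0.1):
        """Aggregate the base value and every tree's contribution into the final prediction."""
        return base + learning_rate * sum(tree.predict(X) for tree in trees)
    ```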

    GBDT is so accurate that its implementations have been dominating major machine learning competitions.


    The motivation behind LightGBM

    Decision trees are built by splitting observations (i.e. data instances) based on feature values. This is how a decision tree “learns”. The algorithm looks for the best split which results in the highest information gain.


    The information gain is basically the difference between entropy before and after the split. Entropy is a measure of uncertainty or randomness. The more randomness a variable has, the higher the entropy is. Thus, splits are done in a way that randomness is reduced.

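    As a small illustration of this criterion (a toy example of my own, not taken from the LightGBM code, which optimizes a gradient-based gain), entropy and the information gain of a candidate split can be computed like this:

    ```python
    import numpy as np

    def entropy(labels):
        """Shannon entropy of a vector of class labels."""
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def information_gain(labels, left_mask):
        """Entropy before the split minus the weighted entropy after it."""
        left, right = labels[left_mask], labels[~left_mask]
        n = len(labels)
        after = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        return entropy(labels) - after

    # Example: splitting on "feature < 2" separates the classes perfectly, so the gain is high.
    labels = np.array([0, 0, 0, 1, 1, 1])
    feature = np.array([1, 1, 1, 3, 3, 3])
    print(information_gain(labels, feature < 2))  # ~1.0 bit
    ```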

    Finding the best split turns out to be the most time-consuming part of the learning process of decision trees. The two algorithms used by other GBDT implementations to find the best splits are:


    Pre-sorted: Feature values are pre-sorted and all possible split points are evaluated.

    Histogram-based: Continuous features are divided into discrete bins which are used to create feature histograms.
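
    The following is a rough sketch of the histogram-based idea, assuming squared-error gradients, equal-width bins and a simple variance-style gain; the real implementations differ in many details:

    ```python
    import numpy as np

    def best_split_histogram(feature, gradients, n_bins=255):
        """Bin a continuous feature, accumulate gradient sums per bin,
        then scan the bins (not the raw values) for the best split point."""
        # Discretize the feature into equal-width bins.
        edges = np.linspace(feature.min(), feature.max(), n_bins + 1)
        bins = np.clip(np.digitize(feature, edges[1:-1]), 0, n_bins - 1)

        grad_sum = np.bincount(bins, weights=gradients, minlength=n_bins)
        count = np.bincount(bins, minlength=n_bins)

        best_gain, best_bin = -np.inf, None
        g_left, c_left = 0.0, 0.0
        g_total, c_total = gradients.sum(), len(gradients)
        for b in range(n_bins - 1):
            g_left += grad_sum[b]
            c_left += count[b]
            g_right, c_right = g_total - g_left, c_total - c_left
            if c_left == 0 or c_right == 0:
                continue
            # Variance-reduction style gain computed from the accumulated gradient statistics.
            gain = g_left**2 / c_left + g_right**2 / c_right - g_total**2 / c_total
            if gain > best_gain:
                best_gain, best_bin = gain, b
        return best_bin, best_gain
    ```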

    “Sklearn GBDT” and “gbm in R” use the pre-sorted algorithm, whereas “pGBRT” uses the histogram-based algorithm. “xgboost” supports both.


    The histogram-based algorithm is more efficient in terms of memory consumption and training speed. However, both pre-sorted and histogram-based get slower as the number of instances or features increases. LightGBM is aimed to solve this efficiency problem, especially with large datasets.

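    For reference, a minimal training call with the library itself looks like the sketch below; the parameter values are illustrative defaults rather than tuned settings, and max_bin is the knob that controls the number of histogram bins.

    ```python
    import lightgbm as lgb
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=100_000, n_features=50, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = lgb.LGBMClassifier(
        n_estimators=200,
        learning_rate=0.1,
        num_leaves=31,   # leaf-wise tree growth is capped by the number of leaves
        max_bin=255,     # number of histogram bins used to discretize each feature
    )
    model.fit(X_train, y_train)
    print("test accuracy:", model.score(X_test, y_test))
    ```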

    What makes LightGBM more efficient

    The starting point for LightGBM was the histogram-based algorithm since it performs better than the pre-sorted algorithm.


    For each feature, all the data instances are scanned to find the best split with regards to the information gain. Thus, the complexity of the histogram-based algorithm is dominated by the number of data instances and features.


    To overcome this issue, LightGBM uses two techniques:


    GOSS (Gradient-based One-Side Sampling)

    EFB (Exclusive Feature Bundling)

    Let’s go into detail about what these techniques do and how they make LightGBM “light”.


    GOSS (Gradient-based One-Side Sampling)

    We have mentioned that general GBDT implementations scan all data instances to find the best split. This is definitely not an optimal way.


    If we can manage to sample data based on information gain, the algorithm will be more effective. One way is sampling data based on their weights. However, this cannot be applied to GBDT since there are no sample weights in GBDT.


    GOSS addresses this issue by using gradients which give us valuable insight into the information gain.


    Small gradient: The algorithm has been trained on this instance and the error associated with it is small.

    Large gradient: The error associated with this instance is large, so it will provide more information gain.

    We can eliminate the instances with small gradients and only focus on the ones with large gradients. However, in that case, the data distribution will be changed. We do not want that because it will negatively affect the accuracy of the learned model.


    GOSS provides a way to sample data based on gradients while taking the data distribution into consideration.


    Here is how it works:


    1. The data instances are sorted according to the absolute value of their gradients.

    2. The top a×100% instances are selected.

    3. From the remaining instances, a random sample of size b×100% is selected.

    4. When the information gain is calculated, the random sample of small gradients is multiplied by a constant equal to (1-a)/b.
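
    A rough numpy sketch of these four steps (my own simplification of the procedure described above; the actual implementation also uses second-order statistics) might look like this:

    ```python
    import numpy as np

    def goss_sample(gradients, a=0.2, b=0.1, seed=0):
        """Return the indices kept by GOSS and the weight applied to each kept instance."""
        rng = np.random.default_rng(seed)
        n = len(gradients)
        order = np.argsort(np.abs(gradients))[::-1]              # 1. sort by |gradient|, descending
        n_top = int(a * n)
        top = order[:n_top]                                      # 2. keep the top a*100% instances
        rest = order[n_top:]
        n_rand = int(b * n)
        sampled = rng.choice(rest, size=n_rand, replace=False)   # 3. random b*100% of the rest
        kept = np.concatenate([top, sampled])
        weights = np.ones(len(kept))
        weights[n_top:] = (1 - a) / b                            # 4. re-weight the small-gradient sample
        return kept, weights
    ```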

    What GOSS eventually achieves is that the focus of the model leans towards the data instances that cause more loss (i.e. the under-trained ones) while not affecting the data distribution much.

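    In the library, GOSS can be switched on through the boosting parameters; top_rate and other_rate correspond to a and b above (recent LightGBM versions also expose the same choice via data_sample_strategy="goss", so check your version's docs). The values below are only illustrative.

    ```python
    import lightgbm as lgb

    # GOSS is selected through the boosting parameters.
    params = {
        "objective": "binary",
        "boosting": "goss",   # sample instances with GOSS instead of plain bagging
        "top_rate": 0.2,      # a: fraction of large-gradient instances always kept
        "other_rate": 0.1,    # b: fraction sampled from the remaining small-gradient instances
        "num_leaves": 31,
        "learning_rate": 0.1,
    }

    # Assuming X_train / y_train from the earlier example:
    # train_set = lgb.Dataset(X_train, label=y_train)
    # booster = lgb.train(params, train_set, num_boost_round=200)
    ```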

    EFB (Exclusive Feature Bundling)

    Datasets with a high number of features are likely to have sparse features (i.e. lots of zero values). These sparse features are usually mutually exclusive, meaning they do not take non-zero values simultaneously. Consider the case of one-hot encoded text data: in a particular row, only the one column indicating a specific word is non-zero and all other columns are zero.


    EFB is a technique that uses a greedy algorithm to combine (or bundle) these mutually exclusive features into a single feature (an “exclusive feature bundle”) and thus reduce the dimensionality. EFB reduces the training time of GBDT without affecting the accuracy much because the complexity of creating feature histograms is now proportional to the number of bundles instead of the number of features (#bundles is much less than #features).


    One of the challenges with EFB is to find the optimal bundles. The researchers at Microsoft designed an algorithm that converts the bundling problem to a graph coloring problem.


    In the graph coloring problem, the features are taken as vertices, and edges are added between features which are not mutually exclusive. Then a greedy algorithm is used to produce bundles.


    To take it one step further, the algorithm also allows bundling of features that rarely have non-zero values simultaneously (i.e. almost mutually exclusive).

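    Here is a very small sketch of that greedy idea (ignoring the ordering-by-degree and other refinements in the paper); the function and variable names are my own, and max_conflicts stands for the tolerance for "almost" mutually exclusive features:

    ```python
    import numpy as np

    def greedy_bundles(X, max_conflicts=0):
        """Greedily group columns of X so that columns in a bundle are
        (almost) never non-zero on the same rows."""
        n_features = X.shape[1]
        nonzero = [set(np.flatnonzero(X[:, j])) for j in range(n_features)]
        bundles = []       # each bundle is a list of feature indices
        bundle_rows = []   # rows already "used" by each bundle
        for j in range(n_features):
            for k, rows in enumerate(bundle_rows):
                if len(rows & nonzero[j]) <= max_conflicts:   # few shared non-zero rows: compatible
                    bundles[k].append(j)
                    rows |= nonzero[j]
                    break
            else:
                bundles.append([j])           # no compatible bundle found, start a new one
                bundle_rows.append(set(nonzero[j]))
        return bundles
    ```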

    Another challenge is to merge the features in a bundle in a way that the values of the original features can be extracted. Consider a bundle of 3 features. We need to be able to identify the values of these 3 features using the value of the bundled feature.


    Recall that the histogram-based algorithm creates discrete bins for continuous values. To overcome the challenge of merging features, exclusive values of features in a bundle are put in different bins which can be achieved by adding offsets to the original feature values.

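    A toy illustration of the offset trick with two mutually exclusive features (the bin values are made up for the example): the second feature's bins are shifted past the first feature's range so both fit into one bundled column and remain recoverable.

    ```python
    import numpy as np

    # Two mutually exclusive features: on each row at most one of them is non-zero.
    feat_a = np.array([3, 0, 0, 7, 0])   # bin values in [0, 10)
    feat_b = np.array([0, 2, 5, 0, 0])   # bin values in [0, 20)

    # Shift feature B's bins by feature A's range (10) so the two value ranges don't overlap.
    offset = 10
    bundled = np.where(feat_a > 0, feat_a, np.where(feat_b > 0, feat_b + offset, 0))
    print(bundled)  # [ 3 12 15  7  0]

    # The original values are still recoverable from the bundled feature:
    recovered_a = np.where(bundled < offset, bundled, 0)
    recovered_b = np.where(bundled >= offset, bundled - offset, 0)
    ```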

    Conclusion

    The motivation behind LightGBM is to solve the training speed and memory consumption issues associated with the conventional implementations of GBDTs when working with large datasets.


    The goal is basically to reduce the size of the dataset (both in terms of data instances and features) while preserving as much information as possible. The GOSS and EFB techniques are implemented to achieve this goal.


    According to the paper by the creators of LightGBM, “LightGBM speeds up the training process of conventional GBDT by up to over 20 times while achieving almost the same accuracy”.


    Thank you for reading. Please let me know if you have any feedback.


    LightGBM: A Highly Efficient Gradient Boosting Decision Tree


    Translated from: https://towardsdatascience.com/understanding-the-lightgbm-772ca08aabfa

