AdaBoost Classifier in Python

In recent years, boosting algorithms have gained huge popularity in data science and machine learning competitions. Most of the winners of these competitions use boosting algorithms to achieve high accuracy. These competitions provide a global platform for learning, exploring, and providing solutions for various business and government problems. Boosting algorithms combine multiple low-accuracy (or weak) models to create a high-accuracy (or strong) model. They can be applied in various domains such as credit, insurance, marketing, and sales. Boosting algorithms such as AdaBoost, Gradient Boosting, and XGBoost are widely used machine learning algorithms for winning data science competitions. In this tutorial, you are going to learn about the AdaBoost ensemble boosting algorithm, and the following topics will be covered:

    Ensemble Machine Learning Approach

AdaBoost Classifier

How does the AdaBoost algorithm work?

Building Model in Python

Pros and cons

Conclusion

For more such tutorials, projects, and courses visit DataCamp

Ensemble Machine Learning Approach

An ensemble is a composite model that combines a series of low-performing classifiers with the aim of creating an improved classifier. Each individual classifier votes, and the final prediction label is the one that receives the majority of votes. Ensembles offer more accuracy than individual (base) classifiers. Ensemble methods can be parallelized by allocating each base learner to a different machine. In short, ensemble learning methods are meta-algorithms that combine several machine learning models into a single predictive model to increase performance. Ensemble methods can decrease variance using the bagging approach, decrease bias using the boosting approach, or improve predictions using the stacking approach.
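To make the majority-voting idea concrete, here is a minimal sketch using scikit-learn's VotingClassifier; the choice of the three base models is purely illustrative:

# A minimal hard-voting ensemble (the base models are arbitrary examples)
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Each base classifier casts a vote; 'hard' voting returns the majority label
ensemble = VotingClassifier(estimators=[
    ('lr', LogisticRegression(max_iter=1000)),
    ('dt', DecisionTreeClassifier()),
    ('nb', GaussianNB()),
], voting='hard')
ensemble.fit(X, y)
print(ensemble.predict(X[:3]))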

1. Bagging stands for bootstrap aggregation. It combines multiple learners in a way that reduces the variance of the estimates. For example, a random forest trains M decision trees: you train the M trees on different random subsets of the data and perform voting for the final prediction. Bagging ensemble methods include Random Forest and Extra Trees (a short sketch of bagging and stacking follows this list).

2. Boosting algorithms combine a set of low-accuracy classifiers to create a highly accurate classifier. A low-accuracy (or weak) classifier offers accuracy only slightly better than flipping a coin. A highly accurate (or strong) classifier offers an error rate close to 0. The boosting algorithm keeps track of the models that failed to predict accurately. Boosting algorithms are less affected by the overfitting problem. The following three algorithms have gained huge popularity in data science competitions:

    AdaBoost (Adaptive Boosting)

Gradient Tree Boosting

XGBoost

3. Stacking (or stacked generalization) is an ensemble learning technique that combines the predictions of multiple base classification models into a new data set. This new data set is treated as the input for another classifier, which is then used to solve the problem. Stacking is often referred to as blending.
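As a brief illustration of items 1 and 3 (boosting itself is the subject of the rest of this tutorial), here is a minimal sketch of a bagging ensemble and a stacking ensemble in scikit-learn; the base models are arbitrary choices for demonstration:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Bagging: a random forest averages the votes of many trees,
# each trained on a bootstrap sample of the data
bagging = RandomForestClassifier(n_estimators=100)

# Stacking: base-model predictions become features for a final meta-learner
stacking = StackingClassifier(
    estimators=[('rf', RandomForestClassifier(n_estimators=50)), ('svc', SVC())],
    final_estimator=LogisticRegression(max_iter=1000))

for name, clf in [('bagging', bagging), ('stacking', stacking)]:
    print(name, cross_val_score(clf, X, y, cv=5).mean())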

On the basis of the arrangement of base learners, ensemble methods can be divided into two groups: in parallel ensemble methods, base learners are generated in parallel, for example Random Forest; in sequential ensemble methods, base learners are generated sequentially, for example AdaBoost.

On the basis of the type of base learners, ensemble methods can be divided into two groups: homogeneous ensemble methods use the same type of base learner in each iteration, while heterogeneous ensemble methods use different types of base learners in each iteration.

AdaBoost Classifier

AdaBoost, or Adaptive Boosting, is one of the ensemble boosting classifiers, proposed by Yoav Freund and Robert Schapire in 1996. It combines multiple classifiers to increase overall accuracy. AdaBoost is an iterative ensemble method: it builds a strong classifier by combining multiple poorly performing classifiers, so that you end up with a high-accuracy strong classifier. The basic concept behind AdaBoost is to set the weights of the classifiers and of the training samples in each iteration in a way that ensures accurate predictions of unusual observations. Any machine learning algorithm that accepts weights on the training set can be used as a base classifier. AdaBoost should meet two conditions:

    The classifier should be trained iteratively on various weighted training examples.

In each iteration, it tries to provide a good fit to these examples by minimizing the training error.

How does the AdaBoost algorithm work?

It works in the following steps:

1. Initially, AdaBoost selects a training subset randomly.

2. It iteratively trains the AdaBoost machine learning model, selecting the training set based on the accuracy of the previous round of training.

3. It assigns higher weights to wrongly classified observations, so that in the next iteration these observations get a higher probability of being picked for classification.

4. It also assigns a weight to the trained classifier in each iteration according to the classifier's accuracy: the more accurate the classifier, the higher its weight (see the sketch after this list).

5. This process iterates until the complete training data fits without any error, or until the specified maximum number of estimators is reached.

6. To classify a new point, perform a "vote" across all of the learners you built.
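To make steps 3 and 4 concrete, here is a minimal numeric sketch of one round of the classic binary AdaBoost reweighting, assuming labels in {-1, +1}; the toy predictions are made up for illustration:

import numpy as np

# Toy data: true labels and one weak learner's predictions (made up)
y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, -1, -1])     # two mistakes
w = np.full(len(y_true), 1 / len(y_true))  # uniform initial sample weights

# Weighted error of the weak learner
eps = np.sum(w[y_pred != y_true])

# Classifier weight (step 4): more accurate learners get a larger alpha
alpha = 0.5 * np.log((1 - eps) / eps)

# Step 3: increase the weights of misclassified samples, then renormalize
w = w * np.exp(-alpha * y_true * y_pred)
w = w / w.sum()

print(f"eps={eps:.2f}, alpha={alpha:.2f}, new weights={np.round(w, 3)}")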

Building Model in Python

Importing Required Libraries

Let's first load the required libraries.

# Load libraries
from sklearn.ensemble import AdaBoostClassifier
from sklearn import datasets
# Import train_test_split function
from sklearn.model_selection import train_test_split
# Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics

Loading Dataset

In the model-building part, you can use the IRIS dataset, which is a very famous multi-class classification problem. This dataset comprises 4 features (sepal length, sepal width, petal length, petal width) and a target (the type of flower). The data has three flower classes: Setosa, Versicolour, and Virginica. The dataset is available in the scikit-learn library, or you can also download it from the UCI Machine Learning Repository.

# Load data
iris = datasets.load_iris()
X = iris.data
y = iris.target
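If you want to sanity-check what was loaded, a quick optional inspection looks like this:

# Optional: inspect shapes, feature names, and class names
print(X.shape, y.shape)   # (150, 4) (150,)
print(iris.feature_names)
print(iris.target_names)  # ['setosa' 'versicolor' 'virginica']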

Split dataset

To understand model performance, dividing the dataset into a training set and a test set is a good strategy.

Let's split the dataset by using the function train_test_split(). You basically need to pass 3 parameters: features, target, and test set size.

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3) # 70% training and 30% test
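Note that the split is random, so the accuracy figures reported below may vary slightly from run to run; if you want a reproducible split, you can optionally fix the seed:

# Optional: fix random_state for a reproducible split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)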

Building AdaBoost Model

Let's create the AdaBoost model using scikit-learn. AdaBoost uses a decision tree classifier as the default base classifier.

# Create adaboost classifier object
abc = AdaBoostClassifier(n_estimators=50, learning_rate=1)
# Train adaboost classifier
model = abc.fit(X_train, y_train)
# Predict the response for test dataset
y_pred = model.predict(X_test)

The most important parameters are base_estimator, n_estimators, and learning_rate.

base_estimator: The weak learner used to train the model. AdaBoost uses DecisionTreeClassifier as the default weak learner for training purposes, but you can also specify other machine learning algorithms. (In newer scikit-learn releases this parameter has been renamed to estimator.)

n_estimators: Number of weak learners to train iteratively.

learning_rate: The weight applied to each weak learner's contribution at each boosting iteration. It uses 1 as the default value.
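Since n_estimators and learning_rate interact, a common optional way to choose them is a small grid search; here is a minimal sketch, assuming the X_train/y_train split from above and an illustrative grid:

from sklearn.model_selection import GridSearchCV

# Search a small, illustrative grid over the two key parameters
param_grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.1, 0.5, 1.0],
}
grid = GridSearchCV(AdaBoostClassifier(), param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)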

Evaluate Model

Let's estimate how accurately the classifier or model can predict the type of flower.

Accuracy can be computed by comparing actual test set values and predicted values.

# Model Accuracy, how often is the classifier correct?
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

Output:

Accuracy: 0.8888888888888888

Well, you got an accuracy of 88.88%, which is considered good accuracy.
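Beyond a single accuracy number, you can optionally look at per-class metrics and the confusion matrix, using the same y_test and y_pred:

from sklearn.metrics import classification_report, confusion_matrix

# Per-class precision, recall, and F1, plus the confusion matrix
print(classification_report(y_test, y_pred, target_names=iris.target_names))
print(confusion_matrix(y_test, y_pred))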

For further evaluation, you can also create a model using different base estimators.

Using Different Base Learners

I have used SVC as the base estimator; you can use any ML learner as the base estimator if it accepts sample weights, such as Decision Tree or Support Vector Classifier. (SVC is created with probability=True below because, in the scikit-learn versions this tutorial targets, AdaBoost's default SAMME.R algorithm relies on class probability estimates.)

# Load libraries
from sklearn.ensemble import AdaBoostClassifier
# Import Support Vector Classifier
from sklearn.svm import SVC
# Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics

# Create base classifier
svc = SVC(probability=True, kernel='linear')

# Create adaboost classifier object
abc = AdaBoostClassifier(n_estimators=50, base_estimator=svc, learning_rate=1)

# Train adaboost classifier
model = abc.fit(X_train, y_train)

# Predict the response for test dataset
y_pred = model.predict(X_test)

# Model Accuracy, how often is the classifier correct?
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

Output:

Accuracy: 0.9555555555555556

Well, you got a classification rate of 95.55%, which is considered good accuracy.

In this case, the SVC base estimator achieved better accuracy than the decision tree base estimator.

Pros

AdaBoost is easy to implement. It iteratively corrects the mistakes of the weak classifier and improves accuracy by combining weak learners. You can use many base classifiers with AdaBoost. AdaBoost is not prone to overfitting; this has been observed in experimental results, although there is no concrete theoretical explanation for it.

Cons

AdaBoost is sensitive to noisy data. It is highly affected by outliers because it tries to fit each point perfectly. AdaBoost is also slower compared to XGBoost.

Conclusion

Congratulations, you have made it to the end of this tutorial!

In this tutorial, you have learned about ensemble machine learning approaches and the AdaBoost algorithm: how it works, how to build a model, and how to evaluate it using the Python scikit-learn package. We also discussed its pros and cons.

I look forward to hearing any feedback or questions. You can ask a question by leaving a comment, and I will try my best to answer it.

For more such tutorials, projects, and courses visit DataCamp

Originally published at https://www.datacamp.com/community/tutorials/naive-bayes-scikit-learn

Reach out to me on Linkedin: https://www.linkedin.com/in/avinash-navlani/

Python In Plain English

Enjoyed this article? If so, get more similar content by subscribing to Decoded, our YouTube channel!

Translated from: https://medium.com/python-in-plain-english/adaboost-classifier-in-python-8d34a9f20459
