4 Python AutoML Libraries Every Data Scientist Should Know

Technology, 2023-11-24

Automated Machine Learning, often abbreviated as AutoML, is an emerging field in which the process of building machine learning models to model data is automated. AutoML has the capability to make modelling easier and more accessible for everyone.

If you’re interested in checking out AutoML, these four Python libraries are the way to go. A comparison will be provided at the end.

1 | auto-sklearn

auto-sklearn is an automated machine learning toolkit that integrates seamlessly with the standard sklearn interface so many in the community are familiar with. With the use of recent methods like Bayesian Optimization, the library is built to navigate the space of possible models and learns to infer if a specific configuration will work well on a given task.

Created by Matthias Feurer et al., the library is described in the paper Efficient and Robust Automated Machine Learning. The authors write:

… we introduce a robust new AutoML system based on scikit-learn (using 15 classifiers, 14 feature preprocessing methods, and 4 data preprocessing methods, giving rise to a structured hypothesis space with 110 hyperparameters).

auto-sklearn is perhaps the best library to get started with AutoML. In addition to discovering data preparation and model selections for a dataset, it learns from models that perform well on similar datasets. Top-performing models are aggregated in an ensemble.
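
To make the idea of aggregating top-performing models concrete, here is a minimal sketch of greedy ensemble selection, one common way to build such an ensemble. It is an illustration only and not auto-sklearn's actual code: the function names, the model names, and the validation predictions are all made up for the example.

```python
from typing import Dict, List


def accuracy(probs: List[float], labels: List[int]) -> float:
    """Fraction of examples where thresholding at 0.5 matches the label."""
    hits = sum((p >= 0.5) == bool(y) for p, y in zip(probs, labels))
    return hits / len(labels)


def average(members: List[List[float]]) -> List[float]:
    """Element-wise mean of the members' predicted probabilities."""
    return [sum(column) / len(members) for column in zip(*members)]


def greedy_ensemble(preds: Dict[str, List[float]], labels: List[int], size: int) -> List[str]:
    """Pick `size` models (repeats allowed), each time adding the model that
    gives the averaged ensemble the best validation accuracy."""
    chosen: List[str] = []
    for _ in range(size):
        best_name = max(
            preds,
            key=lambda name: accuracy(
                average([preds[m] for m in chosen + [name]]), labels
            ),
        )
        chosen.append(best_name)
    return chosen


# Hypothetical validation-set probabilities for the positive class.
labels = [1, 0, 1, 1, 0, 0]
preds = {
    "model_a": [0.9, 0.6, 0.8, 0.4, 0.2, 0.1],
    "model_b": [0.6, 0.1, 0.4, 0.8, 0.3, 0.6],
    "model_c": [0.7, 0.3, 0.6, 0.45, 0.55, 0.2],
}

ensemble = greedy_ensemble(preds, labels, size=3)
ensemble_accuracy = accuracy(average([preds[m] for m in ensemble]), labels)
print(ensemble, ensemble_accuracy)
```

On this toy data each model alone gets only 4 of 6 validation examples right, while the selected ensemble (model_a, model_b, then model_a again) gets all 6 right, because the models make different mistakes.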

Efficient and Robust Automated Machine Learning, 2015. Image free to share.

On top of an efficient implementation, auto-sklearn requires minimal user interaction. Install the library with pip install auto-sklearn.

The primary classes that can be used are AutoSklearnClassifier and AutoSklearnRegressor, which operate on classification and regression tasks, respectively. Both have the same user-specified parameters, of which the most important involve time constraints and ensemble sizes.

import autosklearn.classification  # for regression tasks, use autosklearn.regression.AutoSklearnRegressor

model = autosklearn.classification.AutoSklearnClassifier(
    ensemble_size=10,             # size of the final ensemble (minimum is 1)
    time_left_for_this_task=120,  # total number of seconds the search runs for
    per_run_time_limit=30,        # maximum seconds allocated per model
)
model.fit(X_train, y_train)            # run the model search
print(model.sprint_statistics())       # print statistics for the search
y_predictions = model.predict(X_test)  # get predictions from the fitted model

View more auto-sklearn documentation here. If you run into installation issues, check out these threads: issue installing pyrfr, failed building wheel for pyrfr.

2 | TPOT

TPOT is another Python library that automates the modelling pipeline, with a greater emphasis on data preparation as well as modelling algorithms and model hyperparameters. It automates feature selection, preprocessing, and construction through an evolutionary tree-based structure “called the Tree-based Pipeline Optimization Tool (TPOT) that automatically designs and optimizes machine learning pipelines.” (TPOT Paper)

Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science, 2016. Image free to share.

Programs, or pipelines, are represented as trees. Genetic programs select and evolve certain programs to maximize the end result of each automated machine learning pipeline.
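
The select-and-evolve loop can be sketched in a few lines. The example below is a deliberately simplified toy, not TPOT's implementation: pipelines are fixed three-step tuples rather than trees, only mutation is used, and `TOY_SCORES` is a made-up stand-in for the cross-validated score a real tool would compute. All names are hypothetical.

```python
import random
from typing import Callable, List, Tuple

Pipeline = Tuple[str, ...]  # (scaler, feature step, model)

# Hypothetical building blocks for each slot of the pipeline.
CHOICES = (
    ("none", "standard_scaler", "min_max_scaler"),
    ("none", "select_k_best", "pca"),
    ("logistic_regression", "random_forest", "knn"),
)

# Made-up per-step scores standing in for cross-validated accuracy.
TOY_SCORES = {
    "none": 0.0, "standard_scaler": 0.02, "min_max_scaler": 0.01,
    "select_k_best": 0.03, "pca": 0.01,
    "logistic_regression": 0.80, "random_forest": 0.85, "knn": 0.75,
}


def toy_fitness(pipeline: Pipeline) -> float:
    """Score a pipeline by summing the made-up scores of its steps."""
    return sum(TOY_SCORES[step] for step in pipeline)


def random_pipeline(rng: random.Random) -> Pipeline:
    """Draw one option per slot at random."""
    return tuple(rng.choice(options) for options in CHOICES)


def mutate(pipeline: Pipeline, rng: random.Random) -> Pipeline:
    """Replace one randomly chosen step with a random option for that slot."""
    slot = rng.randrange(len(CHOICES))
    steps = list(pipeline)
    steps[slot] = rng.choice(CHOICES[slot])
    return tuple(steps)


def evolve(
    fitness: Callable[[Pipeline], float],
    generations: int = 5,
    population_size: int = 20,
    seed: int = 0,
) -> Tuple[Pipeline, List[float]]:
    """Keep the better half of each generation and refill it with mutants."""
    rng = random.Random(seed)
    population = [random_pipeline(rng) for _ in range(population_size)]
    best_per_generation: List[float] = []
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        survivors = ranked[: population_size // 2]
        best_per_generation.append(fitness(survivors[0]))
        children = [
            mutate(rng.choice(survivors), rng)
            for _ in range(population_size - len(survivors))
        ]
        population = survivors + children
    return max(population, key=fitness), best_per_generation


best, history = evolve(toy_fitness)
print(best, history)
```

Because the best pipeline of each generation always survives, the best score never decreases from one generation to the next. The `generations` and `population_size` arguments play the same role here as the identically named TPOTClassifier parameters.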

As Pedro Domingos says, “a dumb algorithm with lots of data beats a clever one with limited data.” This is indeed the case: TPOT can generate sophisticated data preprocessing pipelines.

TPOT documentation. Image free to share.

TPOT pipeline optimizers can take a few hours to produce good results, as is the case with many AutoML algorithms (unless the dataset is small). You can run these long jobs in, for example, Kaggle commits or Google Colab.

import tpot

pipeline_optimizer = tpot.TPOTClassifier(
    generations=5,       # number of generations the evolutionary search runs for
    population_size=20,  # number of pipelines kept in each generation
    cv=5,                # number of folds in StratifiedKFold
)
pipeline_optimizer.fit(X_train, y_train)                # run the search; this can take a long time
print(pipeline_optimizer.score(X_test, y_test))         # score the best pipeline on held-out data
pipeline_optimizer.export('tpot_exported_pipeline.py')  # export the best pipeline as Python code

Perhaps the best feature of TPOT is that it exports your model as a Python code file, which can be used later.

View the TPOT documentation here. The library offers extensive customization and advanced features. View examples of using TPOT here, including one that reaches 98% accuracy on a subsample of the MNIST dataset without any deep learning, using only standard sklearn models.

3 | HyperOpt

HyperOpt is a Python library for Bayesian optimization, developed by James Bergstra. Designed for optimization of models with hundreds of parameters on a large scale, the library is explicitly used to optimize machine learning pipelines, with options to scale the optimization procedure across several cores and machines.

Our approach is to expose the underlying expression graph of how a performance metric (e.g. classification accuracy on validation examples) is computed from hyperparameters that govern not only how individual processing steps are applied, but even which processing steps are included.

Source: Making a science of model search: hyperparameter optimization in hundreds of dimensions for vision architectures.

However, HyperOpt is difficult to use directly, since it’s very technical and requires optimization procedures and parameters to be carefully specified. Instead, it’s recommended to use HyperOpt-sklearn, a wrapper around HyperOpt that incorporates the sklearn library.
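
As a rough sketch of what a HyperOpt-sklearn search can look like, the function below wraps the estimator interface shown in the project's README. The function name, its arguments, and the `max_evals` and `trial_timeout` values are illustrative assumptions; `X_train`, `y_train`, `X_test`, and `y_test` are assumed to be arrays you already have.

```python
def search_with_hyperopt_sklearn(X_train, y_train, X_test, y_test, max_evals=25):
    """Run a HyperOpt-sklearn search; return the test score and the best model."""
    # Imported inside the function so it can be defined even where the
    # optional packages (hyperopt, hpsklearn) are not installed.
    from hyperopt import tpe
    from hpsklearn import HyperoptEstimator, any_classifier, any_preprocessing

    estimator = HyperoptEstimator(
        classifier=any_classifier("clf"),        # search over the supported classifiers
        preprocessing=any_preprocessing("pre"),  # search over the supported preprocessing steps
        algo=tpe.suggest,                        # Tree-structured Parzen Estimator search
        max_evals=max_evals,                     # number of configurations to try
        trial_timeout=60,                        # seconds allowed per configuration
    )
    estimator.fit(X_train, y_train)
    # best_model() returns a dict describing the winning learner and preprocessing.
    return estimator.score(X_test, y_test), estimator.best_model()
```

Raising `max_evals` widens the search at the cost of run time, so it is worth starting small and increasing it once the setup works.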

Specifically, HyperOpt has a heavy focus on the dozens of hyperparameters that go into specific models, although it does support preprocessing. Consider the result of one HyperOpt-sklearn search, which resulted in a Gradient Boosting Classifier with no preprocessing:

{'learner': GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                                       learning_rate=0.009132299586303643, loss='deviance',
                                       max_depth=None, max_features='sqrt', max_leaf_nodes=None,
                                       min_impurity_decrease=0.0, min_impurity_split=None,
                                       min_samples_leaf=1, min_samples_split=2,
                                       min_weight_fraction_leaf=0.0, n_estimators=342,
                                       n_iter_no_change=None, presort='auto', random_state=2,
                                       subsample=0.6844206624548879, tol=0.0001,
                                       validation_fraction=0.1, verbose=0, warm_start=False),
 'preprocs': (),
 'ex_preprocs': ()}

The documentation for constructing HyperOpt-sklearn models can be found here. It is considerably more complicated than auto-sklearn and somewhat more complicated than TPOT, but it may be worth it if hyperparameters are important.

4 | AutoKeras

Neural networks and deep learning are significantly more powerful than standard machine learning models, and hence more difficult to automate.

With AutoKeras, a neural architecture search algorithm finds the best architectures, like the number of neurons in a layer, the number of layers, which layers to incorporate, layer-specific parameters like filter size or percent of dropped neurons in Dropout, etc. Once the search is complete, you can use the model as a normal TensorFlow/Keras model.

By using AutoKeras, you can build a model with complex elements like embeddings and spatial reductions that would otherwise be less accessible to those who are still learning deep learning.

When AutoKeras creates models for you, much of the preprocessing, like vectorizing or cleaning text data, is done and optimized for you.

It takes two lines to initiate and train a search. AutoKeras boasts a Keras-like interface, so it’s not hard to remember and use at all.

With support for text, image, and structured data, as well as interfaces both for beginners and for those seeking to get more involved in technicalities, AutoKeras uses neural architecture search methods to eliminate much of the hard work and ambiguity for you.

Although it takes long for AutoKeras to run, there are many user-specified parameters available to control the running time, number of models explored, the search space size, etc.
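
A minimal sketch of a text classification search is shown below, wrapped in a function. The function name and the default values are illustrative assumptions; `x_train` is assumed to be a NumPy array of strings and `y_train` the matching labels. `max_trials` caps how many candidate models are explored, and `epochs` caps the training time per model.

```python
def search_text_classifier(x_train, y_train, max_trials=3, epochs=2):
    """Run a small AutoKeras search and return the best model found."""
    # Imported inside the function so it can be defined even where the
    # optional autokeras package (and TensorFlow) is not installed.
    import autokeras as ak

    clf = ak.TextClassifier(max_trials=max_trials, overwrite=True)  # initiate the search
    clf.fit(x_train, y_train, epochs=epochs)                        # train the candidates
    return clf.export_model()  # the best model, usable as a regular Keras model
```

The first two statements after the import are the "two lines" that initiate and train a search. Small values of `max_trials` and `epochs` are useful for a first smoke test before committing to a long run.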

Consider this architecture for a text classification task, generated using AutoKeras.

Read a tutorial about using AutoKeras on structured, image, and text data here. View AutoKeras documentation here.

Comparison: Which one should I use?

Use auto-sklearn if your priority is a clean, simple interface and relatively quick results. It also integrates naturally with sklearn, works with commonly used models and methods, and gives you a lot of control over timing.

Use TPOT if your priority is high accuracy and you can accept potentially long training times. It emphasizes advanced preprocessing methods, made possible by representing pipelines as tree structures. Bonus: it outputs Python code for the best models.

Use HyperOpt-sklearn if your priority is high accuracy and you can accept potentially long training times. It emphasizes hyperparameter optimization of models, which may or may not be productive, depending on the dataset and the algorithm.

Use AutoKeras if your problem requires neural networks (warning: don’t overestimate their power), especially if the data comes in text or image form. It does take a long time to train, but there are extensive options for controlling the run time and the size of the search space.

Thanks for reading! Good luck on your AutoML journey. 😃

Translated from: https://towardsdatascience.com/4-python-automl-libraries-every-data-scientist-should-know-680ff5d6ad08
