PyCaret: Prepare Your Machine Learning Model in Minutes


While working on a machine learning problem, wouldn't it be better if we could quickly compare a few models and decide which one deserves our time and resources? PyCaret is a library that helps us do precisely that. The same can be done with scikit-learn, but what sets PyCaret apart is its low-code interface, which speeds up the process. It is not limited to prototyping either: it can also be used to develop and deploy a full-scale machine learning model.

PyCaret provides the entire gamut of machine learning operations, from preprocessing through deployment. We will run through all of these steps in PyCaret, using the Titanic dataset for a classification problem. PyCaret can be used to model both supervised and unsupervised problems; here we cover supervised classification to get the gist of the library.

1. Data Preprocessing

Let us import the necessary libraries first. Next we load the dataset using pandas and check a sample of the records.

    !pip install pycaret

    import pandas as pd
    import numpy as np
    from pycaret.classification import *

    train = pd.read_csv("../titanic/train.csv")
    train.head()

Output: the first few rows of the training data.

PyCaret has a setup function that registers the data file and the target variable. Preprocessing is also part of the setup function, and we will keep adding preprocessing parameters to it as we go.

Let's start off with missing values. Survived is our target variable.

    clf = setup(data = train, target = 'Survived',
                numeric_imputation = "mean",
                categorical_imputation = "constant")

Here we impute the numeric variables with their mean and the categorical variables with "not available". These are the default imputation parameters, so the imputations take place even if we don't explicitly pass them to the setup function. The other options available are "median" for numeric and "mode" for categorical features.
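
For instance, a minimal sketch of those alternative strategies, assuming the same train DataFrame and imports from above:

    # Alternative imputation strategies: median for numeric features,
    # most frequent value (mode) for categorical features
    clf = setup(data = train, target = 'Survived',
                numeric_imputation = "median",
                categorical_imputation = "mode")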

One-hot encoding is the next step, since categorical variables must be transformed into numeric form before they can be fed to a machine learning model. We save time here because PyCaret automatically one-hot encodes all categorical variables when we set up the data (using the setup function).
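
If you want to verify the encoding, one way (assuming PyCaret 2.x, where get_config exposes the transformed training data) is to pull the processed feature matrix out of the experiment:

    # get_config('X') returns the feature matrix after setup's transformations,
    # so the newly created one-hot columns should be visible here
    X_transformed = get_config('X')
    print(X_transformed.columns)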

Your categorical variables may have an ordinality that would be lost if they were one-hot encoded. For that there is the "ordinal_features" parameter; the levels must be listed from lowest to highest when defining it. In our dataset we can use it for the passenger class variable.

    # Pclass takes the values 1, 2 and 3 in the Titanic data; listed here
    # from lowest (3rd class) to highest (1st class)
    clf = setup(data = train, target = 'Survived',
                numeric_imputation = "mean",
                categorical_imputation = "constant",
                ordinal_features = {'Pclass' : ['3', '2', '1']})

PyCaret identifies Pclass as a categorical variable even though it is of numeric type, because of the small number of unique values in this field.
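
If you ever disagree with the inferred type, setup lets you override it explicitly. A small hedged sketch (numeric_features and categorical_features are setup arguments; check your PyCaret version):

    # Force Pclass to be treated as a plain numeric feature instead of a categorical one
    clf = setup(data = train, target = 'Survived',
                numeric_features = ['Pclass'])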

Sometimes we come across categories in real-world/test data that were not present while training our models. PyCaret gives us a way to handle these too: the "handle_unknown_categorical" and "unknown_categorical_method" parameters.

    clf = setup(data = train, target = 'Survived',
                numeric_imputation = "mean",
                categorical_imputation = "constant",
                ordinal_features = {'Pclass' : ['3', '2', '1']},
                handle_unknown_categorical = True,
                unknown_categorical_method = "least_frequent")

The default values for these parameters are True and "least_frequent". We can turn this behaviour off by setting handle_unknown_categorical to False, or change the method to "most_frequent". An unknown category then gets replaced by the most frequent or least frequent category, as per our choice.
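
For completeness, a minimal sketch of the "most_frequent" variant (the imputation arguments are dropped here only for brevity):

    # Replace categories unseen during training with the most frequent level instead
    clf = setup(data = train, target = 'Survived',
                handle_unknown_categorical = True,
                unknown_categorical_method = "most_frequent")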

Normalisation/scaling is a critical component of any preprocessing pipeline. We cannot expect correct output without scaling our data for models that rely on Euclidean distance.

PyCaret provides the normalize and normalize_method parameters for scaling. The methods available are "zscore", "minmax", "maxabs" and "robust", each with a scikit-learn equivalent. "zscore" applies standard scaling and is the default method.

    clf = setup(data = train, target = 'Survived',
                numeric_imputation = "mean",
                categorical_imputation = "constant",
                ordinal_features = {'Pclass' : ['3', '2', '1']},
                handle_unknown_categorical = True,
                unknown_categorical_method = "least_frequent",
                normalize = True,
                normalize_method = "zscore")

Binning continuous features is another important part of feature engineering and can in some cases refine the model's input. PyCaret provides the bin_numeric_features parameter for this transformation: Sturges' rule is used to determine the number of bins and K-means is used to assign values to them.

    clf = setup(data = train, target = 'Survived',
                numeric_imputation = "mean",
                categorical_imputation = "constant",
                ordinal_features = {'Pclass' : ['3', '2', '1']},
                handle_unknown_categorical = True,
                unknown_categorical_method = "least_frequent",
                normalize = True,
                normalize_method = "zscore",
                bin_numeric_features = ["Age"])

Output: data setup snapshot.

What we see above is the summary PyCaret prints when the setup runs.

2. Model Comparison

Normally we compare models after building each of them, but PyCaret gives us the option to evaluate multiple algorithms up front. The scoring is done on a holdout/test set.

    compare = compare_models()

Output: model comparison table.

Just this one line of code gives us a comparison of 15 algorithms, scored on Accuracy, AUC, Recall, Precision, F1 score, Kappa and MCC. By default the list is sorted by the best accuracy score.

Some parameters can be used along with the compare_models function, as shown below.

    top5 = compare_models(n_select = 5)                                    # top 5 models
    best = compare_models(sort = 'AUC')                                    # sorted by AUC
    best_specific = compare_models(whitelist = ['lr', 'knn', 'dt'])        # compare only these three models
    best_specific = compare_models(blacklist = ['catboost', 'xgboost'])    # compare all models except CatBoost and XGBoost

3. Model Creation

From the comparison we did earlier we can narrow down our list of choices for modelling. Since CatBoost gave us the best accuracy score, let us use it to create our model.

    cat_boost = create_model('catboost')

Output: create_model cross-validation results.

The output of model creation is a 10-fold cross-validation result. We can choose whether to cross-validate and how many folds to use via the cross_validation and fold parameters. The accuracy averages out to 82.54%, which is what we got during the comparison. Let us see if we can improve on this.
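
A hedged sketch of those two options (parameter names as described above; exact behaviour may differ between PyCaret versions):

    # Use 5-fold cross-validation instead of the default 10
    cat_boost_5fold = create_model('catboost', fold = 5)

    # Skip cross-validation entirely and train on the full training split
    cat_boost_no_cv = create_model('catboost', cross_validation = False)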

4. Hyperparameter Tuning

Hyperparameter tuning in PyCaret is also a single line of code. It is done through a random grid search over predefined grids, which can be customized.

    tuned_catboost = tune_model(cat_boost)

Output: hyperparameter tuning results.

Since the results deteriorated, we will continue without tuning. If we want to keep tuning manually, we can pass a customized grid of hyperparameters to try out.

    params = { ... }  # a custom grid of CatBoost hyperparameters to try
    tuned_catboost = tune_model(cat_boost, custom_grid = params)

5. Ensembling

PyCaret also gives us the ability to ensemble models. The entire spectrum of ensembling is at our disposal: bagging, boosting, blending and stacking.

We will apply boosting to our dataset, here AdaBoost with 100 estimators. We could also change the method to Bagging.

    boosted_dt = ensemble_model(cat_boost, method = 'Boosting', n_estimators = 100)

Output: ensemble results.

There is no improvement here. We can also try stacking and blending to check for better results.
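
As a hedged sketch of what that would look like (blend_models and stack_models are PyCaret's blending/stacking helpers; the base models chosen here, logistic regression, KNN and a decision tree, are just illustrative):

    # Train a few base models to combine
    lr = create_model('lr')
    knn = create_model('knn')
    dt = create_model('dt')

    # Blending: vote over the base models' predictions
    blended = blend_models(estimator_list = [lr, knn, dt])

    # Stacking: a meta-model learns from the base models' predictions
    stacked = stack_models(estimator_list = [lr, knn, dt], meta_model = cat_boost)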

6. Deployment

We will split deployment into two parts: 1. finalizing the model to make predictions, and 2. deploying the model on AWS.

Only the first part is covered here; deployment to AWS is described in the PyCaret documentation.
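
For reference, a hedged sketch of what the PyCaret side of that could look like (save_model, load_model and deploy_model are PyCaret helpers; the bucket name and locally configured AWS credentials are assumptions, not part of the original walkthrough):

    # Persist the trained pipeline locally as a pickle file
    save_model(cat_boost, 'catboost_titanic')

    # ...and reload it later
    loaded_model = load_model('catboost_titanic')

    # Push the model to an S3 bucket (requires AWS credentials configured locally;
    # the bucket name below is a placeholder)
    deploy_model(cat_boost, model_name = 'catboost_titanic',
                 platform = 'aws',
                 authentication = {'bucket': 'my-pycaret-models'})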

First, let's finalize the model and make predictions on a test set. When we finalize it, the model is retrained on the whole training data, i.e. including the test/holdout split.

    # The Kaggle Titanic test set is assumed to sit alongside train.csv
    test = pd.read_csv("../titanic/test.csv")

    catboost_final = finalize_model(cat_boost)
    test_predictions = predict_model(catboost_final, data = test)
    test_predictions.head()

Output: predictions on the test set.

The Label column holds the predictions for the test set.
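
For instance, assuming PyCaret's usual output columns (a Label column with the predicted class and a Score column with its probability), building a Kaggle-style submission file would look roughly like this:

    # Keep the passenger id and the predicted class, renamed to the column Kaggle expects
    submission = test_predictions[['PassengerId', 'Label']].rename(columns = {'Label': 'Survived'})
    submission.to_csv('submission.csv', index = False)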

I hope you enjoyed reading through this guide. It can help you prototype rapidly when starting a project, decide which direction to take, and, when time is short, get very good results quickly.

Translated from: https://medium.com/@ankur.salunke/pycaret-prepare-your-machine-learning-model-in-minutes-db7525d8615
