Bank Data Classification (Part 1)

    This is Part 1 of 4 short posts about the four different machine learning algorithms that were used on the bank data.

    Random Forest

    Random forest is an ensemble method that samples a random subset of features at each split and uses Bootstrap Aggregation (bagging) to classify. Bagging is a sampling technique that samples the data with replacement for each tree. We can then use the Out of Bag data, the roughly one third of the data left out of each bootstrap sample, to measure the performance of each tree, essentially testing each tree on unseen data to measure the model's performance.

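    As a quick aside, here is a minimal sketch (not from the original post) of why roughly one third of the rows end up "out of bag" for any given tree: sampling n rows with replacement leaves each row out with probability (1 - 1/n)^n ≈ 1/e ≈ 0.37.

    import numpy as np

    # Illustrative only: draw one bootstrap sample and count the rows it never touches.
    rng = np.random.default_rng(19)
    n = 10_000
    bootstrap_idx = rng.integers(0, n, size=n)         # sample row indices with replacement
    oob_mask = ~np.isin(np.arange(n), bootstrap_idx)   # rows left out of this bootstrap sample
    print(oob_mask.mean())                             # ~0.37, i.e. roughly one third of the data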

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    random_forest = RandomForestClassifier(random_state=19, oob_score=True)

    # Random Forest grid search
    param_forest = {"n_estimators": [300, 400], "max_depth": [5, 10, 15, 20]}
    grid_forest = GridSearchCV(random_forest, param_grid=param_forest, cv=10)

    # Fitting training data to the model
    grid_forest.fit(X_train_new, y_train_new)

    The code above shows us importing the Random Forest Classifier, instantiating it, and creating a grid search where the parameters that will be tuned are "n_estimators", the number of trees in the forest, and "max_depth", how deep each tree is allowed to go (the deeper the tree, the more splits it makes).

    After fitting the model we can look at the best parameters and the cross validation results:

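    A minimal sketch of how those could be inspected (the original post showed them as an image); best_params_ and best_score_ are standard GridSearchCV attributes:

    # Best hyper-parameters found by the grid search
    print(grid_forest.best_params_)
    # Mean cross-validated accuracy of the best parameter combination
    print(grid_forest.best_score_)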

    The cross-validation results show that we should expect an accuracy score of around 70% on our test data.

    # Accuracy on the training data vs. the held-out test data
    display(grid_forest.score(X_train_new, y_train_new))
    display(grid_forest.score(X_test, y_test))

    The code above gave poor results, showing a large gap between the training and test scores, i.e. over-fitting.

    The training score was 93% while the test score was 70%. Although this is a horrible accuracy score, it is not what we ultimately care about.

    Our dependent variable is binary, and we care about which characteristics people who pay off their loan in full tend to have, helping banks focus on which characteristics need to be strong before granting a loan. This means it's best to use the confusion matrix, another performance metric for classification machine learning models.

    Confusion Matrix

    Why focus on the confusion matrix and not accuracy or AUC ROC scores? This is because we have imbalanced data. If you recall, we used SMOTE because our data was heavily imbalanced. SMOTE was applied only to our training data, so it makes sense that we are seeing biased results. Accuracy tells us that our model predicted 93% of the points in the correct class, but it does not specify which class. AUC ROC is also not a good metric to use on imbalanced data.

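    As a toy illustration (not from the original post) of why accuracy alone misleads on imbalanced data: with 95 negatives and 5 positives, a model that always predicts the majority class is 95% accurate while recalling none of the positives we actually care about.

    # Hypothetical, deliberately imbalanced example
    y_true = [0] * 95 + [1] * 5
    y_pred = [0] * 100                      # always predict the majority class
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    print(accuracy)                         # 0.95, despite catching zero positives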

    # Able to determine metrics for a confusion matrix
    def confusion_matrix_metrics(TN:int, FP:int, FN:int, TP:int, P:int, N:int):
        print("TNR:", TN/N)
        print("FPR:", FP/N)
        print("FNR:", FN/P)
        print("TPR:", TP/P)
        print("Precision:", TP/(TP+FP))  # % of our positive predictions that we predicted correctly.
        print("Recall:", TP/(TP+FN))  # % of ground truth positives that we predicted correctly.
        print("F1 Score:", (2*TP)/((2*TP) + (FP + FN)))  # the harmonic mean of precision and recall; a better measure than accuracy here.

    print('Confusion Matrix - Testing Dataset')
    print(pd.crosstab(y_test, grid_forest.predict(X_test), rownames=['True'], colnames=['Predicted'], margins=True))

    Confusion Matrix on test data

    The code above shows a function that calculates TPR, TNR, FNR, FPR, Precision, Recall, and F1 Score.

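    The post does not show the function being called, but a hypothetical usage could look like the sketch below, assuming the classes are encoded as 0/1 so the crosstab cells can be read off directly:

    # Illustrative call only; the cell positions assume labels 0 (negative) and 1 (positive)
    cm = pd.crosstab(y_test, grid_forest.predict(X_test), rownames=['True'], colnames=['Predicted'])
    TN, FP = cm.loc[0, 0], cm.loc[0, 1]
    FN, TP = cm.loc[1, 0], cm.loc[1, 1]
    confusion_matrix_metrics(TN, FP, FN, TP, P=TP + FN, N=TN + FP)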

    These scores will help us determine how the model is classifying the data:

    Confusion Matrix scores on test data

    The image above shows us how the model did at classifying the positive (1) points.

    Feature Importance

    We can see what features were important to the model:

    # Graphing
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(15, 10))
    ax.barh(width=rf_clf.feature_importances_, y=X_train_new.columns)

    Feature importance

    The code above provides a visualization of how our machine learning model decides which features are more important than others; the higher the score, the more important the feature. From this, we gain information on what is more important for banks to look at when deciding whether to give someone a loan.

    The graph above shows that there are many features that the Random Forest model categorized as important. We will have to use our best judgement on where to set our threshold in order to select the main important features:

    # Selecting the top features at a cap of 0.05
    top_important_features = np.where(rf_clf.feature_importances_ > 0.05)
    print(top_important_features)
    print(len(top_important_features[0]))  # Number of features that qualify

    # Extracting the top feature column names
    top_important_feature_names = [column for column in X_train_new.columns[top_important_features]]
    top_important_feature_names

    The code above shows that we set our threshold to greater than 0.05, and in return we receive the features above as the more paramount ones.

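    One optional way to sanity-check a threshold like 0.05 (not shown in the original post) is to look at the importances ranked from highest to lowest first; a minimal sketch, assuming rf_clf is the fitted forest (for example grid_forest.best_estimator_):

    import pandas as pd

    # Rank the feature importances so a natural cut-off point is easier to see
    importances = pd.Series(rf_clf.feature_importances_, index=X_train_new.columns)
    print(importances.sort_values(ascending=False).head(10))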

    Further Steps

    We can also go a step further: take these features and create a new subset of the data with only these paramount features as our new independent variables, run them through our Random Forest grid search, view the cross-validation results, and then run the result through a confusion matrix to see how our model performs with the new important features.

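    A minimal sketch of what those further steps could look like, reusing the names defined earlier (top_important_feature_names, param_forest, and the training and test splits); this is an assumption about the workflow, not the author's exact code:

    # Subset the data down to the important features only
    X_train_top = X_train_new[top_important_feature_names]
    X_test_top = X_test[top_important_feature_names]

    # Re-run the same Random Forest grid search on the reduced feature set
    grid_forest_top = GridSearchCV(RandomForestClassifier(random_state=19, oob_score=True),
                                   param_grid=param_forest, cv=10)
    grid_forest_top.fit(X_train_top, y_train_new)

    # Confusion matrix on the test data for the reduced model
    print(pd.crosstab(y_test, grid_forest_top.predict(X_test_top),
                      rownames=['True'], colnames=['Predicted'], margins=True))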

    I have already finished these steps and will just show the end results, because it would be a lot of work to include them all in this post.

    Further steps: Confusion Matrix on test data

    The image above shows the results using only the important features. It looks like we get a lower F1 score when restricting the model to the important features.

    Why Random Forest?

    Since this is the end, I feel like this would be a good time to pull a Tarantino and explain the beginning. I use a Random Forest model because it is great at handling large binary data. Random forest is also good at preventing over-fitting, and I could feed the model our raw but cleaned data, allowing me to pass it through without feature scaling.

    Translated from: https://medium.com/analytics-vidhya/bank-data-classification-part-1-6b61506086cd
