Machine Learning Regression Prediction


    Introduction: The applications of machine learning range from games to autonomous vehicles; one very interesting application is in education. Using machine learning regression algorithms, we can take a student dataset and predict the grades that students will obtain in their exams. This is an interesting application because it allows teachers to predict students' grades early, before the exams, and to find ways to assist the students who are not expected to perform well. This article provides a detailed explanation of how to use Python to carry out this machine learning prediction task.

    Dataset: This study considers data collected during the 2005–2006 school year from two public schools in the Alentejo region of Portugal. The database was built from two sources: school reports and questionnaires, covering several demographic (e.g. mother's education, family income), social/emotional (e.g. alcohol consumption) and school-related (e.g. number of past class failures) variables that are expected to affect student performance.

    The datasets used for this project are publicly available on Kaggle and can be downloaded from these URLs:

    - https://www.kaggle.com/ozberkgunes/makineogrenmesiodev2-student-grande-prediction/data

    - https://www.kaggle.com/imkrkannan/student-performance-data-set-y-uci

    The dimensions in the dataset are all explained and summarized in the table below.

    Data Pre-processing: Before we can apply our regression algorithms to our dataset, we first need to pre-process it to make sure we have handled both missing and categorical values.

    First, we check for empty values in our dataset by using the isnull and sum functions, as shown in the code snippet below.

    Code snippet for reading the dataset and checking for null values
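    The snippet itself appears only as an image in the source article; a minimal sketch of this step might look like the following, where the file name student-mat.csv is an assumption about how the downloaded Kaggle file is saved.

        import pandas as pd

        # Read the downloaded dataset (file name is an assumption;
        # depending on the file, sep=";" may also be required).
        df = pd.read_csv("student-mat.csv")

        # isnull() flags empty cells; sum() totals them per column.
        print(df.isnull().sum())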

    We find that we only have 1 empty value in each column; since this is an insignificant number of empty rows, we simply drop all rows containing empty values by using the dropna function.

    Code snippet for dropping all rows with null values
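    Again, the original snippet is an image; a sketch of the dropna step, under the same assumptions as above, could be:

        # Drop every row that still contains a missing value.
        df = df.dropna()

        # Verify that no missing values remain.
        print(df.isnull().sum().sum())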

    Selecting the Columns to Use for Regression Using Correlation: There are many ways to select the columns for regression; some of these include using p-values, using their correlation, or using a feature selection method. In this case our target column is G3 (the students' final exam result), and we decided to make use of a heatmap showing the correlation between all columns.

    Code snippet to create correlation heat map
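    The heat map snippet is shown as an image in the source; a sketch of how such a heat map could be produced with pandas and seaborn (both assumed to be installed) is:

        import matplotlib.pyplot as plt
        import seaborn as sns

        # Correlation is only defined for numeric columns.
        corr = df.select_dtypes(include="number").corr()

        plt.figure(figsize=(12, 10))
        sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
        plt.title("Correlation between dataset columns")
        plt.show()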

    From the heat map we find that the columns most strongly correlated with G3 are G1, G2, Medu and failures, and hence these are the columns we will use for our regression. G1 and G2 represent the student's performance in previous assessments. This is not surprising, as we would expect students who performed well before to most likely perform well again in the final assessment. Medu represents the mother's level of education, while failures represents the number of assessments the student has previously failed. Neither of these properties is surprising either, as anyone would expect a student with very few previously failed courses to do well in an exam.

    Applying Regression Algorithms: The selected columns are all non-categorical values, so there is no need to use any method such as one-hot encoding to handle categorical data; we can move straight to applying our regression algorithms to the dataset with our selected columns (G1, G2, Medu and failures).

    I created a function to run the regression models; it accepts the algorithm names as parameters along with the regression objects, and prints out each resulting model's accuracy and RMSE.

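    The function itself is shown only as an image in the source article; a sketch of what it might look like, assuming it receives (name, model) pairs together with the train/test split produced below, and that "accuracy" refers to the R^2 score returned by score(), is:

        from math import sqrt
        from sklearn.metrics import mean_squared_error

        def run_reg_models(models, X_train, X_test, y_train, y_test):
            """Fit each (name, regressor) pair and report accuracy and RMSE."""
            for name, model in models:
                model.fit(X_train, y_train)
                predictions = model.predict(X_test)
                accuracy = model.score(X_test, y_test)  # R^2 on the test set
                rmse = sqrt(mean_squared_error(y_test, predictions))
                print(f"{name}: accuracy = {accuracy:.2f}, RMSE = {rmse:.2f}")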

    For this project I selected 5 different regression algorithms to use on the dataset:

    - Linear regression

    - Ridge regression

    - Lasso regression

    - Elastic Net regression

    - Orthogonal Matching Pursuit CV regression

    Before proceeding to train our regression models, we need to split our dataset into training and testing data. This is very important, as we don't want to train and test our model on the same set of data; hence the need for the split. We achieve this with the code snippet below:

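    The split snippet appears as an image in the source; a sketch of it with scikit-learn, where the 80/20 split ratio and random_state are assumptions rather than values stated in the article, is:

        from sklearn.model_selection import train_test_split

        # Features chosen from the heat map, and the target column G3.
        X = df[["G1", "G2", "Medu", "failures"]]
        y = df["G3"]

        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=42
        )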

    We simply create an array of these regression models and pass it to the run_reg_models function, as shown in the code snippet below:

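    A sketch of that step, assuming the scikit-learn implementations of the five algorithms listed above with their default parameters, might be:

        from sklearn.linear_model import (
            LinearRegression,
            Ridge,
            Lasso,
            ElasticNet,
            OrthogonalMatchingPursuitCV,
        )

        models = [
            ("Linear regression", LinearRegression()),
            ("Ridge regression", Ridge()),
            ("Lasso regression", Lasso()),
            ("Elastic Net regression", ElasticNet()),
            ("Orthogonal Matching Pursuit CV", OrthogonalMatchingPursuitCV()),
        ]

        run_reg_models(models, X_train, X_test, y_train, y_test)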

    Results and Future Work:

    The table above shows the resulting accuracies for the different regressor models, with Linear, Ridge and Orthogonal Matching Pursuit CV having the highest accuracy at 82% and the others at 81%. There is not a lot of difference between the different regression models in either their accuracy or their RMSE.

    An accuracy of 82% is okay, but we would still need to fine-tune the hyper-parameters, i.e. test with different parameters for the different regression algorithms. If we do this, we may get higher accuracy values. It's also important to note that this is not the final accuracy; we would still need to test our model(s) with external datasets to see how robust the model is.

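    As an illustration of what such tuning could look like (this is not part of the original article's code), a grid search over Ridge's regularisation strength with scikit-learn might be sketched as:

        from sklearn.linear_model import Ridge
        from sklearn.model_selection import GridSearchCV

        # The alpha grid below is purely illustrative.
        param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}
        search = GridSearchCV(Ridge(), param_grid, cv=5, scoring="r2")
        search.fit(X_train, y_train)

        print(search.best_params_, search.best_score_)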

    There are many ways to select a model, e.g. by time to train, time to predict, and many other criteria, but in this case we will use the model with the lowest RMSE, as they all have similar accuracies and all take a similar time to run.

    Our selected model is the Orthogonal Matching Pursuit CV, as it has the lowest RMSE. In the future, we will play around with the hyperparameters to see how much of a difference they make to both the accuracy and the RMSE, and if the difference is large, we will redo our model selection.

    A future work will be to test our models with an external dataset (any dataset like this one, but one that our model has not been trained on before) and see how well our model performs. It is also important to note that the dataset used in this article comes from a study carried out 15 years ago; another possible future work will be to find a newer dataset, test our model on it and see the result.

    Extra: Using A Model For Prediction

    It's interesting that most articles don't include a section where they showcase the prediction abilities of the model. Here, I have added this section so that anyone who is a bit confused understands exactly how to use our models for predictions. Our selected model is the Orthogonal Matching Pursuit CV, and hence it is what we use for the prediction, as shown in the code snippet below:

    Code snippet for using our selected model for making predictions
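    The prediction snippet is an image in the source article; a sketch of the idea, where the single student's feature values below are made up purely for illustration, is:

        import pandas as pd
        from sklearn.linear_model import OrthogonalMatchingPursuitCV

        # Fit the selected model on the training data.
        model = OrthogonalMatchingPursuitCV()
        model.fit(X_train, y_train)

        # One hypothetical student: G1 = 14, G2 = 15, Medu = 3, failures = 0.
        new_student = pd.DataFrame(
            [[14, 15, 3, 0]], columns=["G1", "G2", "Medu", "failures"]
        )
        print(model.predict(new_student))  # predicted final grade (G3)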

    Our regressor isn't 100% accurate, but it's pretty close and hence could really be used to predict students' grades quite well.

    It's possible to use the other models to make predictions as well, and it's something you might be interested in trying out as an extra exercise.

    Thank you for reading my article; please reach out to me if you have any questions.

    Translated from: https://medium.com/@kole.audu/predicting-high-school-students-grades-with-machine-learning-regression-3479781c185c
