Data Science: A Road to Safer Roads

    A treatise on Seattle’s car collision data

    Hindsight is a wonderful thing but foresight is better, especially when it comes to saving life, or some pain!

    — William Blake

Prerequisite: Basic knowledge of data processing and machine learning using Python

Introduction

According to statistics from the WHO (7th Feb 2020):

    Every year the lives of approximately 1.35 million people are cut short as a result of a road traffic crash. Between 20 and 50 million more people suffer non-fatal injuries, with many incurring a disability as a result of their injury.

    According to the National Safety Council, traffic collisions cause more than 40,000 deaths and injure thousands of people every year across the United States. These are not traffic accidents, but entirely preventable tragedies.

Objective

Undoubtedly, predicting the severity of car accidents (based on historic data) is crucial, and machine learning is ideally suited for this purpose. In this article, we are going to explore, analyze and model real-world car collision data published by the Govt. of Seattle. Along the way, we will touch upon cleaning data, data imputation, feature selection, tackling extremely skewed data and, finally, evaluating different AI model formulations.

Interests

Besides saving lives, the prediction has several practical utilities:

Safe route planning

Emergency vehicle allocation

Roadway design

Reduce property damage

Where to place additional signage (e.g. to warn of accident-prone areas)

Data Loading

    The car collision data is obtained from Seattle Govt’s website (Time frame: 2004 to Present).

Upon loading the data into a Pandas dataframe (df), we can immediately see its profile (using pandas-profiling).
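As a minimal sketch, assuming the CSV export from the portal is saved locally under the hypothetical name Collisions.csv, the loading and profiling steps could look like this:

import pandas as pd
from pandas_profiling import ProfileReport

# Load the collision data; low_memory=False avoids mixed-type warnings
# on this wide CSV export.
df = pd.read_csv('Collisions.csv', low_memory=False)

# Generate an HTML profile report covering every column
profile = ProfileReport(df, title='Seattle Collision Data Profile')
profile.to_file('collision_profile.html')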

    The overview of the data profile report:

Data profile report generated by pandas-profiling

We see that the severity level of car accidents to be predicted (SEVERITYCODE) is listed against 39 independent variables (features). Some important features are summarized below to give a feel for what we are dealing with (readers are urged to have a quick look at the entire list published on the Govt. of Seattle's website).

Exploratory Data Analysis (EDA)

In this phase, we concentrate on cleaning up missing data, value imputation, value re-grouping and data visualization.

A) Severity Code: Severity Code (SEVERITYCODE) is the target/dependent variable. Let us scrutinize it first. There are five codes: 0, 1, 2, 2b and 3. Since we are not going to predict an 'Unknown' severity (SEVERITYCODE = 0), these observations, along with the rows with missing values, can safely be deleted. The categorical values can also be remapped to a scale of 1 to 4, where 3 is assigned to 4 and 2b to 3. The new mapping is as follows.
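A minimal sketch of this clean-up and remapping, assuming the codes arrive as the strings '0', '1', '2', '2b' and '3':

# Drop rows with an 'Unknown' (0) or missing severity code
df = df[df['SEVERITYCODE'] != '0'].dropna(subset=['SEVERITYCODE'])

# Remap to an ordinal 1-4 scale: 2b becomes 3, and 3 becomes 4
severity_map = {'1': 1, '2': 2, '2b': 3, '3': 4}
df['SEVERITYCODE'] = df['SEVERITYCODE'].map(severity_map)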

It is to be noted that the highest severity (4 = Fatal) accounts for only 0.18% of the observations; in other words, the distribution is extremely skewed.

Count of Severity Codes

    B) Date Time:

Count of Accidents by Hour
Count of Accidents by Weekday (Day 0: Monday)
Count of Accidents by Month
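The hour, weekday and month used in these counts can be derived from the incident timestamp; a sketch, assuming the raw timestamp column is named INCDTTM as in the published dataset:

# Parse the timestamp; unparseable entries become NaT
df['INCDTTM'] = pd.to_datetime(df['INCDTTM'], errors='coerce')

# Derive the parts plotted above
df['HOUR'] = df['INCDTTM'].dt.hour
df['WEEKDAY'] = df['INCDTTM'].dt.weekday   # 0 = Monday
df['MONTH'] = df['INCDTTM'].dt.month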

C) Fixing missing values: We find that the features WEATHER, JUNCTIONTYPE, LIGHTCOND and ROADCOND share a common trait: they have both missing values and a pre-existing category named 'Unknown'. We assign the missing values to 'Unknown'.

The features ADDRTYPE, COLLISIONTYPE, PEDROWNOTGRNT, SPEEDING and INATTENTIONIND do not have a built-in 'Unknown' category, but we treat them as above.
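Both groups of features can be imputed in one pass; a sketch:

# Features that either already have, or now receive, an 'Unknown' category
unknown_cols = ['WEATHER', 'JUNCTIONTYPE', 'LIGHTCOND', 'ROADCOND',
                'ADDRTYPE', 'COLLISIONTYPE', 'PEDROWNOTGRNT',
                'SPEEDING', 'INATTENTIONIND']
df[unknown_cols] = df[unknown_cols].fillna('Unknown')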

    D) Visualization: Here are some visualizations of different features that give us a first qualitative impression of how they are distributed.

Count of Accidents by Weather Condition
Count of Accidents by Junction Type
Count of Accidents by Light Condition
Count of Accidents by Road Condition Type
Count of Accidents by Address Type where the collision took place
Count of Accidents by Collision Type
Count of Accidents by Collision Code
Count of Accidents by flag describing if a parked car was hit

E) Under Influence Flag: The flag UNDERINFL, indicating whether the driver was under the influence of alcohol/drugs etc., holds ambiguous data ('Y', 1, 'N', 0 and missing values), but we can assume 'N' maps to 0 and 'Y' maps to 1.
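A sketch of the harmonization; treating missing values as 0 (not under the influence) is an assumption here:

# Collapse the mixed 'Y'/'N'/1/0 encoding into a single 0/1 flag
df['UNDERINFL'] = (df['UNDERINFL']
                   .replace({'Y': 1, 'N': 0, '1': 1, '0': 0})
                   .fillna(0)      # assumption: missing means not under influence
                   .astype(int))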

Count of under-influence flag

    F) Inspect Injuries, Serious Injuries and Fatalities Features: There are three variables: Injuries, Serious injuries and Fatalities. Let us see how the numbers are distributed among the four severity codes we have.

The matrix indicates a very strong correlation with severity.

Correlation of the injury variables
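A minimal sketch of computing this correlation matrix; the column names INJURIES, SERIOUSINJURIES and FATALITIES follow the published schema:

# Pairwise correlation between the injury counts and the severity code
injury_cols = ['INJURIES', 'SERIOUSINJURIES', 'FATALITIES', 'SEVERITYCODE']
df[injury_cols].corr()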

Apart from severity code = 1 ("Property Damage Only Collision"), the severity code is assigned based on the injury level, so the former is a direct reflection of the latter. If we used the injury features as predictors, it is easily understood that they would overwhelm the other features, and the prediction would be based on the after-effects of a collision. Therefore, these three features will be ignored.

    G) Grouping junction type and collision type with severity code:

H) Mode values of the features: It is interesting to see the mode (highest frequency) value of each feature for each severity code (where S indicates severity code).
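One hedged way to tabulate these per-severity modes with pandas:

# Most frequent value of every feature within each severity code
modes = df.groupby('SEVERITYCODE').agg(
    lambda s: s.mode().iat[0] if not s.mode().empty else None)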

Feature Selection

    Let us select the features of interest (ADDRTYPE, COLLISIONTYPE, JUNCTIONTYPE, INATTENTIONIND, UNDERINFL, WEATHER, ROADCOND, LIGHTCOND, PEDROWNOTGRNT, SPEEDING, ST_COLCODE, HITPARKEDCAR), and one-hot encode them.

Finally, all the columns having 'Unknown' in their headers (produced during one-hot encoding) are discarded, as they are both in the minority and uninformative.
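A sketch of the selection, encoding and pruning described above:

features = ['ADDRTYPE', 'COLLISIONTYPE', 'JUNCTIONTYPE', 'INATTENTIONIND',
            'UNDERINFL', 'WEATHER', 'ROADCOND', 'LIGHTCOND',
            'PEDROWNOTGRNT', 'SPEEDING', 'ST_COLCODE', 'HITPARKEDCAR']

# One-hot encode the selected features, keeping the target column
df = pd.get_dummies(df[features + ['SEVERITYCODE']], columns=features)

# Discard the uninformative 'Unknown' indicator columns
df = df.drop(columns=[c for c in df.columns if 'Unknown' in c])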

Classification Models for Multi-class, Skewed Distribution

This extremely skewed, multi-class data may not be amenable even to the specialized classification models that deal with unbalanced data; nevertheless, we go ahead and get a taste of their performance. Here we have chosen the following classifiers, capable of handling unbalanced data inherently, for our study:

    i) Bagging

    ii) Balanced Bagging

    iii) Balanced Random Forest and

    iv) EasyEnsemble

import itertools

import matplotlib.pyplot as plt
import numpy as np
from sklearn import preprocessing  # provides StandardScaler used below
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import balanced_accuracy_score
from sklearn.ensemble import BaggingClassifier
from imblearn.ensemble import BalancedBaggingClassifier
from imblearn.ensemble import BalancedRandomForestClassifier
from imblearn.ensemble import EasyEnsembleClassifier

# --------------------------------------------------------------------------------
# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>
# License: MIT
# --------------------------------------------------------------------------------
def plot_confusion_matrix(cm, classes, ax,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    ax.imshow(cm, interpolation='nearest', cmap=cmap)
    ax.set_title(title)
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.sca(ax)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        ax.text(j, i, format(cm[i, j], fmt),
                horizontalalignment="center",
                color="white" if cm[i, j] > thresh else "black")

    ax.set_ylabel('True label')
    ax.set_xlabel('Predicted label')
# --------------------------------------------------------------------------------

# Target variable
target = 'SEVERITYCODE'

# Set X and y
y = df[target]
X = df.drop(target, axis=1)
X = preprocessing.StandardScaler().fit(X).transform(X)

# Split the data set into training and testing data sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=21, stratify=y)

bagging = BaggingClassifier(n_estimators=50, random_state=0, n_jobs=-1)
balanced_bagging = BalancedBaggingClassifier(n_estimators=50, random_state=0, n_jobs=-1)
brf = BalancedRandomForestClassifier(n_estimators=50, random_state=0, n_jobs=-1)
eec = EasyEnsembleClassifier(n_estimators=10, n_jobs=-1)

bagging.fit(X_train, y_train)
balanced_bagging.fit(X_train, y_train)
brf.fit(X_train, y_train)
eec.fit(X_train, y_train)

y_pred_bc = bagging.predict(X_test)
y_pred_bbc = balanced_bagging.predict(X_test)
y_pred_brf = brf.predict(X_test)
y_pred_eec = eec.predict(X_test)

fig, ax = plt.subplots(ncols=4, figsize=(20, 20))

cm_bagging = confusion_matrix(y_test, y_pred_bc)
plot_confusion_matrix(cm_bagging, classes=np.unique(df[target]), ax=ax[0],
                      title='Bagging\nBalanced accuracy: {:.2f}'.format(
                          balanced_accuracy_score(y_test, y_pred_bc)))

cm_balanced_bagging = confusion_matrix(y_test, y_pred_bbc)
plot_confusion_matrix(cm_balanced_bagging, classes=np.unique(df[target]), ax=ax[1],
                      title='Balanced bagging\nBalanced accuracy: {:.2f}'.format(
                          balanced_accuracy_score(y_test, y_pred_bbc)))

cm_brf = confusion_matrix(y_test, y_pred_brf)
plot_confusion_matrix(cm_brf, classes=np.unique(df[target]), ax=ax[2],
                      title='Balanced Random Forest\nBalanced accuracy: {:.2f}'.format(
                          balanced_accuracy_score(y_test, y_pred_brf)))

cm_eec = confusion_matrix(y_test, y_pred_eec)
plot_confusion_matrix(cm_eec, classes=np.unique(df[target]), ax=ax[3],
                      title='Balanced EasyEnsemble\nBalanced accuracy: {:.2f}'.format(
                          balanced_accuracy_score(y_test, y_pred_eec)))

plt.show()

    The confusion matrices generated by the above code:

Confusion matrices along with balanced accuracy for different models

We see that in all four cases, balanced accuracy did not even cross 50%.

Multi-class to Two-class

Often, a skewed multi-class classification problem is converted to a two-class problem by pitting the minority class against the group of all remaining classes. In our situation, accidents with severity level 4 are fatal and the others are non-fatal. Therefore, we can focus on level 4 accidents and regroup the levels of severity into level 4 versus all other levels. In this process, a new column 'Severity 4' is created.
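The regrouping itself is a one-liner; a sketch:

# 1 for fatal (severity 4) accidents, 0 for everything else
df['Severity 4'] = (df['SEVERITYCODE'] == 4).astype(int)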

Data Balancing

As seen above, severity 4 is extremely rare; in other words, the data is highly skewed. The main challenge of dealing with this type of data is that machine learning algorithms train with almost 100% accuracy yet fail to classify the minority class. This is intuitive: when the majority class makes up 99% of the data, even a classifier hard-coded to always predict the majority class will still be 99% accurate.

We appreciate that a false negative (an actual severity code 4 that goes unpredicted) is very costly here. The situation is just like detecting fraudulent transactions or diagnosing diseases.

There are many ways to deal with this situation by synthetically balancing the data before training. We might (1) under-sample the majority class, (2) over-sample the minority class, or (3) combine (1) and (2), i.e. over- and under-sample simultaneously.

A combination of over- and under-sampling will be used: since the data is large enough, level 4 will be randomly over-sampled to 10,000 observations and the other levels will be randomly under-sampled to 10,000.
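A sketch of this combined strategy using imbalanced-learn's random samplers, applied to the binary 'Severity 4' target created above:

from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Binary target from the regrouping step; everything else is a predictor
y2 = df['Severity 4']
X2 = df.drop(columns=['SEVERITYCODE', 'Severity 4'])

# Over-sample the fatal class (1) up to 10,000 rows...
over = RandomOverSampler(sampling_strategy={1: 10000}, random_state=0)
X_over, y_over = over.fit_resample(X2, y2)

# ...then under-sample the non-fatal class (0) down to 10,000 rows
under = RandomUnderSampler(sampling_strategy={0: 10000}, random_state=0)
X_bal, y_bal = under.fit_resample(X_over, y_over)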

Correlation

Let us now get an idea of how the variables are correlated. The variable ST_COLCODE is excluded here, as its long list of categories would make the plot exceptionally tall.
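A sketch of the correlation heatmap, with the ST_COLCODE dummy columns excluded as noted:

import seaborn as sns

# Correlation of all features except the many ST_COLCODE dummies
cols = [c for c in df.columns if not c.startswith('ST_COLCODE')]
sns.heatmap(df[cols].corr(), cmap='coolwarm', center=0)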

    We can see that the variables are not highly correlated.

Classification Models (applied to balanced data)

We are going to consider the following models (a minimal training sketch follows the list):

    i) Logistic Regression

    ii) k-Nearest Neighbors (kNN)

    iii) Decision Tree

    iv) Random Forest

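A sketch of fitting the four models on the balanced data (X_bal and y_bal come from the resampling step; the hyper-parameters here are illustrative defaults, not the author's exact settings):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'kNN': KNeighborsClassifier(),
    'Decision Tree': DecisionTreeClassifier(random_state=0),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=0),
}

Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_bal, y_bal, test_size=0.2, random_state=21, stratify=y_bal)

# Fit each model and report its test accuracy
for name, model in models.items():
    model.fit(Xb_train, yb_train)
    print(name, accuracy_score(yb_test, model.predict(Xb_test)))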
The results of the above four models are summarized below.

    a) Accuracy: Accuracies achieved by different algorithms are shown here.

    b) Confusion matrices:

c) Feature importance: Relative importance of features, shown below for:

    i) Decision Tree

    ii) Random Forest

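The importances can be read off the fitted tree-based estimators; a sketch, assuming X_bal (and hence Xb_train) is a DataFrame so column names are available:

# Top features by impurity-based importance for the tree models
for name in ('Decision Tree', 'Random Forest'):
    imp = pd.Series(models[name].feature_importances_, index=Xb_train.columns)
    print(name)
    print(imp.sort_values(ascending=False).head(10))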
Inference

We can conclude that the Random Forest is the best model in this scenario (the Decision Tree and the other models perform almost identically). An interesting point to note here is that the top important features differ somewhat between the Random Forest and the Decision Tree models. Following the Random Forest model, we see that special attention needs to be given to pedestrians (the topmost important feature), speeding, collisions with parked cars, rear-end collisions and drivers under the influence of alcohol/drugs. This result is very much expected. The collision codes 50 (Struck Fixed Object), 32 (One Parked - One Moving), 10 (Entering At Angle) and 14 (From Same Direction - Both Going Straight - One Stopped - Rear End) are the major influencers.

Future Study

The relations between the key features and accident severity can be studied further in detail

Different data balancing techniques can be applied and evaluated

Development of a much more complex real-time accident risk prediction model

Translated from: https://medium.com/rigging-real-artificially/data-science-a-road-to-safer-roads-f713a3be97d
