Creating a baseline for a machine learning problem

    This is a baseline for a machine learning problem. The main goal of creating a baseline is to build a simple model that can be compared against other models, and to get some practice loading the data into a model. Loading the data into the model is honestly about 50% of the work in any machine learning project. I will be using the data from https://www.kaggle.com/nehaprabhavalkar/av-healthcare-analytics-ii.

    The first thing you should do, before writing any code, is read about the data you have. The description of the problem tells you that this is a healthcare management problem: we are trying to determine the length of stay for a patient given various attributes. Some extra reading tells us why the problem matters: improper management of patient capacity can cause unnecessary deaths and complications.

    Now we should take a look at the files we have in the folder. There are four files: sample_sub.csv, test_data.csv, train_data.csv, and train_data_dictionary.csv.
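
    If you prefer to check this from code rather than a file browser, a quick sketch like the one below works, assuming the dataset CSVs have been downloaded into the current working directory:

    import os

    # list the CSV files that came with the dataset
    for name in sorted(os.listdir(".")):
        if name.endswith(".csv"):
            print(name)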

    From the names of the files we can guess that sample_sub.csv is a sample submission showing the expected format for the competition, test_data.csv is the testing data, and train_data.csv is the training data. train_data_dictionary.csv is a dictionary that describes the data. This last one is the most interesting to us, as it tells us what each descriptor means, so let's quickly look at it in Excel.

    A look at it shows that there are 18 variables.

    We can now form some basic expectations about our data. We should expect case_id to be the unique identifier for each case, and patientid to be the unique identifier for each patient. We should expect hospital_code, hospital_type_code, city_code_hospital, hospital_region_code, department, ward_type, ward_facility_code, city_code_patient, and type of admission to be nominal: if we see hospital_code 1 for one patient and hospital_code 2 for another, there is no relation between the two numbers. The other variables are ordinal (or more descriptive), where there is a relation between the numbers: if one patient has bed grade 1 and another has bed grade 2, we know that one patient got the better bed.

    Now that we have done that, we should write the program, load the data, and make sure the data reflects our understanding of it.

    First things first, we must import our packages.

    import os
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt  # used for the graphs later on

    Now we should load up our data.

    train = pd.read_csv('train_data.csv')
    test = pd.read_csv('test_data.csv')
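
    To confirm the data matches the expectations we formed from the data dictionary, a quick sanity check along these lines is worth running; the exact calls here are just one way to do it:

    # peek at the first few rows, the column types, and the sizes
    print(train.head())
    print(train.dtypes)
    print(train.shape, test.shape)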

    We will be manipulating the training and testing data at the same time, so they should be merged into one dataframe. First, though, we need to give both datasets identifiers so that we can separate them again later on.

    train['dataset'] = 'train'
    test['dataset'] = 'test'
    df = pd.concat([train, test])

    It is a safe assumption that our case_id will not be useful, so we can remove it.
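
    The original post does not show the code for this step; a minimal sketch, assuming the column is named case_id as in the data dictionary, would be:

    # case_id is just a row identifier, so it carries no signal for the model
    df = df.drop('case_id', axis=1)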

    Before we do anything else, we should check for NAs. By running df.isnull().sum() we can see that there are null values in the Bed Grade variable, the City_Code_Patient variable, and the Stay variable (the Stay nulls come from the test rows, which have no labels). Since Bed Grade and City_Code_Patient are categorical variables, we should replace their missing values with the most common value for each variable.

    df.groupby("Bed Grade")["patientid"].nunique().reset_index()
    df.groupby("City_Code_Patient")["patientid"].nunique().reset_index()
    df["Bed Grade"].fillna(2.0, inplace=True)
    df["City_Code_Patient"].fillna(8.0, inplace=True)
    df.isnull().sum()

    Now we can develop some graphs to take a look at the data.

    This first graph is meant to look at the differences between the training and testing sets. We want the program to make many different graphs depending on which descriptive variable we pass in, so we will create a function that takes a descriptive variable and returns a graph.

    First, the data needs to be grouped by the variable and by dataset type. Then it should count the patients in each group. After that, the graph is drawn as a bar chart, where we have to set the x and y positions and labels. All of this is done with the function below.

    def graph_gen(column):
        ds = df.groupby([column, 'dataset'])['patientid'].count().reset_index()
        # set width of bar
        barWidth = 0.50
        ds_test = ds[ds['dataset'] == 'test']
        ds_train = ds[ds['dataset'] == 'train']
        r1 = np.arange(len(ds_test)) * 2
        r2 = [x + barWidth for x in r1]
        plt.figure(num=None, figsize=(8, 6), dpi=80, facecolor='w', edgecolor='k')
        plt.bar(r1, ds_test['patientid'], color='#7f6d5f', width=barWidth, edgecolor='white', label='test')
        plt.bar(r2, ds_train['patientid'], color='#557f2d', width=barWidth, edgecolor='white', label='train')
        plt.xticks(r2, ds_test[column])
        plt.title(column)
        plt.legend()
        plt.show()
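
    As a usage sketch, generating the chart for the Department column is a single call (the resulting plot itself is not reproduced here):

    graph_gen('Department')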

    Above is the function running on the department variable.

    A quick glance at all of the graphs shows that there is no skew between the test and training sets.

    Now to check whether there is a skew in the response variable. To do that we should group by the response variable, count the unique patient ids for each level of the response, and then graph the resulting dataset.

    temp = df.groupby("Stay")["patientid"].nunique().reset_index()
    tempx = np.arange(len(temp))
    plt.bar(tempx, temp["patientid"], color="#7f6d5f", width=.5, edgecolor="white", label="test")
    plt.xticks(tempx, temp["Stay"])
    plt.show()

    We can see there is an obvious skew towards shorter stay times rather than longer ones, and this skew should be dealt with.

    Some extra exploration could be done using the .describe() method, which will give the count, mean, std, min, max, quartiles, most frequent value, and frequency of the top value.
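
    A minimal sketch of that exploration, assuming we also want the categorical summaries (unique values, top value, and its frequency) alongside the numeric ones:

    # include="all" adds unique/top/freq summaries for the non-numeric columns
    df.describe(include="all")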

    Now back to those nominal variables: all of them need to be encoded properly. Thankfully pandas provides a function called .get_dummies(); using this we can make our dummy variables.

    df = pd.get_dummies(df,
                        columns=["Hospital_code", "Hospital_type_code", "City_Code_Hospital",
                                 "Hospital_region_code", "Department", "Ward_Type",
                                 "Ward_Facility_Code", "City_Code_Patient", "Type of Admission"],
                        prefix=["Type_is", "Type_is", "Type_is", "Type_is", "Type_is",
                                "Type_is", "Type_is", "Type_is", "Type_is"])

    Now the ordinal variables have to be encoded correctly. This can be done with the .map method and a dictionary.

    df["Severity of Illness"].unique()
    df["Severity of Illness"] = df["Severity of Illness"].map({"Extreme": 3, "Moderate": 2, "Minor": 1})
    df["Age"] = df["Age"].map({"0-10": 0, "11-20": 1, "21-30": 2, "31-40": 3, "41-50": 4,
                               "51-60": 5, "61-70": 6, "71-80": 7, "81-90": 8, "91-100": 9})
    df["Stay"] = df["Stay"].map({"0-10": 0, "11-20": 1, "21-30": 2, "31-40": 3, "41-50": 4,
                                 "51-60": 5, "61-70": 6, "71-80": 7, "81-90": 8, "91-100": 9,
                                 "More than 100 Days": 10})

    Now we can create our final training, testing, and validation sets. We first separate the training set from the testing set and load up train_test_split. After separating the training and testing sets we can drop the dataset variable, since we will not need it anymore.

    from sklearn.model_selection import train_test_split

    train = df[df["dataset"] == "train"]
    test = df[df["dataset"] == "test"]
    train = train.drop("dataset", axis=1)
    test = test.drop("dataset", axis=1)
    train, valid = train_test_split(train, test_size=.2, stratify=train["Stay"])

    Now we generate the X and y variables for each set. This is easily done by taking the Stay variable as y and then dropping it from the X dataset.
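
    This step is not shown in the original code; a sketch that produces the x_train, y_train, x_valid, and y_valid names used by the model below would be:

    # y is the Stay column, X is everything else
    x_train = train.drop("Stay", axis=1)
    y_train = train["Stay"]
    x_valid = valid.drop("Stay", axis=1)
    y_valid = valid["Stay"]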

    Now we can make our model. Since the model is just a baseline, it will be very basic and easy to implement; the model itself only takes three lines.

    from sklearn.ensemble import RandomForestClassifier

    Rf = RandomForestClassifier(oob_score=True)
    model = Rf.fit(x_train, y_train)

    preds = model.predict(x_train)
    train_acc = (preds == y_train).sum() / len(preds)
    preds = model.predict(x_valid)
    valid_acc = (preds == y_valid).sum() / len(preds)

    Now that we have made our model we can use our validation set to validate the model and see how well it did.
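
    The accuracies computed above can be inspected with a simple print; the exact formatting here is just illustrative:

    print(f"training accuracy:   {train_acc:.2%}")
    print(f"validation accuracy: {valid_acc:.2%}")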

    The results show that we have 99% accuracy on our training set, which is great, but only 41% accuracy on our validation set, which is not very good. This large gap between the two values shows that our model is not generalizing well.

    However, this is simply a baseline, and the result is similar to other baselines currently out there, so this is where I will stop. I will be writing another story on how to improve the result after this.

    Translated from: https://medium.com/@ihoc_h_a/a-creating-a-baseline-for-a-machine-learning-problem-761898ef274c
