Ways to Handle Categorical Data Before Training ML Models

In my last blog, I explained the types of missing values and the different ways to handle continuous and categorical missing values, with implementation.

After handling missing values in the dataset, the next step is to handle categorical data. In this blog, I will explain different ways to handle categorical features/columns, along with implementations in Python.

Introduction: All machine learning models are, at their core, mathematical models that need numbers to work with. Categorical data takes a limited set of possible values (categories) and is often stored as text. For example, Gender: Male/Female/Other, Rank: 1st/2nd/3rd, etc.

In a data science project, once the missing values have been handled, the next task is to deal with the categorical data in the dataset before applying any ML model.

    First, let’s understand the types of categorical data:

Nominal Data: Nominal data is labelled/named data. The order of the categories can be changed freely; a change in order does not affect its meaning. For example, Gender (Male/Female/Other), Age Group (Young/Adult/Old), etc.

Ordinal Data: Represents discrete and ordered units. Like nominal data, but with an inherent order/rank, so the order of the categories cannot be changed. For example, Rank: 1st/2nd/3rd, Education: High School/Undergraduate/Postgraduate/Doctorate, etc.
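
As a quick illustration (a minimal sketch, not from the original post; the Education values here are just examples), pandas can represent an ordinal feature as an ordered Categorical, which preserves the rank information that a plain nominal column does not:

import pandas as pd

# Ordinal: the categories carry a meaningful order.
education = pd.Categorical(
    ['High School', 'Doctorate', 'Undergraduate', 'Postgraduate'],
    categories=['High School', 'Undergraduate', 'Postgraduate', 'Doctorate'],
    ordered=True,
)
print(education.codes)  # [0 3 1 2] -- the codes respect the declared order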

    Ways to handle categorical features:

The dataset used in the examples is the Titanic dataset (a Kaggle dataset):

import pandas as pd
import numpy as np

Data = pd.read_csv("train.csv")
Data.isnull().sum()

DataType: columns of type object are the categorical features in the dataset.
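
If you want to list those categorical (object-typed) columns explicitly, here is a small sketch using standard pandas calls:

# Inspect column dtypes and pick out the object (categorical) columns.
print(Data.dtypes)
categorical_columns = Data.select_dtypes(include='object').columns
print(categorical_columns)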

1. Create Dummies

Description: Create dummy (binary) columns for each category of an object/category type feature. The value in each row is 1 if that category is present in that row, else 0. To create dummies, use pandas' get_dummies() function.

Implementation:

DataDummies = pd.get_dummies(Data)
DataDummies

Example: the passenger class (Pclass) column creates 3 new columns.
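
Note that pd.get_dummies(Data) expands every object column, including high-cardinality ones like Name and Ticket. A common variant (a sketch, not from the original post) restricts which columns are encoded and drops the first level of each to avoid one redundant dummy:

# Only encode selected columns; drop_first removes one redundant dummy per feature.
DataDummies = pd.get_dummies(Data, columns=['Sex', 'Embarked'], drop_first=True)
DataDummies.head()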

Advantage:

An easy and fast way to handle categorical column values.

Disadvantage:

The get_dummies method is not practical when the data has many categorical columns.

If a categorical column has many categories, it adds many new features to the dataset.

Hence, this method is only useful when the data has few categorical columns, each with few categories.

    2. Ordinal Number Encoding

Description: When a categorical variable is ordinal, the easiest approach is to replace each label/category with an ordinal number based on its rank. In our data, Pclass is an ordinal feature with the values First, Second, Third, so each category is replaced by its rank, i.e. 1, 2, 3 respectively.

Implementation:

Step 1: Create a dictionary with the category as key and its rank as value.

Step 2: Create a new column by mapping the ordinal column with the created dictionary.

Step 3: Drop the original column.

# 1.
PClassDict = {
    'First': 1,
    'Second': 2,
    'Third': 3,
}
# 2.
Data['Ordinal_Pclass'] = Data.Pclass.map(PClassDict)
# Display result
Data[['PassengerId', 'Pclass', 'Ordinal_Pclass']].head(10)
# 3.
Data = Data.drop('Pclass', axis = 1)
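
One caveat worth noting (a small sketch, not part of the original steps): .map() returns NaN for any category that is missing from the dictionary, for example on unseen data, so it can be worth filling in a default rank; the value 0 used here is just an assumption:

# Categories not present in PClassDict become NaN after .map();
# fill them with a placeholder rank (0 is an arbitrary choice).
Data['Ordinal_Pclass'] = Data['Ordinal_Pclass'].fillna(0).astype(int)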

Advantage:

The easiest way to handle ordinal features in the dataset.

Disadvantage:

Not suitable for nominal features in the dataset.

    3. Count / Frequency Encoding

Description: Replace each category with its frequency, i.e. the number of times that category occurs in that column.

Implementation:

Step 1. For each categorical column, create a dictionary with the category name as key and the category's count (its frequency in that column) as value.

Step 2. Create a new column, which acts as a weight for each category, by mapping the original column with its respective dictionary.

Step 3. Drop the original columns.

# 1.
Pclass_Dict = Data['Pclass'].value_counts()
Salutation_Dict = Data['Salutation'].value_counts()
Sex_Dict = Data['Sex'].value_counts()
Embarked_Dict = Data['Embarked'].value_counts()
Cabin_Serial_Dict = Data['Cabin_Serial'].value_counts()
Cabin_Dict = Data['Cabin'].value_counts()
# 2.
Data['Encoded_Pclass'] = Data['Pclass'].map(Pclass_Dict)
Data['Salutation_Dict'] = Data['Salutation'].map(Salutation_Dict)
Data['Sex_Dict'] = Data['Sex'].map(Sex_Dict)
Data['Embarked_Dict'] = Data['Embarked'].map(Embarked_Dict)
Data['Cabin_Serial_Dict'] = Data['Cabin_Serial'].map(Cabin_Serial_Dict)
Data['Cabin_Dict'] = Data['Cabin'].map(Cabin_Dict)
# Display result
Data[['Pclass','Encoded_Pclass','Salutation','Salutation_Dict','Sex','Sex_Dict','Embarked','Embarked_Dict','Cabin_Serial','Cabin_Serial_Dict','Cabin','Cabin_Dict']].head(10)
# 3.
Data = Data.drop(['Pclass','Salutation','Sex','Embarked','Cabin_Serial','Cabin'], axis = 1)

Each category and its frequency count.
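
A closely related variant (a sketch, not in the original post) uses relative frequencies instead of raw counts, which keeps the encoded values on a 0-1 scale; it would need to run before step 3 drops the original columns:

# Frequency (normalized) encoding for a single column, before 'Pclass' is dropped.
Pclass_Freq_Dict = Data['Pclass'].value_counts(normalize=True)
Data['Pclass_Frequency'] = Data['Pclass'].map(Pclass_Freq_Dict)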

Advantage:

Easy to implement.

Does not add any extra features.

Disadvantage:

Cannot distinguish categories that occur the same number of times, i.e. it assigns the same value to both categories.

    4. Target/Guided Encoding

Description: Here, each category of the column is replaced by its rank when the categories are ordered by their joint probability with respect to the Target column.

Implementation: To show the implementation, I am using the Cabin column with respect to the Survived target column. The same steps apply to any ordinal column in the dataset.

Step 1. Replace the original cabin value with the first character of the cabin name.

Step 2. Calculate the joint probability (the mean of the target) for each category based on the target column.

Step 3. Create a list of the category indices sorted in ascending order of joint probability.

Step 4. Create a dictionary with the cabin category name as key and its joint probability ranking as value.

Step 5. Create a new column by mapping the cabin values with the joint probability ranking dictionary.

Step 6. Delete the original cabin column.

# 1.
Data['Cabin'] = Data['Cabin'].astype(str).str[0]
# 2.
Data.groupby(['Cabin'])['Survived'].mean()
# 3.
Encoded_Lables = Data.groupby(['Cabin'])['Survived'].mean().sort_values().index
# 4.
Encoded_Lables_Ranks = { k: i for i, k in enumerate(Encoded_Lables, 0) }
# 5.
Data['Cabin_Encoded'] = Data['Cabin'].map(Encoded_Lables_Ranks)
# 6.
Data = Data.drop('Cabin', axis = 1)

Cabin values with their joint probability ranking with respect to the target column.
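
If the same encoding has to be applied to held-out data, the learned ranking can simply be reused; a minimal sketch, assuming the Kaggle test split test.csv and using -1 as an arbitrary placeholder for unseen cabins (both assumptions, not from the original post):

# Hypothetical: apply the ranks learned on the training data to the test split.
Test = pd.read_csv("test.csv")
Test['Cabin'] = Test['Cabin'].astype(str).str[0]
Test['Cabin_Encoded'] = Test['Cabin'].map(Encoded_Lables_Ranks).fillna(-1)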

Advantages:

It does not affect the volume of the data, i.e. it does not add any extra features.

It helps the machine learning model learn faster.

Disadvantages:

Typically, mean or joint probability encoding leads to over-fitting.

Hence, to avoid overfitting, cross-validation or some other approach is required most of the time.

    5. Mean Encoding

Description: Similar to target/guided encoding; the only difference is that here we replace each category with its mean value with respect to the target column. Here, too, we use the Cabin column and the Survived target column.

Implementation:

Step 1. Calculate the mean of the target column (Survived) for each category in the cabin column.

Step 2. Create a new column, replacing each category with its mean, i.e. map the cabin column categories with the encoded mean dictionary.

Step 3. Drop the original cabin column.

# 1.
Encoded_Mean_Dict = Data.groupby(['Cabin'])['Survived'].mean().to_dict()
# 2.
Data['Cabin_Mean_Encoded'] = Data['Cabin'].map(Encoded_Mean_Dict)
# Display result
Data[['Cabin','Cabin_Mean_Encoded']].head()
# 3.
Data = Data.drop('Cabin', axis = 1)

Cabin categories with their corresponding mean with respect to the target column.

Advantages:

Captures information within the labels or categories, producing more predictive features.

Creates a monotonic relationship between the independent variable and the target variable.

Disadvantages:

May lead to overfitting the model; to overcome this, cross-validation is used most of the time (an out-of-fold variant is sketched below).
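
One common cross-validated version is out-of-fold mean encoding, sketched below with scikit-learn's KFold. This is an illustration rather than the author's code: it assumes the Cabin column is still present (i.e. run it in place of steps 2-3 above), and the fold count and global-mean fallback are arbitrary choices:

import numpy as np
from sklearn.model_selection import KFold

# Out-of-fold mean encoding: each row is encoded using only the other folds,
# which reduces the target leakage that causes overfitting.
Data['Cabin_Mean_OOF'] = np.nan
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in kf.split(Data):
    fold_means = Data.iloc[train_idx].groupby('Cabin')['Survived'].mean()
    Data.loc[Data.index[val_idx], 'Cabin_Mean_OOF'] = Data['Cabin'].iloc[val_idx].map(fold_means)
# Cabins unseen in a training fold fall back to the overall survival rate.
Data['Cabin_Mean_OOF'] = Data['Cabin_Mean_OOF'].fillna(Data['Survived'].mean())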

    6. Probability Ratio Encoding

Description: Here, each category of the column is replaced with a probability ratio with respect to the Target variable. I am using Cabin as the independent variable, and its categories are replaced with the ratio of the probability of surviving to the probability of dying in each cabin.

Implementation:

Step 1. Replace the original cabin value with the first character of the cabin name.

Step 2. Find the proportion (%) of people who survived in each cabin and store it in a new dataframe.

Step 3. Add a new column to the survival-probability dataframe with the probability of dying in each cabin.

Step 4. Add one more column to the survival-probability dataframe: the ratio of the survival and death probabilities.

Step 5. Create a dictionary from the probability ratio column.

Step 6. Create a new column in Data by mapping the cabin column categories with the encoded probability ratio dictionary.

Step 7. Drop the original cabin column.

# 1.
Data['Cabin'] = Data['Cabin'].astype(str).str[0]
# 2.
Probability_Survived = Data.groupby(['Cabin'])['Survived'].mean()
Probability_Survived = pd.DataFrame(Probability_Survived)
# 3.
Probability_Survived['Died'] = 1 - Probability_Survived['Survived']
# 4.
Probability_Survived['Prob_Ratio'] = Probability_Survived['Survived'] / Probability_Survived['Died']
# 5.
Encode_Prob_Ratio = Probability_Survived['Prob_Ratio'].to_dict()
# 6.
Data['Encode_Prob_Ratio'] = Data['Cabin'].map(Encode_Prob_Ratio)
# Display result
Data[['Cabin','Encode_Prob_Ratio']].head(10)
# 7.
Data = Data.drop('Cabin', axis = 1)

Cabin categories and their corresponding probability ratio of survival.

Advantages:

Does not add any extra features.

Captures information within the labels or categories, creating more predictive features.

Creates a monotonic relationship between the variables and the target, so it is suitable for linear models.

Disadvantages:

Not defined when the denominator is 0 (one way to guard against this is sketched below).

Like the two methods above, it can lead to overfitting, so cross-validation is usually performed to avoid and validate this.
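
A minimal way to guard against the zero-denominator case (a sketch, not from the original post) is to add a small smoothing constant before taking the ratio; the value of eps is arbitrary:

# Smooth the death probability so the ratio is always defined (eps is arbitrary).
eps = 1e-3
Probability_Survived['Prob_Ratio'] = (
    Probability_Survived['Survived'] / (Probability_Survived['Died'] + eps)
)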

    Conclusion:

Hence, in this blog, I have tried to explain the most widely used ways to handle categorical variables while preparing data for machine learning. The full code notebook is available at https://github.com/GDhasade/Medium.com_Contents/blob/master/Handle_Categorical_Data.ipynb

For more information, please visit http://contrib.scikit-learn.org/category_encoders/index.html.
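
Several of these schemes are also available ready-made in the category_encoders package linked above; a minimal sketch of its target encoder (assuming the package is installed and the Cabin column is still present; its smoothing behaviour differs slightly from the manual version shown earlier):

import category_encoders as ce

# Target-encode Cabin against Survived using the category_encoders package.
encoder = ce.TargetEncoder(cols=['Cabin'])
encoded = encoder.fit_transform(Data[['Cabin']], Data['Survived'])
Data['Cabin_CE_Target'] = encoded['Cabin']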

Translated from: https://towardsdatascience.com/ways-to-handle-categorical-data-before-train-ml-models-with-implementation-ffc213dc84ec
