Discovering Beer Type from Ingredients Using Classification
In this article, I will analyze a dataset of almost 80,000 beer recipes. Using supervised learning, I will attempt to estimate the beer typology from the recipe process. The dataset has been downloaded from Kaggle at this link.
Most of the time spent creating this model actually goes into preprocessing the data. The dataset is incomplete, and since it consists of independent samples rather than a time series, I am unable to use interpolation to solve the problem.
Importing the Dataset
Creating Functions
Preprocessing (2 stages)
Extracting Features and Labels
Splitting
Training the Model
Performance Evaluation

Because I am using Google Colab to perform the experiment, I have downloaded the dataset into my Drive, and I will now import it with pandas.
```python
import pandas as pd
from sklearn import preprocessing
import numpy as np

# importing dataset
df = pd.read_csv('/content/drive/My Drive/Colab Notebooks/Projects/20200827_Beer_Classifier/recipeData.csv', engine='python')
df
```

Original Dataset

Before proceeding further, I always keep a set of pre-made functions on hand that I can use later on. In this case, I am only using a one_hot encoding. However, I do not want a big chunk of code in the middle of the preprocessing; it is much more comfortable to call it with a single line of code.
```python
# functions
def one_hot(df, partitions):
    # remove the listed columns from the frame and append their dummy-encoded versions
    for col in partitions:
        k = df.pop(col)
        k = pd.get_dummies(k, prefix=col)
        df = pd.concat([df, k], axis=1)
    return df
```

The processing of this dataset has been quite tough. I am dividing the preprocessing into two separate stages.
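As a quick sanity check, here is a toy usage of the one_hot helper; the column names in this miniature frame are made up for illustration only:

```python
import pandas as pd

def one_hot(df, partitions):
    # remove the listed columns from the frame and append their dummy-encoded versions
    for col in partitions:
        k = df.pop(col)
        k = pd.get_dummies(k, prefix=col)
        df = pd.concat([df, k], axis=1)
    return df

# toy frame mimicking one numeric and one categorical column
toy = pd.DataFrame({'IBU': [40, 20], 'BrewMethod': ['All Grain', 'BIAB']})
encoded = one_hot(toy, ['BrewMethod'])
print(list(encoded.columns))  # ['IBU', 'BrewMethod_All Grain', 'BrewMethod_BIAB']
```

Each categorical column is replaced by one indicator column per category, prefixed with the original column name, so the call site stays a single line.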
```python
# preprocessing_1
# getting rid of columns I do not need
df = df.drop(['BeerID', 'Name', 'StyleID'], axis=1)
```

After having a look at the data, I got rid of the columns that have too many NaN values, as well as StyleID, which is the conversion of the beer style labels into numbers. If I were to keep a copy of the labels among the features, the model would be jeopardized.
```python
# getting rid of columns with NaN values
df = df.drop(['URL', 'PrimingAmount', 'UserId', 'PrimingMethod', 'PitchRate', 'MashThickness', 'PrimaryTemp'], axis=1)
df = df.dropna(axis=0)
df
```

In the second step of preprocessing, I will need to get rid of the rows that contain NaN values, which leaves approximately 70,000 samples. Then I will need to align the indexes of the features and the labels.
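A quick way to decide which columns are worth dropping is to count the missing values per column. A minimal sketch, using a toy frame standing in for the raw recipe data:

```python
import pandas as pd
import numpy as np

# toy stand-in for the raw recipe data (column names taken from the real dataset)
toy = pd.DataFrame({
    'OG': [1.05, 1.06, 1.04],
    'PrimingAmount': [np.nan, np.nan, '1/2 cup'],
    'MashThickness': [1.25, np.nan, np.nan],
})

# count missing values per column; mostly-empty columns are candidates for dropping
missing = toy.isna().sum()
print(missing)
```

Columns where the count approaches the number of rows carry too little information to keep, while rows with only an occasional NaN can be dropped individually.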
```python
# preprocessing_2
# saving one_hot columns
df_ = df[['SugarScale', 'BrewMethod', 'Style']]
df_

# transform
scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
df = scaler.fit_transform(df.drop(['SugarScale', 'BrewMethod', 'Style'], axis=1))
df = pd.DataFrame(df)

# align the indexes of the datasets
df.index = df_.index
df

# reattach datasets
df = pd.concat([df, df_], axis=1)
df

# one_hot
df = one_hot(df, ['SugarScale', 'BrewMethod'])
df

# dropna
df = df.dropna(axis=0)
df
```

I will now split my dataset in 80:20 proportions for the train and test sets.
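The outline at the top lists an "Extracting Features and Labels" step that the snippets do not show explicitly, even though X and y are used in the split below. A minimal sketch of what that step likely looks like, assuming Style is the label and every remaining column is a feature (the toy frame and its column names are illustrative only):

```python
import pandas as pd

# toy stand-in for the preprocessed recipe DataFrame
df = pd.DataFrame({
    'OG': [1.05, 1.06, 1.04],
    'IBU': [40.0, 60.0, 20.0],
    'Style': ['IPA', 'IPA', 'Lager'],
})

# Style is the label; every other column is a feature
X = df.drop('Style', axis=1)
y = df['Style']
```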
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```

I will be using a Support Vector Machine as the model. This is one of the most common classification models.
```python
from sklearn import svm

clf = svm.SVC(C=1.2, kernel='linear', degree=3)  # degree is ignored by the linear kernel
clf.fit(X_train, y_train)
```

y_pred is the estimation made on the test set by the model I just trained.
```python
y_pred = clf.predict(X_test)
```

To estimate performance, I will compare y_pred with y_test.
```python
from sklearn import metrics

print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
# Using StyleID as label
# Accuracy: 0.22434812055536743
# Using Style as label
# Accuracy: 0.22631877481565513
```

The dependent variable is poorly predicted by the independent variables. I tried multiple tuning configurations and searched among other results on Kaggle; unfortunately, the highest accuracy score was around 45%.
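With this many beer styles and a heavily imbalanced class distribution, a single accuracy number hides which styles the model actually recovers. A hedged sketch of a per-class breakdown, shown on toy label arrays since the real y_test and y_pred come from the run above:

```python
from sklearn.metrics import accuracy_score, classification_report

# toy stand-ins for the real y_test / y_pred from the evaluation above
y_test = ['IPA', 'IPA', 'Lager', 'Stout', 'IPA', 'Lager']
y_pred = ['IPA', 'Lager', 'Lager', 'IPA', 'IPA', 'Lager']

print("Accuracy:", accuracy_score(y_test, y_pred))
# per-class precision/recall/F1 exposes which styles are never predicted correctly
print(classification_report(y_test, y_pred, zero_division=0))
```

On the real data, a report like this would show whether the ~22% accuracy is spread thinly across all styles or concentrated in a few dominant ones.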
Translated from: https://medium.com/towards-artificial-intelligence/discovering-beer-type-from-ingredients-using-classification-b2dd8b41e482