Discovering Beer Type from Ingredients Using Classification
In this article, I will analyze a dataset of almost 80,000 beer recipes. Using supervised learning, I will attempt to estimate the beer typology from the recipe process. The dataset has been downloaded from Kaggle at this link.
Most of the time spent creating this model actually goes into preprocessing the data. The dataset is incomplete, and since it consists of independent samples rather than a time series, I am unable to use interpolation to solve the problem.
Importing the Dataset
Creating Functions
Preprocessing (2 stages)
Extracting Features and Labels
Splitting
Training the Model
Performance Evaluation

Because I am using Google Colab to perform the experiment, I have downloaded the dataset into my Drive, and I will now import it with pandas.
```python
import pandas as pd
from sklearn import preprocessing
import numpy as np

# importing dataset
df = pd.read_csv('/content/drive/My Drive/Colab Notebooks/Projects/20200827_Beer_Classifier/recipeData.csv', engine='python')
df
```

Original Dataset

Before proceeding further, I always keep a set of pre-made functions on hand that I can use later on. In this case, I am only using a one_hot encoding. However, I do not want a big chunk of code in the middle of the preprocessing; it is much more comfortable to call it with a single line of code.
```python
# functions
def one_hot(df, partitions):
    # remove the listed columns from the frame and append their dummy-encoded versions
    for col in partitions:
        k = df.pop(col)
        k = pd.get_dummies(k, prefix=col)
        df = pd.concat([df, k], axis=1)
    return df
```

The processing of this dataset has been quite tough. I am dividing the preprocessing into two separate stages.
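As a quick sanity check, here is a toy usage of the one_hot helper; the column names in this miniature frame are made up for illustration only:

```python
import pandas as pd

def one_hot(df, partitions):
    # remove the listed columns from the frame and append their dummy-encoded versions
    for col in partitions:
        k = df.pop(col)
        k = pd.get_dummies(k, prefix=col)
        df = pd.concat([df, k], axis=1)
    return df

# toy frame mimicking one numeric and one categorical column
toy = pd.DataFrame({'IBU': [40, 20], 'BrewMethod': ['All Grain', 'BIAB']})
encoded = one_hot(toy, ['BrewMethod'])
print(list(encoded.columns))  # ['IBU', 'BrewMethod_All Grain', 'BrewMethod_BIAB']
```

Each categorical column is replaced by one indicator column per category, prefixed with the original column name, so the call site stays a single line.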
```python
# preprocessing_1
# getting rid of columns I do not need
df = df.drop(['BeerID', 'Name', 'StyleID'], axis=1)
```

After having a look at the data, I got rid of the columns that have too many NaN values, as well as StyleID, which is the conversion of the beer style labels into numbers. If I were to keep a copy of the labels among the features, the model would be jeopardized.
```python
# getting rid of columns with NaN values
df = df.drop(['URL', 'PrimingAmount', 'UserId', 'PrimingMethod', 'PitchRate', 'MashThickness', 'PrimaryTemp'], axis=1)
df = df.dropna(axis=0)
df
```

In the second step of preprocessing, I will need to get rid of the rows that contain NaN values, which leaves approximately 70,000 samples. Then I will need to align the indexes of the features and the labels.
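A quick way to decide which columns are worth dropping is to count the missing values per column. A minimal sketch, using a toy frame standing in for the raw recipe data:

```python
import pandas as pd
import numpy as np

# toy stand-in for the raw recipe data (column names taken from the real dataset)
toy = pd.DataFrame({
    'OG': [1.05, 1.06, 1.04],
    'PrimingAmount': [np.nan, np.nan, '1/2 cup'],
    'MashThickness': [1.25, np.nan, np.nan],
})

# count missing values per column; mostly-empty columns are candidates for dropping
missing = toy.isna().sum()
print(missing)
```

Columns where the count approaches the number of rows carry too little information to keep, while rows with only an occasional NaN can be dropped individually.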
```python
# preprocessing_2
# saving one_hot columns
df_ = df[['SugarScale', 'BrewMethod', 'Style']]
df_

# transform
scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
df = scaler.fit_transform(df.drop(['SugarScale', 'BrewMethod', 'Style'], axis=1))
df = pd.DataFrame(df)

# align the indexes of the datasets
df.index = df_.index
df

# reattach datasets
df = pd.concat([df, df_], axis=1)
df

# one_hot
df = one_hot(df, ['SugarScale', 'BrewMethod'])
df

# dropna
df = df.dropna(axis=0)
df
```

I will now split my dataset in 80:20 proportions for the train and test sets.
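The outline at the top lists an "Extracting Features and Labels" step that the snippets do not show explicitly, even though X and y are used in the split below. A minimal sketch of what that step likely looks like, assuming Style is the label and every remaining column is a feature (the toy frame and its column names are illustrative only):

```python
import pandas as pd

# toy stand-in for the preprocessed recipe DataFrame
df = pd.DataFrame({
    'OG': [1.05, 1.06, 1.04],
    'IBU': [40.0, 60.0, 20.0],
    'Style': ['IPA', 'IPA', 'Lager'],
})

# Style is the label; every other column is a feature
X = df.drop('Style', axis=1)
y = df['Style']
```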
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```

I will be using a Support Vector Machine as the model. This is one of the most common classification models.
```python
from sklearn import svm

clf = svm.SVC(C=1.2, kernel='linear', degree=3)  # degree is ignored by the linear kernel
clf.fit(X_train, y_train)
```

y_pred is the estimation made on the test set by the model I just trained.
```python
y_pred = clf.predict(X_test)
```

To estimate performance, I will compare y_pred with y_test.
```python
from sklearn import metrics

print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
# Using StyleID as label
# Accuracy: 0.22434812055536743
# Using Style as label
# Accuracy: 0.22631877481565513
```

The dependent variable is poorly predicted by the independent variables. I tried multiple tuning configurations and searched among other results on Kaggle; unfortunately, the highest accuracy score was around 45%.
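With this many beer styles and a heavily imbalanced class distribution, a single accuracy number hides which styles the model actually recovers. A hedged sketch of a per-class breakdown, shown on toy label arrays since the real y_test and y_pred come from the run above:

```python
from sklearn.metrics import accuracy_score, classification_report

# toy stand-ins for the real y_test / y_pred from the evaluation above
y_test = ['IPA', 'IPA', 'Lager', 'Stout', 'IPA', 'Lager']
y_pred = ['IPA', 'Lager', 'Lager', 'IPA', 'IPA', 'Lager']

print("Accuracy:", accuracy_score(y_test, y_pred))
# per-class precision/recall/F1 exposes which styles are never predicted correctly
print(classification_report(y_test, y_pred, zero_division=0))
```

On the real data, a report like this would show whether the ~22% accuracy is spread thinly across all styles or concentrated in a few dominant ones.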
Translated from: https://medium.com/towards-artificial-intelligence/discovering-beer-type-from-ingredients-using-classification-b2dd8b41e482