After getting a lot of traction on my previous blog on full-stack data science, The Next Gen of Data Scientists Cohort, I have decided to start a blog series on data science in production. This series will go over the basics of the tech stack and techniques that you can familiarize yourself with to face the real data science industry, for specializations such as Machine Learning, Data Engineering, and ML Infrastructure. It will be a walkthrough of how you can take your academic projects to the next level by deploying your models and creating ML pipelines with best practices used in the industry.
As simple as it may sound, practicing data science on your side projects or academic projects is very different from how it's done in the industry. The latter requires a lot more in terms of code complexity, code organization, and data science project management. In this first part of the series, I will take you through how to serve your ML models by building APIs, so that your internal teams or folks outside your organization can use them.
In classrooms, we generally take a dataset from Kaggle, preprocess it, do exploratory analysis, and build models to predict something. Now, let's take it to the next level by packaging the model you built, along with the preprocessing you did on the data, into a REST API. Huh, what is a REST API? Wait, I am going to go over everything in detail soon. Let's start by defining what we will be using and the technology behind it.
API stands for Application Programming Interface, which basically means it is a computing interface that helps you interact with multiple software intermediaries. What is REST? REST stands for Representational State Transfer, and it is a software architectural style. Let me show you in a simple diagram what I am talking about:
Image from astera.com

So, the client can interact with our system to get predictions by using the models we built, and they don't need to have any of the libraries or models that we built. It also becomes much easier to showcase your projects if you are appearing for interviews or applying to higher education. It's something that they can see working, rather than three lines of shit written on your resume blah blah blah.
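To make that client/server interaction concrete, here is a minimal sketch of what calling a REST API looks like from the client side, using Python's requests library. The URL here is a made-up placeholder, not part of this project:

```
import requests

# a GET request to a hypothetical REST endpoint; the server replies with
# a status code and, typically, a JSON body that the client can parse
r = requests.get("https://api.example.com/v1/models/titanic/info")
print(r.status_code)  # e.g. 200 if the request succeeded
print(r.json())       # the parsed JSON response body
```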
Here we will be building our API to serve our machine learning model, and we will be doing all of that in Flask. Flask is a web framework for Python. You must have heard of two substantial names in the industry: Flask and Django. Both are amazing web frameworks for Python, but when it comes to building APIs, Flask is super fast due to its less complicated, minimal design. Woohoo! So, we will again be going through something that is prevalently used in the industry. Without wasting more of your time, let's start grinding some code and build our API for serving the ML model.
The first step is always to set up your own project environment, so that you can isolate your project's libraries and their versions from the local Python environment. There are two ways in which you can set up the Python environment for your project specifically: virtualenv and conda. Just so we are on the same page, I will be using Python 3.8.3 for this entire project, but you can use any version and that should be fine.
```
# installing virtualenv
pip3 install virtualenv

# creating your own virtual environment named mlapi
python3 -m venv mlapi

# activating your virtual environment
source mlapi/bin/activate
```

Or, with conda:

```
# create
conda create -n mlapi python=3.8.3

# activate
conda activate mlapi
```

After running either set of commands in your terminal, you will be in your project's own virtual environment. If you want to install anything in the virtual environment, it's as simple as a normal pip install. It's standard practice in the industry to create virtual environments while working on any project.
Once you are in the virtual environment, use the requirements.txt from the GitHub repo: https://github.com/jkachhadia/ML-API
Make sure you copy the requirements.txt file from the repo to your project folder, as we will be using it later; I will also show you how you can create your own requirements.txt file. After copying the file to your project folder and making sure you are in the environment you just created, run the following command in your terminal to install all the dependencies you need for the project.
```
pip install -r requirements.txt
```

Now, you are all set.
For this project, our main aim is to package and deploy our built ML model in the form of an API. So, we will be using Kaggle's starter Titanic dataset and a basic logistic regression model with feature engineering. If you want to know how I built the basic model, the code can be found in this GitHub repo, in the model_prep.ipynb notebook (assuming you are familiar with ipython notebooks). The code is inspired by one of the Kaggle kernels I found, as model building is not the main goal here.
We will be using the pickle library to save the model. Your model is, after all, a Python object with all of its equations and hyper-parameters in place, and it can be serialized/converted into a byte stream with pickle.
```
import pickle

# name of the file in which the model will be saved
name = "final_model.sav"

# save your model to that file
pickle.dump(model, open(name, 'wb'))

# load the model back
loaded_model = pickle.load(open(name, 'rb'))
```

The above code can be found in the model_prep notebook as well.
We will now create a Flask API with best practices. What best practices, man? Like how to write clean code that can be shipped to production and is easy to debug if any issues occur.
So first, we will create a helper_functions Python script that has all the preprocessing modules we will need. Again, this is the same preprocessing code you will find in the model_prep notebook, but we are turning it into functions that can be reused anywhere else.
```
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder


def preprocess(data_all):
    # complete missing age with median
    data_all['Age'].fillna(data_all['Age'].median(), inplace=True)

    # complete embarked with mode
    data_all['Embarked'].fillna(data_all['Embarked'].mode()[0], inplace=True)

    # complete missing fare with median
    data_all['Fare'].fillna(data_all['Fare'].median(), inplace=True)

    # delete the cabin feature/column and others previously stated to exclude
    drop_column = ['PassengerId', 'Cabin', 'Ticket']
    data_all.drop(drop_column, axis=1, inplace=True)

    # discrete variables
    data_all['FamilySize'] = data_all['SibSp'] + data_all['Parch'] + 1
    data_all['IsAlone'] = 1  # initialize to yes/1 (is alone)
    data_all.loc[data_all['FamilySize'] > 1, 'IsAlone'] = 0  # update to no/0 if family size is greater than 1

    # split title from name: http://www.pythonforbeginners.com/dictionary/python-split
    data_all['Title'] = data_all['Name'].str.split(", ", expand=True)[1].str.split(".", expand=True)[0]

    # fare bins/buckets using qcut, i.e. frequency bins; qcut vs cut:
    # https://stackoverflow.com/questions/30211923/what-is-the-difference-between-pandas-qcut-and-pandas-cut
    data_all['FareBin'] = pd.qcut(data_all['Fare'], 4)

    # age bins/buckets using cut, i.e. value bins
    data_all['AgeBin'] = pd.cut(data_all['Age'].astype(int), 5)

    # clean up rare title names; while 10 is arbitrary, it is a common minimum in statistics:
    # http://nicholasjjackson.com/2012/03/08/sample-size-is-10-a-magic-number/
    stat_min = 10
    title_names = (data_all['Title'].value_counts() < stat_min)  # True/False series with title name as index
    data_all['Title'] = data_all['Title'].apply(lambda x: 'Misc' if title_names.loc[x] else x)

    # encode categorical data
    label = LabelEncoder()
    data_all['Sex_Code'] = label.fit_transform(data_all['Sex'])
    data_all['Embarked_Code'] = label.fit_transform(data_all['Embarked'])
    data_all['Title_Code'] = label.fit_transform(data_all['Title'])
    data_all['AgeBin_Code'] = label.fit_transform(data_all['AgeBin'])
    data_all['FareBin_Code'] = label.fit_transform(data_all['FareBin'])

    return data_all
```

Now, we will not hard-code the variables or names that we will use in our final API script. Instead, we can create a separate Python file named configs.py, which will store all our variables for security purposes.
```
cols = ['Sex_Code', 'Pclass', 'Embarked_Code', 'Title_Code',
        'FamilySize', 'AgeBin_Code', 'FareBin_Code']
model_name = "final_model.sav"
```

Let's start building our API. We will start with a simple one: just a new version of hello world. Create a new file named app.py, and let's import all the libraries we will need to get our API up and running.
```
from flask import Flask, jsonify, request, make_response
import pandas as pd
import numpy as np
import sklearn
import pickle
import json
from configs import *
from helper_functions import preprocess
```

We have imported all the libraries in the code above, as well as the helper functions and the configs with our variables. Let's initialize a Flask application instance now.
```
app = Flask(__name__)
```

To start with, let's write a simple Flask-style hello world and create a new route for our application. Routes are the URLs that are backed by functions.
@app.route("/", methods=["GET"]) def hello(): return jsonify("hello from ML API of Titanic data!")Congrats! you wrote your first flask route. Now let's get this running by running the app object that we initiated with Flask.
```
if __name__ == '__main__':
    app.run(debug=True)
```

Yes! We are kind of done with our first mini gig. Let's run this locally. Open your terminal and run app.py (make sure you are in the project folder where app.py lives and in the virtual environment we created before).
```
python app.py
```

Woohoo! Our Flask app should be running on http://127.0.0.1:5000. If you go to that URL in your browser, you should get the message we added in the first route: "hello from ML API of Titanic data!". Super cool!
Let's start building our new route which will be our way of exposing our ML model.
@app.route("/predictions", methods=["GET"]) def predictions(): data = request.get_json() df=pd.DataFrame(data['data']) data_all_x_cols = cols try: preprocessed_df=preprocess(df) except: return jsonify("Error occured while preprocessing your data for our model!") filename=model_name loaded_model = pickle.load(open(filename, 'rb')) try: predictions= loaded_model.predict(preprocessed_df[data_all_x_cols]) except: return jsonify("Error occured while processing your data into our model!") print("done") response={'data':[],'prediction_label':{'survived':1,'not survived':0}} response['data']=list(predictions) return make_response(jsonify(response),200)In our new route above with added predictions/, what happens is if someone sends a get request to this URL of our flask application along with raw data in the form of JSON, we will preprocess the data the same way we did for creating the model, get predictions and send back the prediction results.
request.get_json() gives us the JSON data that was sent with the request. We convert that data into a dataframe, use our helper function preprocess() to preprocess it, and use the model_name and column names from the config file to load the model with pickle and make predictions on the sliced dataframe.
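To make the expected input concrete, here is a sketch of the JSON body the /predictions route consumes: a top-level "data" key holding a list of records in the raw Titanic schema (the passenger values below are illustrative):

```
# one dictionary per passenger, in the same raw schema as Kaggle's test.csv;
# preprocess() derives the encoded feature columns from these raw fields
payload = {
    "data": [
        {
            "PassengerId": 892, "Pclass": 3, "Name": "Kelly, Mr. James",
            "Sex": "male", "Age": 34.5, "SibSp": 0, "Parch": 0,
            "Ticket": "330911", "Fare": 7.8292, "Cabin": None, "Embarked": "Q"
        }
    ]
}
```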
After making the predictions, we create a response dictionary containing the predictions and the prediction-label metadata, convert it to JSON using jsonify, and return it. The 200 status code is sent because the request was a success. Once you save app.py after editing, the running Flask application (started with debug=True) will automatically reload to incorporate the new route.
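One refinement you may want on top of the route above (not in the original code): the error branches return their message with an implicit 200 status, so a client cannot tell a failure from a success without parsing the text. A sketch of one branch returning an explicit 400 instead:

```
# sketch: return a 400 Bad Request from the preprocessing error branch
# so API clients can detect the failure from the status code alone
try:
    preprocessed_df = preprocess(df)
except Exception:
    return make_response(
        jsonify("Error occurred while preprocessing your data for our model!"), 400
    )
```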
To test our API locally, we will write a small ipython notebook, or you can use the one in the GitHub repo named testapi.ipynb:
```
import requests
import pandas as pd
import numpy as np
import json

# reading test data
data = pd.read_csv('./input/test.csv')

# converting it into a list of record dictionaries
data = data.to_dict('records')

# packaging the data dictionary into a new dictionary
data_json = {'data': data}

# defining the header info for the api request
headers = {
    'content-type': "application/json",
    'cache-control': "no-cache",
}

# making the api request
r = requests.get(url='http://127.0.0.1:5000/predictions',
                 headers=headers, data=json.dumps(data_json))

# getting the json data out
data = r.json()

# displaying the data
print(data)
```

If you run the above code in your Python terminal or an ipython notebook, you will see that your API works like magic. Hurray! You have successfully exposed your model, but only locally :(
Heroku is a cloud platform that helps you deploy backend applications on its cloud. Yes, we will now be deploying our ML model API in the cloud.
Let's get started. Create your account on heroku.com. Once you do that and go to the dashboard you will have to create a new app.
Heroku dashboard (upper part)

You click on "Create new app" and name it accordingly; I named mine 'mlapititanic'.
Mine is not available because I already created it :p

Awesome! Now, you can click on your app, go to Settings, and add Python to the buildpacks section.
You can also do this the other way, by installing the Heroku CLI, which we will eventually have to do anyway to deploy our application.
After installing the CLI, you can also create an app from the command line, as shown below:
```
heroku create myapp --buildpack heroku/python
```

I love the CLI way, as I have been an Ubuntu/Mac person for five years now.
Now we will add two files, Procfile and runtime.txt, to the folder.
Procfile:

```
web: gunicorn app:app --log-file=-
```

runtime.txt:

```
python-3.8.3
```

The Procfile will basically run your app with gunicorn; make sure you have gunicorn installed in your virtual environment.
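If you want to sanity-check the gunicorn setup locally before deploying (optional, assuming gunicorn is already installed in your environment), you can run the same command the Procfile uses:

```
gunicorn app:app --log-file=-
```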
pip freeze>requirements.txtThis will basically dump all your app/virtual environment’s dependencies into a requirements.txt file.
Now, if you go to the Deploy section of Heroku, they have super clear instructions there about how to deploy, but I will put them below as well.
```
# log into the heroku cli
$ heroku login

# set up your heroku remote
$ cd my-project/
$ git init
$ heroku git:remote -a <your-app-name>

# push your entire ML API Flask application into production
$ git add .
$ git commit -am "make it better"
$ git push heroku master
```

These commands will push your code to the Heroku cloud and build your Flask application with its dependencies. Congratulations! You have deployed your ML API to cloud/production.
Now you can go to https://<your-app-name>.herokuapp.com/ and you will see a hello from the app, just as we saw locally.
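As a quick smoke test from Python (a sketch; <your-app-name> is a placeholder for your actual Heroku app name), you can hit the root route first:

```
import requests

# the root route should return the hello message we defined earlier
r = requests.get("https://<your-app-name>.herokuapp.com/")
print(r.status_code, r.json())
```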
Now we will test the deployed API!
```
import requests
import pandas as pd
import numpy as np
import json

# reading test data
data = pd.read_csv('./input/test.csv')

# converting it into a list of record dictionaries
data = data.to_dict('records')

# packaging the data dictionary into a new dictionary
data_json = {'data': data}

# defining the header info for the api request
headers = {
    'content-type': "application/json",
    'cache-control': "no-cache",
}

# making the api request against the deployed app
r = requests.get(url='https://<your-app-name>.herokuapp.com/predictions',
                 headers=headers, data=json.dumps(data_json))

# getting the json data out
data = r.json()

# displaying the data
print(data)
```

If it's running, you are all set! Woohoo!
Thus, we built our very own ML model API with best practices used in the industry. You could use this in your other projects, or showcase it on your resume rather than just putting in bullet points like you used to. It's something live that people can actually play with: proof of something you have really built.
Shoot your questions to [myLastName][myFirstName] at gmail dot com, or let's connect on LinkedIn.
Originally published at: https://towardsdatascience.com/data-science-in-production-building-flask-apis-to-serve-ml-models-with-best-practices-997faca692b9