Introducing Kubeflow to Zeals


Hi there, this is Allen from Zeals Japan. I work as an SRE / gopher, mainly responsible for microservices development.

Background

Story

Nowadays machine learning is everywhere, and we believe it will keep trending over the next few years. Data scientists work on large datasets on a daily basis to develop models that help the business in different areas.

What’s wrong?

We are no exception: our machine learning team works on different datasets across multiple areas, including deep learning (DL), natural language processing (NLP) and behaviour prediction, to improve our product. But as we handle massive amounts of data, they soon realized that working locally or on a cloud-provider notebook like Colab or Kaggle was dragging down their productivity significantly:

- Unable to scale and secure more resources when handling heavier workloads
- Limited access to GPUs
- On a cloud notebook, results do not persist automatically and will reset once you idle or exit
- Hard to share notebooks with your co-workers
- No way to use a custom image, so you need to set up the environment every time
- Hard to share common datasets within the team

Researching

Current Implementation

Originally we used a helm chart to install JupyterHub on our Kubernetes cluster, and we had a hard time managing resources and shared datasets through a shared volume.
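For context, here is a minimal sketch of the kind of helm-based install we had; the release name, namespace and values file are illustrative, not our exact setup.

kubectl create namespace jhub
helm repo add jupyterhub https://jupyterhub.github.io/helm-chart/
helm repo update
# config.yaml held the singleuser resource limits and the shared-volume
# mounts that we kept having to adjust by hand
helm upgrade --install jhub jupyterhub/jupyterhub \
  --namespace jhub --values config.yaml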

As part of the infrastructure team, we had to adjust resources for the ML team frequently, which was not ideal and obviously dragged down both teams’ productivity.

Tools?

There are multiple solutions available in the community, including kubespawner and the helm chart we originally used.

kubespawner

Stars: 328 (2020–08–12)

Pros

- Able to spawn multiple notebook deployments separated by namespace
- Extremely customizable configuration based on a Python API
- Ability to mount different volumes to different notebook deployments

Cons

- Community is small
- Lacking support for cloud-native features: setting up the network, the Kubernetes cluster, permissions etc. still needs to be handled manually
- Lacking authorization support

zero-to-jupyterhub-k8s

Stars: 740 (2020–08–12)

Pros

- Official support: the helm chart is published by JupyterHub
- Easy to set up and manage with helm
- Good authorization support, such as GitHub and Google OAuth

Cons

- Limited support for individual namespaces
- Hard to declare and mount volumes based on notebook usage
- Lacking support for cloud-native features: setting up the network, the Kubernetes cluster, permissions etc. still needs to be handled manually

Kubeflow

Stars: 9.2k (2020–08–12)

Pros

- Good support from both the authors and the community
- Good support for different cloud platforms
- Not limited to notebooks: it also comes with other tools that help with the machine learning process, such as pipelines and hyperparameter tuning
- Able to easily separate namespaces between different users without changing any code
- Can easily mount multiple volumes based on notebook usage
- Dynamic GPU support

Cons

- Very large stack that is hard to understand and customize
- Needs to live in its own cluster, so the running cost is higher
- Steep learning curve: compared to a plain notebook, Kubeflow also requires knowledge of Kubernetes when you use tools like pipelines and hyperparameter tuning

What We Chose?

Kubeflow is our pick. From the comparison above, we can easily see that Kubeflow has many features that I think we will need in the future. The entire solution also comes as one package, so it looks like it can be set up quite easily.

I quickly felt this might be the one I was looking for, and I couldn’t wait to try it.

They released the first stable version, 1.0, back in March, and I think it’s a good time for us to try it.

Installation

Try it out first!

At this stage I hadn’t decided to proceed with Kubeflow yet, but as infrastructure people, we always need to test a tool before we introduce it to others.

The installation of Kubeflow is quite simple if you are running on cloud. They have an out-of-the-box installation script for each cloud provider, and you just need to run it.

Setting up the project

Since we are running on GCP, I’ll use that as an example, but you can also find instructions for the cloud provider you are using, and even if you are hosting an on-premise cluster you will find your page here.

It’s good to create a new GCP project when you try something out, so it is isolated from other environments.

    gcloud projects create kubeflow

Installing the CLI

Follow the steps here to set up OAuth so the Kubeflow CLI can get access to GCP resources.

First we need to install the Kubeflow CLI; you can find the latest binary on the GitHub releases page:

    tar -xvf kfctl_v1.0.2_<platform>.tar.gz && mv kfctl /usr/local/bin/

Setting up the GCP basics

After that, just some standard gcloud configuration:

# Set your GCP project ID and the zone where you want to create
# the Kubeflow deployment:
export PROJECT=kubeflow
export ZONE=asia-east1-a
export CLIENT_ID=<CLIENT_ID from OAuth page>
export CLIENT_SECRET=<CLIENT_SECRET from OAuth page>
gcloud config set project ${PROJECT}
gcloud config set compute/zone ${ZONE}

Note that multi-zone is not yet supported if you want to use GPUs. We are using the asia-east1 region (zone asia-east1-a) here since it is the only region that has K80 GPU support right now (July 2020).
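If you want to double-check which accelerators a zone offers before committing to it, a quick query like this works (the zone name is just an example):

# List accelerator types available in a given zone
gcloud compute accelerator-types list --filter="zone:asia-east1-a"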

Spinning up the cluster

Spinning up the cluster is simply a matter of running:

export CONFIG_URI="https://raw.githubusercontent.com/kubeflow/manifests/v1.0-branch/kfdef/kfctl_gcp_iap.v1.0.2.yaml"
export KF_NAME=kubeflow
export BASE_DIR=$(pwd)
export KF_DIR=${BASE_DIR}/${KF_NAME}
mkdir -p ${KF_DIR}
cd ${KF_DIR}
kfctl apply -V -f ${CONFIG_URI}

Customizing the deployment

For simplicity we are applying directly here, but if you want to customize the manifests, that is also possible by running:

export CONFIG_FILE="kfdef.yaml"
curl -L -o ${CONFIG_FILE} https://raw.githubusercontent.com/kubeflow/manifests/v1.0-branch/kfdef/kfctl_gcp_iap.v1.0.2.yaml
kfctl build -V -f ${CONFIG_FILE}
# modify your manifest
kfctl apply -V -f ${CONFIG_FILE}

Verifying the installation

Get the kube context:

gcloud container clusters get-credentials ${KF_NAME} --zone ${ZONE} --project ${PROJECT}
kubectl -n kubeflow get all

Accessing the UI

Kubeflow will automatically generate an endpoint in the following format; it can take a few minutes before it becomes accessible.

kubectl -n istio-system get ingress
# NAME            HOSTS                                                      ADDRESS        PORTS   AGE
# envoy-ingress   your-kubeflow-name.endpoints.your-gcp-project.cloud.goog   34.102.232.34  80      5d13h
# https://<KF_NAME>.endpoints.<project-id>.cloud.goog/

That’s all, pretty easy! Now you can access the link and check the UI.

Setting up the notebook server

Creating the notebook server

Navigate to Notebook servers -> New server.

You can see there are tons of configurations we can make!

Settings for the notebook server

Breaking it down a bit:

Image

- Able to use a prebuilt TensorFlow notebook server image or a custom notebook server image (a sketch of building one follows this list)
- You can prebuild images with common dependencies installed, and everyone then has access to the same setup!
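As an illustration, a custom image can be built on top of the public base image we use later in this post; the target tag and project are placeholders, and fastai is just an example dependency:

# Build a notebook image with shared dependencies preinstalled;
# the Dockerfile is inlined via stdin
docker build -t gcr.io/<your-project>/notebook-fastai:1.0.0 - <<'EOF'
FROM gcr.io/kubeflow-images-public/tensorflow-2.1.0-notebook-cpu:1.0.0
RUN pip install --no-cache-dir fastai
EOF
docker push gcr.io/<your-project>/notebook-fastai:1.0.0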

CPU / RAM

Workspace volume

- Each notebook creates a new workspace volume by default. This ensures you won’t lose your work if you are away or the pod accidentally shuts down.
- You can even share a workspace volume with your team if you configure it as ReadWriteMany (see the sketch below).
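A minimal sketch of what such a shared claim could look like, assuming your cluster has a storage backend that supports ReadWriteMany (for example an NFS-backed storage class); the names and size are illustrative:

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-workspace
  namespace: kubeflow-allen-ng
spec:
  accessModes:
    - ReadWriteMany   # lets multiple notebook pods mount it read-write
  resources:
    requests:
      storage: 10Gi
EOF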

Data volumes

- Now it’s super easy to share a dataset by just using data volumes; scientists only need to choose which dataset they want to use
- You can even mount multiple datasets on the same notebook server

Configurations

This is used to store credentials or secrets. If you are using GCP, the list defaults to the Google credentials, so you can use the gcloud command or query Cloud SQL / BigQuery to access even more data.
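For example, you can sanity-check the injected credentials straight from a notebook cell, assuming the Cloud SDK is available in the image (the bucket name is a placeholder):

! gcloud auth list
! gsutil ls gs://<your-bucket>
! bq query --use_legacy_sql=false 'SELECT 1'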

GPUs

Now it’s dynamic! But don’t forget to turn it off once you have finished using it, otherwise it may blow up your bill!
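One way to double-check that nothing is lingering, assuming you run on GKE (which attaches the cloud.google.com/gke-accelerator label to GPU nodes):

# Any node listed here still carries a GPU and is still billing you
kubectl get nodes -l cloud.google.com/gke-accelerator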

Running our first experiment

Setup

I created a notebook server with all default parameters, running on the tensorflow-2.1.0-notebook-cpu:1.0.0 image.

I want to build a simple salary prediction model following the fastai tutorial.

Since the image doesn’t come with fastai, simply install it:

! pip install --user --upgrade pip
! pip install fastai

Training the model

We simply copy the code from the tutorial:

from fastai.tabular import *

path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df.to_csv('./adult.csv')  # saving it for later usage

procs = [FillMissing, Categorify, Normalize]
valid_idx = range(len(df)-2000, len(df))
dep_var = 'salary'
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']

data = TabularDataBunch.from_df(path, df, dep_var, valid_idx=valid_idx, procs=procs, cat_names=cat_names)
learn = tabular_learner(data, layers=[200,100], emb_szs={'native-country': 10}, metrics=accuracy)
learn.fit_one_cycle(1, 1e-2)

We successfully trained a model!

| epoch | train_loss | valid_loss | accuracy | time  |
| 0     | 0.319807   | 0.321294   | 0.847000 | 00:06 |

We can simply save the current checkpoint in the workspace and retrain from it next time!

    torch.save(learn.model.state_dict(), 'prediction.pth')

Hyperparameter Tuning

Katib

Not limited to notebook servers, Kubeflow also has tons of other modules that are very convenient for data scientists, and Katib is one of them.

Katib provides both hyperparameter tuning and neural architecture search; we will try out hyperparameter tuning here.

Writing the job

Using Katib is extremely easy, and if you are familiar with Kubernetes manifests it will be even easier for you. Katib uses Jobs on Kubernetes and repeatedly runs your job until it hits the target value or the maximum number of runs.

We will use the same salary prediction model, but this time we do want to tune those input values.

Training script

import argparse
from fastai.tabular import *

@dataclass
class MetricCallback(Callback):
    def on_epoch_end(self, **kwargs:Any):
        super().on_epoch_end(**kwargs)
        epoch = kwargs.get('epoch')
        acc = kwargs.get('last_metrics')[1].detach().item()
        # Katib collects metrics from stdout, so print them in a
        # "name=value" format, e.g.:
        # epoch: 1
        # accuracy=0.85
        print(f'epoch: {epoch}')
        print(f'accuracy={acc}')

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Salary prediction hyperparameter tuning.')
    parser.add_argument('--lr', type=float, help='Learning rate')
    parser.add_argument('--num_layers', type=int, help='Size of the second layer')
    parser.add_argument('--emb_szs', type=int, help='Embedding size for native-country')
    args = parser.parse_args()

    df = pd.read_csv('./adult.csv')
    path = "./"
    procs = [FillMissing, Categorify, Normalize]
    valid_idx = range(len(df)-2000, len(df))
    dep_var = 'salary'
    cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']

    data = TabularDataBunch.from_df(path, df, dep_var, valid_idx=valid_idx, procs=procs, cat_names=cat_names)
    learn = tabular_learner(data, layers=[200, args.num_layers], emb_szs={'native-country': args.emb_szs}, metrics=accuracy)
    learn.fit_one_cycle(5, args.lr, callbacks=[MetricCallback()])

Collecting metrics

Katib automatically collects the training metrics from stdout, so we only need to print them out.

In the args we pass lr, num_layers and emb_szs as hyperparameters.

Job definition

apiVersion: "kubeflow.org/v1alpha3"
kind: Experiment
metadata:
  namespace: kubeflow-allen-ng
  labels:
    controller-tools.k8s.io: "1.0"
  name: predict-salary-hyper
spec:
  objective:
    type: maximize
    goal: 0.9
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: random
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  parameters:
    - name: --lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.03"
    - name: --num_layers
      parameterType: int
      feasibleSpace:
        min: "50"
        max: "100"
    - name: --emb_szs
      parameterType: int
      feasibleSpace:
        min: "10"
        max: "50"
  trialTemplate:
    goTemplate:
      rawTemplate: |-
        apiVersion: batch/v1
        kind: Job
        metadata:
          name: {{.Trial}}
          namespace: {{.NameSpace}}
        spec:
          template:
            spec:
              containers:
                - name: {{.Trial}}
                  image: gcr.io/kubeflow-images-public/tensorflow-2.1.0-notebook-cpu:1.0.0
                  volumeMounts:
                    - mountPath: /home/jovyan
                      name: workspace-salary-prediction
                      readOnly: true
                  command:
                    - "cd /home/jovyan && python3 hyper_tuning.py"
                    {{- with .HyperParameters}}
                    {{- range .}}
                    - "{{.Name}}={{.Value}}"
                    {{- end}}
                    {{- end}}
              restartPolicy: Never
              volumes:
                - name: workspace-salary-prediction
                  persistentVolumeClaim:
                    claimName: workspace-salary-prediction

Explanation: we use objectiveMetricName: accuracy as the target metric, and the target value is goal: 0.9. The hyperparameters are sampled randomly:

- lr: random from 0.01 to 0.03
- num_layers: random from 50 to 100
- emb_szs: random from 10 to 50

We also configured the maximum number of trials with maxTrialCount: 12.

Result

The job will start automatically once you submit it.
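For reference, a sketch of submitting and watching the experiment from the CLI; the file name is just whatever you saved the manifest above as:

kubectl apply -f predict-salary-hyper.yaml
kubectl -n kubeflow-allen-ng get experiment predict-salary-hyper
kubectl -n kubeflow-allen-ng get trials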

The result updates as each job finishes, and you can see it under HP -> Monitor.

Previously, with only a jupyter notebook, we didn’t even conduct HP tuning: either you write a very huge loop that makes it run for a decade, or you simply use your 6th sense to decide the HPs.

Conclusion

Kubeflow is a very good out-of-the-box tool that lets you set up an analysis environment without any pain. It provides several powerful modules, such as the notebook servers and Katib covered above.

There are also more features that we haven’t touched on in this article, such as:

- Namespaced permission management
- Sharing notebooks within the team or with outsiders
- Continuous training and deployment for machine learning models
- Continuous ETL integration with cloud storage or a data warehouse

All of those are very common requirements from data scientists, and they fit most companies as well.

We are still in the middle of the transition, so we didn’t manage to cover all of Kubeflow’s features; we will definitely write more about it after we explore it further.

We are hiring!

We are the industry leader in chatbot commerce in Japan, and our company is based in Tokyo. If you are a talented engineer and interested in our company, simply drop an application here and we can start with some casual talk first.

Opening Roles

(Sorry that it’s still in Japanese right now; we are working on translating it to English.)

Translated from: https://medium.com/zeals-tech-blog/introducing-kubeflow-to-zeals-c41b6199d2b9
