Practical Canary Releases in Kubernetes with Argo Rollouts
At Soluto, our microservices-based infrastructure, combined with all of our CI/CD tooling, allows us to move fast, with multiple releases a day delivering features and fixes to our customers.
In this storm of releases, issues are sometimes found in production only after a release has gone out. When this happens, we want to protect our customers from being exposed to these issues and, at the same time, we want to know about them early on. This is where Argo Rollouts comes into the picture, with its support for canary release strategies.
There are a lot of guides out there for setting up a canary release process in your K8s cluster, and this post is not a “how to”: it assumes you already know how Argo Rollouts works. Instead, it is about the choices you will make, the issues you might face, and the optimizations you will need in order to make your canary release process efficient. You can read the whole thing for a better understanding, or skip to the summary and lessons-learned sections for a quick read.
Our mission when we decided to implement canary releases was:
“On each new deployment, keep the old version running and give time for the new version to prove it works in production with minimum exposure to customers”
This means that we have to:
- Run two versions of the service at the same time in the production K8s cluster
- Split production traffic between the two versions in a controlled manner
- Automatically analyse how well the new version is performing
- Automatically replace versions or roll back based on the result of the analysis

There are solutions out there for doing canary releases, like Flagger (which is planned to merge with Argo Rollouts, yay!), but our choice was Argo because:
- Traffic splitting support without using a mesh provider, for internal traffic that doesn't go through the Ingress
- We already had Argo CD, and Argo Rollouts integrates with it easily (more on that later)

Meaning: if you don't have a mesh provider (Istio), Argo Rollouts splits traffic between versions by creating a new ReplicaSet that uses the same Service object, and the Service will still split the traffic evenly across pods (new and old). In other words, controlling the number of pods controls the traffic percentage.
Traffic splitting by pods count

Not ideal, but it does the job, and if you do have Istio, you can control traffic more precisely, as mentioned here.
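For illustration, here is a minimal sketch of what such a Rollout can look like (the service name, image, and numbers are made up): with 20 replicas and no mesh provider, `setWeight: 5` is approximated by scaling the canary ReplicaSet to 1 pod while the stable one keeps 19, all behind the same Service.

```yaml
# Sketch only: hypothetical service name, image, and step values.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-service
spec:
  replicas: 20
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      containers:
        - name: my-service
          image: registry.example.com/my-service:1.2.3
          ports:
            - containerPort: 8080
  strategy:
    canary:
      steps:
        - setWeight: 5            # ~1 of 20 pods runs the new version
        - pause: {duration: 10m}  # give the analysis time to collect data
        - setWeight: 10           # ~2 of 20 pods
        - pause: {duration: 10m}
```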
When thinking about our mission, we started shaping the desired behaviour of our canary process:
1. We have version N already deployed in production
2. Version N+1 is deployed as a canary rollout
3. Both N and N+1 now exist, and N+1 is assigned a percentage of the traffic
4. Wait for a while
5. Measure how well N+1 is performing
6. If it is performing well, increase traffic and go to step 4
7. If it is not performing well, delete N+1 and move all traffic back to N
8. Repeat steps 4–7 as needed until all analysis completes successfully, then switch all traffic to N+1 and delete N

The combination of Rollout and AnalysisTemplate in Argo Rollouts is enough to give us the flexibility to configure a canary release strategy like the one above, in terms of steps, traffic control and analysis. But what is a good strategy? There is no single correct answer; it depends on the use case. Below is how we shaped a strategy that fits our use case; it may also be a good fit for you, or at least an inspiration.
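To give a sense of how the two objects fit together, below is a rough sketch of an AnalysisTemplate (the template name, Prometheus address, and measurement counts are illustrative, not our exact configuration) that checks a success rate; the Rollout's canary steps then reference it by name through an analysis step, as shown in a later sketch.

```yaml
# Sketch only: hypothetical template name and Prometheus address; the query is the
# success-rate query discussed later in this post.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
    - name: success-rate
      interval: 1m                # take a measurement every minute...
      count: 5                    # ...five times in total
      successCondition: result[0] >= 0.95
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: |
            sum(increase(http_request_duration_seconds_count{status_code=~"2.*"}[15m]))
            /
            sum(increase(http_request_duration_seconds_count[15m]))
```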
Starting from scratch, we asked ourselves some questions:
The first question was how much traffic to route to the canary. If it's too low, the canary will not get enough traffic, and the analysis will be unreliable; too much traffic will affect more customers if an issue happens. In other words, a canary should handle between 5% and 10% of your total traffic.
The next question was how long each pause should be. For the metrics in the statistical analysis to be reliable, we need at least 50 data points for good results, meaning that we need to pause for a duration that allows the monitoring system (Prometheus, DataDog, etc.) to collect metrics at least 50 times. The interval varies between setups, but in our case it was tuned so that metrics are collected every 15 seconds, which makes the minimum pause about 12.5 minutes (50 × 15 s) with active users generating traffic.
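Assuming Prometheus is the metrics backend, the relevant knob is its scrape interval; a minimal sketch of that setting, together with the pause arithmetic:

```yaml
# prometheus.yml sketch (assuming Prometheus as the monitoring backend).
global:
  scrape_interval: 15s   # 50 data points ≈ 50 × 15 s = 750 s ≈ 12.5 minutes of pause
```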
Then, how long should the whole canary run take, and in how many steps? There's no correct answer to this question, but our developers didn't want to wait long to find out if something was wrong, and we needed to protect our customers:
First Step (Fast Fail): if there's an obvious issue in the new release, developers did not want to wait for the whole analysis to finish in order to see it. We wanted something that could tell us “Hey! Your code doesn't work!” very quickly, so this is the analysis for the first step:
- 5% traffic
- 13 minutes pause (again, 50 data points)
- Measure the metrics, then either kill the canary or move to the next step

Steps 2 and 3 are identical: 10% traffic with a 30-minute duration.
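In Rollout terms, that schedule could look roughly like the following steps fragment (a sketch; `success-rate` is the hypothetical AnalysisTemplate from the earlier sketch):

```yaml
# Sketch of the step schedule described above, not our literal manifest.
strategy:
  canary:
    steps:
      - setWeight: 5               # first step (fast fail)
      - pause: {duration: 13m}     # ~50 data points at a 15-second scrape interval
      - analysis:
          templates:
            - templateName: success-rate
      - setWeight: 10              # steps 2 and 3: 10% traffic, 30 minutes each
      - pause: {duration: 30m}
      - analysis:
          templates:
            - templateName: success-rate
      - pause: {duration: 30m}
      - analysis:
          templates:
            - templateName: success-rate
```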
This seemed good: a total of about 1.25 hours of canary run time, so we didn't have to wait long… But as it turned out, it wasn't good enough. See why in the lessons-learned section.
When a canary rollout is running, we need to see what's happening. You can use any of these:
Argo Rollouts kubectl plugin — provides a nicely formatted output for the status of a canary
Argo CD — if you use Argo CD, you already know it has a nice UI for showing you the status of the application inside your cluster, and when you combine it with rollouts, you can see what’s happening in real time with a beautiful UI.
Metrics — yes, Argo Rollouts exposes metrics that you can build dashboards with, so you can show the status of all your rollouts.
We have two types of services:
- API (http) based services: expose APIs called over the http protocol
- Worker services: consume from a queueing system, like pub/sub, Kafka, etc.

In both of them, we need to measure:
- Success rate: the percentage of successful operations should be more than 95%
  - In APIs, it's the percentage of 2xx responses out of total responses
  - In a worker, it's more complicated: there are no http calls, so how can we measure success?
- Latency: how long it takes to complete an operation
  - In APIs, it's the total time from the start of the request until the response is returned
  - In a worker, it's the total time from when the message is consumed until it's processed and acknowledged

To unify the metrics as much as possible between an API and a worker, we made use of the fact that we are using the sidecar pattern in workers, where a sidecar consumes from the queue and calls the main service over http:
Sidecar pattern and http based metrics

This converted the main service into an http based service, where we can use the same metrics as an API… good for us.
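Below is a rough sketch of that container layout (image names, ports, and the TARGET_URL variable are hypothetical): the sidecar consumes from the queue and forwards each message to the main container over http, so the main container exposes the same http metrics an API would.

```yaml
# Sketch only: hypothetical images, names, and ports. This is the pod template's
# containers section; the same shape applies whether it lives in a Deployment or a Rollout.
containers:
  - name: my-worker              # the main service: plain http, exposes http metrics
    image: registry.example.com/my-worker:1.2.3
    ports:
      - containerPort: 8080
  - name: queue-consumer         # sidecar: consumes from the queue, calls the main container
    image: registry.example.com/queue-consumer:1.0.0
    env:
      - name: TARGET_URL         # hypothetical sidecar setting
        value: http://localhost:8080/handle
```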
We let the developers use it for a few months and observed. Canary saved the day more than a few times, but failed to save it in some situations… Yes, the short run duration was the culprit, along with some other factors explained below.
Oh yes, more traffic means more operations, and more operations mean a more accurate analysis. In our case, our customers are in a different timezone than the one our releases are scheduled in, meaning that when a canary runs, it doesn't get the regular traffic volume it should have. We learned the hard way that issues are far more likely to be discovered if the analysis runs when our customers are more active.
To overcome this, the canary analysis should span the high-traffic hours, so a canary analysis that spans more than 24 hours is more likely to safely detect more issues.
When there are issues in production that need an urgent fix, we can't afford to wait 24 hours for the fix to be released, so in the CD pipelines we added the option to skip the canary analysis when releasing. This option proved to be a good addition for such cases.
For example, if we want a success rate of at least 95%, this Prometheus query calculates it:
sum(increase(http_request_duration_seconds_count{status_code=~"2.*"}[15m]))
/
sum(increase(http_request_duration_seconds_count[15m]))

As you can see, it sums up all requests from all endpoints, but some endpoints are called more than others, and some less. This doesn't mean they are more or less important, but in that query, endpoints that are called less often have almost no effect on the final percentage.
This introduces the need to change the query into a weighted sum over the endpoints. For example, considering two endpoints:
(
  sum(increase(http_request_duration_seconds_count{status_code=~"2.*", path="/api/v1/getSomething/"}[15m]))
  /
  sum(increase(http_request_duration_seconds_count{path="/api/v1/getSomething/"}[15m]))
) * 0.2
+
(
  sum(increase(http_request_duration_seconds_count{status_code=~"2.*", path="/api/v1/getAnotherThing/"}[15m]))
  /
  sum(increase(http_request_duration_seconds_count{path="/api/v1/getAnotherThing/"}[15m]))
) * 0.8

Since getAnotherThing gets a lot less traffic than getSomething, we increased its impact on the final result. This is good in terms of numbers, but the resulting query can be a nightmare to maintain, so think carefully about whether you really need to do this.
After running a canary release process in our Kubernetes clusters using Argo Rollouts for a few months, and after observing and collecting feedback, we arrived at the following optimisations for a more practical and efficient canary release process:
Load routed to your canary should be no less than 5% and no more than 10% of the traffic.
Using a sidecar pattern helps unify the metrics used in your services.
Make the canary analysis run for longer periods that span your high-traffic hours; in practice, this means more than 24 hours.
To reduce the developers' frustration, the first step of the analysis can be a “Fail Fast” step, where the duration of the analysis is the minimum time needed for the monitoring system to collect 50 data points from the metrics your service exposes (roughly 50 minutes with a one-minute scrape interval, or about 13 minutes if metrics are collected every 15 seconds).
The more traffic the better. You should try to release during your peak usage hours; if that's not possible, either simulate more traffic when you do a release or make the analysis run for longer periods, as mentioned above.
If you can, make your analysis (Prometheus query, Datadog query, or others) a weighted sum over each API/operation instead of a total average, since some endpoints are not called that often but are just as important as the others.
Make your developers aware of the status of their release by creating dashboards in Grafana that show the state of their canary release, and/or by combining them with notifications.
Finally, provide the option for your developers to skip canary analysis on new releases. This is particularly useful for urgent fixes to production bugs that cannot afford to wait days for the release to be rolled out.
It's recommended to follow this formula to get the most out of your canary release process; of course, you should tweak and modify it to better fit your use cases, since with canary there's no “one ring to rule them all”.
Translated from: https://medium.com/soluto-engineering/practical-canary-releases-in-kubernetes-with-argo-rollouts-933884133aea