分布式系统的高可用性

科技2022-08-01 111

分布式系统的高可用性

Availability is one of the critical aspects of Distributed Systems. In this modern age, most systems need to guarantee availability. We may say, availability is the time a system remains functional to perform its required task in a specific period.

可用性是分布式系统的关键方面之一。在这个现代时代，大多数系统都需要保证可用性。我们可以说，可用性是指系统在特定时期内保持运行所需任务所需的时间。

According to Wikipedia, Availability is generally defined as uptime divided by total time (uptime plus downtime)

根据Wikipedia，可用性通常定义为正常运行时间除以总时间(正常运行时间加上停机时间)

We sometimes use the terms reliability and availability interchangeably, but they are not the same. Reliability is availability over time if we consider the full range of possible real-world situations that can occur. If a system is reliable, you can say it is available. However, if it is available, it may not necessarily be reliable.

有时我们可以互换使用术语可靠性和可用性，但是它们并不相同。如果我们考虑所有可能发生的现实情况，可靠性就是随着时间的推移而获得的可用性。如果系统可靠，则可以说它可用。但是，如果可用，它可能不一定可靠。

★为什么可用性至关重要？ (★Why is Availability critical?)

Let’s say we are launching a tutorial site. We have to make sure the availability of the videos of the tutorial, coding, and reading materials. Whenever clients go to the website of the tutorial or learning materials, they expect the system to be fully operational and available. The system may lose the client, and also it will be hard to get new clients. A broken system gives you bad publicity.

假设我们正在启动一个教程站点。我们必须确保提供教程，编码和阅读材料的视频。每当客户访问教程或学习资料的网站时，他们都希望该系统能够完全正常运行并可用。系统可能会失去客户，也很难获得新客户。损坏的系统会给您带来不好的宣传。

If the system is unavailable often, users will be dissatisfied. They may even switch to other competitor systems that provide the same service. So, for system designers, ensuring availability matters a lot.

如果系统经常不可用，用户将不满意。他们甚至可以切换到提供相同服务的其他竞争对手系统。因此，对于系统设计人员而言，确保可用性至关重要。

Now, these web services, if not available, may cause harm to their publicity, financial security, etc. But it will not be the end of the world. There is no such system that does not become unavailable sometimes.

现在，这些Web服务(如果不可用)可能会对其宣传，财务安全等造成损害。但这不会是世界末日。没有这样的系统有时不会变得不可用。

Let’s get into the example of supporting an airplane software. Software that helps an airplane to function properly when it’s in the air. Now, this kind of system becoming unavailable even for some minutes would be unacceptable. So, an aircraft that can be flown for many hours can be said to have an extremely high amount of availability. This can affect a life or death situation, as you can see.

让我们以支持飞机软件的示例为例。帮助飞机在空中飞行时正常运行的软件。现在，这种系统即使几分钟后仍无法使用，将是无法接受的。所以，可以飞行数小时的飞机可以说具有很高的可用性。如您所见，这可能会影响生死状况。

But we don’t need to go that far in our imagination to look for a life or death situation for high availability scenarios.

但是，我们无需花很多精力去寻找高可用性场景下的生死攸关情况。

Even a platform like youtube, imagine if youtube ever goes down, how much impact it will have on how many users. Hundreds of millions of people use youtube every day, many of them depend on youtube for their livelihood also. The same goes for cloud providers like AWS or Google cloud platforms. Can you imagine how many services will be blocked because of that? You might not be able to read this article right now!! Yes..that bad of an effect.

即使是像youtube这样的平台，也可以想象youtube是否出现故障，它将对多少用户产生多大影响。每天有成千上万人使用youtube，其中许多人也依靠youtube谋生。 AWS或Google云平台等云提供商也是如此。您能想象有多少服务会因此而被阻塞吗？您可能现在无法阅读本文！是的。效果不好。

One outage on such cloud providers can have huge and farreaching repercussions. All of these examples show us that availability matters a lot for system design.

此类云提供商的一次故障可能会产生巨大而深远的影响。所有这些示例向我们表明，可用性对系统设计至关重要。

★如何衡量可用性？ (★How to Measure Availability?)

Ok, we get it, availability is essential, but how do we measure availability? It can be calculated as the percentage of time that a system or service remains operational under normal conditions.

好的，我们明白了，可用性是必不可少的，但是我们如何衡量可用性呢？可以将其计算为系统或服务在正常条件下保持运行的时间百分比。

Let’s say we measure a system’s availability based on the percentage of its uptime in a year. So, if a system is is up and operational for six months of a year, it will have 50% availability. Now, imagine if Facebook or Uber were down 50% of the year!!! That’s really bad; it would not be acceptable. Nobody would use them, right?

假设我们根据一年中正常运行时间的百分比来衡量系统的可用性。因此，如果系统每年启动六个月并可以运行，它将具有50％的可用性。现在，假设Facebook或Uber一年下跌了50％！！！那真的很糟糕；这是不可接受的。没有人会使用它们，对不对？

In case of availability, we need to deal with very high percentages. To be honest, even the availability of 90% isn’t good enough. In this rate, Facebook would have been down for 2.5 hours every day. That’s 35 or 36 days a year. Would you use it then? I doubt even Zuckerberg would have used it if that happened. There is no way any product can survive in today’s market, with only 90% availability.

如果有可用性，我们需要处理很高的百分比。老实说，即使90％的可用性也不够好。以这样的速度，Facebook每天会宕机2.5个小时。一年35或36天。那你会用吗？我怀疑扎克伯格是否会使用这种方法。只有90％的可用性，任何产品都无法在当今市场上生存。

★什么是可用的Nines (★What are Nines in Availability)

If a system has availability of 99%, it is called two nines availability as the number nine appears two times. Even a 99% available system gives almost four days of downtime a year, which is unacceptable for services like Facebook, google.

如果系统的可用性为99％，则由于数字9出现两次，因此被称为2个9。即使是99％可用的系统，一年也会造成近四天的停机时间，这对于Facebook，google等服务是无法接受的。

Percentages of a availability are sometimes referred to by the number of nines or “class of nines” in the digits.

可用性百分比有时由数字中的9或“ 9类”表示。

Photo by 🇨🇭 Claudio Schwarz | @purzlbaum on Unsplash 🇨🇭Claudio Schwarz摄 | @purzlbaum在 Unsplash

For 99.9% availability is known as three nines and 99.99% as four nines availability. Five nines availability (99.999%) gives a 6 minutes downtime in a year, which you can say is the gold standard of high availability.

99.9％的可用性称为三个9，99.99％的可用性称为四个9。五分之九的可用性(99.999％)在一年中有6分钟的停机时间，您可以说这是高可用性的黄金标准。

High availability (HA) is a characteristic of a system that aims to ensure an agreed level of operational performance, usually uptime, for a higher than normal period.

高可用性(HA)是系统的一个特征，该系统旨在确保达到正常水平的正常运行时间(通常是正常运行时间)。

Having high availability comes with trade-offs like higher latency or lower throughput. When you are a system designer, it’s your job to think about these trade-offs and decide how high available systems you need. Maybe a part of your system needs to be highly available, not the whole system.

具有高可用性会带来诸如较高的延迟或较低的吞吐量之类的权衡。当您是系统设计师时，考虑这些折衷方案并确定所需的可用系统数量是您的工作。也许您系统的一部分需要高度可用，而不是整个系统。

Say, for example, let’s take the instance of Medium; it has many parts of the system. Do we need to have high availability of stat page in the profile? Not really. On the other hand, article writing, reading, and homepage loading parts need to be highly available. If you cannot see the stat page for some time, say 5–10 minutes, does it matter? I don’t think so. So, that part of the system can be less available (but of course, not 50% availability).

举例来说，让我们以中型为例；它具有系统的许多部分。配置文件中的统计信息页面是否需要高可用性？并不是的。另一方面，文章写作，阅读和主页加载部分需要高度可用。如果您在一段时间(例如5到10分钟)内看不到统计信息页面，这有关系吗？我不这么认为。因此，系统的那部分可用性可能会降低(但当然不是50％的可用性)。

★高可用性的解决方案是什么？ (★What is the solution for high availability?)

Now, we need to know how we can improve the availability of a system. The first and straightforward solution is your system should not have a single point of failure. So, if a component in the system is such that if that one fails, the entire system will fall.

现在，我们需要知道如何改善系统的可用性。第一个直接的解决方案是您的系统不应有单点故障。因此，如果系统中的某个组件发生故障，那么整个系统将崩溃。

Author) 作者提供的图像)

Redundancy is the solution to a single point of failure. Redundancy is the act of duplicating or multiplying a component of the system.

冗余是单点故障的解决方案。冗余是复制或倍增系统组件的行为。

Now let’s say a system has one server to handle the request of clients. It’s a single point of failure.

现在，假设系统有一个服务器来处理客户端的请求。这是单点故障。

So, we need to add more servers to handle the requests. But we will need a load balancer to balance the loads between the servers. Now, if the load balancer is out, the system will not be available.

因此，我们需要添加更多服务器来处理请求。但是我们需要一个负载平衡器来平衡服务器之间的负载。现在，如果负载均衡器不可用，则系统将不可用。

Author) Author )

So, we need multiple load balancers to remove the single point of failure of the system. Now, if one of the servers is down, other servers can handle the client requests. If a load balancer is down, other load balancers can take requests and send them to the server.

因此，我们需要多个负载均衡器来消除系统的单点故障。现在，如果其中一台服务器已关闭，则其他服务器可以处理客户端请求。如果负载均衡器已关闭，则其他负载均衡器可以接收请求并将其发送到服务器。

Author) 作者 )

结论： (Conclusion:)

Availability is a key property of a distributed system. If you want to make a system available, you need to eradicate any single point of failure in that system. And you can do so by making that part of the system redundant. Besides, you need to have a process in the system that handles system failures. In case of system failure, it may require human intervention to get the system back up; then you need to have processes in the system which ensure that human gets notification of the system failure within a timeframe.

可用性是分布式系统的关键属性。如果要使系统可用，则需要消除该系统中的任何单点故障。您可以通过使系统的该部分冗余来实现。此外，您需要在系统中有一个处理系统故障的过程。如果系统出现故障，可能需要人工干预才能恢复系统。那么您需要在系统中拥有一些流程，以确保人员能够在一定时间内获得系统故障的通知。

翻译自: https://towardsdatascience.com/availability-in-distributed-systems-adb43df78b9a

分布式系统的高可用性

相关资源：jdk-8u281-windows-x64.exe

Processed: 0.010, SQL: 8