Dec 16, 2021 by Daniel Gonzalez
When scaling a network service, there are always two concerns: resiliency and load distribution, to understand these concepts let us first understand the broader term “Redundancy”.
Redundancy is the duplication of a component to increase reliability of the system, usually in the form of a backup, fail-safe, or to improve actual system performance.
Resiliency is part of the “reliability” goal of redundancy, it comes in the form of High Availability (but it’s not limited to this).
Load distribution is part of the “system performance” goal.
Let’s not forget that there is also a case for redundancy when you want to have a service available in different zones, i.e clients in a different part of the world not having to send requests to a service hosted in a different continent, but this is a special case of replication for performance from the client’s perspective (although could be used for resiliency as well). We will leave this case out for now.
In an ideal case, we want to optimize for both resiliency and load but achieving this is challenging as there are always trade-offs and things to take into account when deciding how to do it.
Let’s dig deeper into this with a real-world scenario.
At Armory, we focus on deployment solutions at scale. Armory Agent for Kubernetes (Agent for short) is able to monitor Kubernetes clusters and execute CRUD operations of Kubernetes objects in it. It works as a bridge between Kubernetes and a continuous-delivery tool such as Spinnaker.
As you can imagine, any pipeline that targets a Kubernetes cluster will rely on the Agent. If the service is not available, there is a risk of pipelines halting. Much worse, you can end with an unexpected result the next time you retry it if the pipeline is not idempotent.
It is not realistic to expect atomicity in a pipeline or side-effect-free pipelines on a continuous delivery tool. This is where Resiliency comes into play. In the case of Armory Agent, having multiple replicas to ensure that the service is highly available will help prevent scenarios like the above.
High availability is achieved by replicating a component. And you get exact replicas by copying the state across replicas or by making a service as stateless as possible. That’s part of the popularity of JWT over sessions in REST: a system won’t have to migrate a session if a request arrives in a different replica other than the original one because a REST API is able to operate in a stateless manner.
Now going back to our case, remember that one of the features of the Agent is monitoring Kubernetes clusters.
What happens now is that you have more than one service monitoring the same cluster(s). While this is not a problem itself, nothing comes for free. In software engineering, there is no magic behind anything. Any extra events being sent means extra network calls, extra CPU usage and extra RAM usage of a node or server for events that the original replica is already sending.
Granted if one instance goes down, you can rest assured knowing that an identical one is hot and ready. However, there is not much use in monitoring the same cluster in different replicas and having all of them report the same stuff.
We have two main options:
But what happens if a master replica is getting clogged with too many write requests? Or what happens if the orchestrator goes down? For these issues, there are no other options than replicating the orchestrator or the master replica itself.
Does that mean it is not possible to achieve load distribution and high availability at the same time?
No, it does not. It is possible but there is cost involved depending on the type of service you are scaling. A REST API that’s doing nothing but waiting for HTTP requests can 100% be scaled and achieve both high availability and load distribution with the same instances. On the other hand, a stateful component, such as a database, would need a copy of its state across replicas and a master node as well as an active component, such as a message broker or Armory’s agent for Kubernetes who are constantly pushing events need an orchestration of some kind. These are a few examples:
High Availability should come first due to the risk of not having it, especially if there is loss of data involved. Usually scalability comes afterwards to support the first. There is no sense in having a highly performant cluster of services if it means that one of them crashing will crash the whole system.
Replicating for performance and fault tolerance are two different things, and you can’t expect to be able to tackle both with the same group of replicas.
When trying to achieve robustness in a software environment, it is very important to be aware of the nature of the components to scale, and scale accordingly depending on your use case.
You may find that a balance between the both is the way to go but never overlook the high availability of your services.
Multi-target deployments can feel tedious as you deploy the same code over and over to multiple clouds and environments — and none of them in the same way. With an automatic multi-target deployment tool, on the other hand, you do the work once and deliver your code everywhere it needs to be. Armory provides an […]
Read more →
KubeCon+CloudNativeCon EU is one of the world’s largest tech conferences. Here, users, developers, and companies who have and intend to adopt the Cloud Native standard of running applications with Kubernetes in their organizations come together for 5 days. From May 16-20, 2022, tech enthusiasts will congregate both virtually and in person in Valencia, Spain to […]
Read more →
Deciding how frequently to release a product is an interesting challenge faced by many companies. There are definite pros and cons related to adjusting your release cadence that have to be evaluated on an individual basis. Faster release cycles in theory might sound good, but of course, there can be tradeoffs. Looking at historical release […]
Read more →