Resiliency and Load distribution

Dec 16, 2021 by Daniel Gonzalez


When scaling a network service, there are always two concerns: resiliency and load distribution, to understand these concepts let us first understand the broader term “Redundancy”.

Redundancy is the duplication of a component to increase reliability of the system, usually in the form of a backup, fail-safe, or to improve actual system performance.

Resiliency is part of the “reliability” goal of redundancy, it comes in the form of High Availability (but it’s not limited to this).

Load distribution is part of the “system performance” goal.

Let’s not forget that there is also a case for redundancy when you want to have a service available in different zones, i.e clients in a different part of the world not having to send requests to a service hosted in a different continent, but this is a special case of replication for performance from the client’s perspective (although could be used for resiliency as well). We will leave this case out for now.

Why does resiliency matter?

In an ideal case, we want to optimize for both resiliency and load but achieving this is challenging as there are always trade-offs and things to take into account when deciding how to do it.

Let’s dig deeper into this with a real-world scenario.

At Armory, we focus on deployment solutions at scale. Armory Agent for Kubernetes (Agent for short) is able to monitor Kubernetes clusters and execute CRUD operations of Kubernetes objects in it. It works as a bridge between Kubernetes and a continuous-delivery tool such as Spinnaker.

As you can imagine, any pipeline that targets a Kubernetes cluster will rely on the Agent. If the service is not available, there is a risk of pipelines halting. Much worse, you can end with an unexpected result the next time you retry it if the pipeline is not idempotent.

It is not realistic to expect atomicity in a pipeline or side-effect-free pipelines on a continuous delivery tool. This is where Resiliency comes into play. In the case of Armory Agent, having multiple replicas to ensure that the service is highly available will help prevent scenarios like the above.

How is high availability achieved in software engineering?

High availability is achieved by replicating a component. And you get exact replicas by copying the state across replicas or by making a service as stateless as possible. That’s part of the popularity of JWT over sessions in REST: a system won’t have to migrate a session if a request arrives in a different replica other than the original one because a REST API is able to operate in a stateless manner.

Now going back to our case, remember that one of the features of the Agent is monitoring Kubernetes clusters.

What happens now is that you have more than one service monitoring the same cluster(s). While this is not a problem itself, nothing comes for free. In software engineering, there is no magic behind anything. Any extra events being sent means extra network calls, extra CPU usage and extra RAM usage of a node or server for events that the original replica is already sending.

Granted if one instance goes down, you can rest assured knowing that an identical one is hot and ready. However, there is not much use in monitoring the same cluster in different replicas and having all of them report the same stuff.

So how do we attain load distribution?

We have two main options:

But what happens if a master replica is getting clogged with too many write requests? Or what happens if the orchestrator goes down? For these issues, there are no other options than replicating the orchestrator or the master replica itself.

Does that mean it is not possible to achieve load distribution and high availability at the same time?

No, it does not. It is possible but there is cost involved depending on the type of service you are scaling. A REST API that’s doing nothing but waiting for HTTP requests can 100% be scaled and achieve both high availability and load distribution with the same instances. On the other hand, a stateful component, such as a database, would need a copy of its state across replicas and a master node as well as an active component, such as a message broker or Armory’s agent for Kubernetes who are constantly pushing events need an orchestration of some kind. These are a few examples:

What should come first, scalability or high availability?

High Availability should come first due to the risk of not having it, especially if there is loss of data involved. Usually scalability comes afterwards to support the first. There is no sense in having a highly performant cluster of services if it means that one of them crashing will crash the whole system.


Replicating for performance and fault tolerance are two different things, and you can’t expect to be able to tackle both with the same group of replicas.

When trying to achieve robustness in a software environment, it is very important to be aware of the nature of the components to scale, and scale accordingly depending on your use case.

You may find that a balance between the both is the way to go but never overlook the high availability of your services.


Share this post:

Recently Published Posts

Continuous Deployments meet Continuous Communication

Sep 7, 2023

Automation and the SDLC Automating the software development life cycle has been one of the highest priorities for teams since development became a profession. We know that automation can cut down on burnout and increase efficiency, giving back time to ourselves and our teams to dig in and bust out innovative ideas. If it’s not […]

Read more

Happy 7th Birthday, Armory!

Aug 21, 2023

Happy 7th birthday, Armory! Today we’re celebrating Armory’s 7th birthday. The parenting/startups analogy is somewhat overused but timely as many families (at least in the US) are sending their kids back to school this week. They say that parenting doesn’t get easier with age – the challenges simply change as children grow, undoubtedly true for […]

Read more

Visit the New Armory Developer Portal

Aug 11, 2023

Easier Access to Tutorials, Release Notes, Documentation, and More! Developer Experience (DX) is one of Armory’s top focuses for 2023. In addition to improving developer experience through Continuous Deployment, we’re also working hard to improve DX for all of our solutions.  According to ThoughtWorks, poor information management and dissemination accounts for a large percentage of […]

Read more