When scaling a network service, there are always two concerns: resiliency and load distribution, to understand these concepts let us first understand the broader term “Redundancy”.
Redundancy is the duplication of a component to increase reliability of the system, usually in the form of a backup, fail-safe, or to improve actual system performance.
Resiliency is part of the “reliability” goal of redundancy, it comes in the form of High Availability (but it’s not limited to this).
Load distribution is part of the “system performance” goal.
Let’s not forget that there is also a case for redundancy when you want to have a service available in different zones, i.e clients in a different part of the world not having to send requests to a service hosted in a different continent, but this is a special case of replication for performance from the client’s perspective (although could be used for resiliency as well). We will leave this case out for now.
Why does resiliency matter?
In an ideal case, we want to optimize for both resiliency and load but achieving this is challenging as there are always trade-offs and things to take into account when deciding how to do it.
Let’s dig deeper into this with a real-world scenario.
At Armory, we focus on deployment solutions at scale. Armory Agent for Kubernetes (Agent for short) is able to monitor Kubernetes clusters and execute CRUD operations of Kubernetes objects in it. It works as a bridge between Kubernetes and a continuous-delivery tool such as Spinnaker.
As you can imagine, any pipeline that targets a Kubernetes cluster will rely on the Agent. If the service is not available, there is a risk of pipelines halting. Much worse, you can end with an unexpected result the next time you retry it if the pipeline is not idempotent.
It is not realistic to expect atomicity in a pipeline or side-effect-free pipelines on a continuous delivery tool. This is where Resiliency comes into play. In the case of Armory Agent, having multiple replicas to ensure that the service is highly available will help prevent scenarios like the above.
How is high availability achieved in software engineering?
High availability is achieved by replicating a component. And you get exact replicas by copying the state across replicas or by making a service as stateless as possible. That’s part of the popularity of JWT over sessions in REST: a system won’t have to migrate a session if a request arrives in a different replica other than the original one because a REST API is able to operate in a stateless manner.
Now going back to our case, remember that one of the features of the Agent is monitoring Kubernetes clusters.
What happens now is that you have more than one service monitoring the same cluster(s). While this is not a problem itself, nothing comes for free. In software engineering, there is no magic behind anything. Any extra events being sent means extra network calls, extra CPU usage and extra RAM usage of a node or server for events that the original replica is already sending.
Granted if one instance goes down, you can rest assured knowing that an identical one is hot and ready. However, there is not much use in monitoring the same cluster in different replicas and having all of them report the same stuff.
So how do we attain load distribution?
We have two main options:
- Configure each instance differently but you no longer have fault tolerance because now each of them become critical to their own subset of tasks.
- Apply a mechanism to make each replica aware of the other. Or you can have an orchestrator of replicas, but you introduce statefulness in the later cases.
But what happens if a master replica is getting clogged with too many write requests? Or what happens if the orchestrator goes down? For these issues, there are no other options than replicating the orchestrator or the master replica itself.
Does that mean it is not possible to achieve load distribution and high availability at the same time?
No, it does not. It is possible but there is cost involved depending on the type of service you are scaling. A REST API that’s doing nothing but waiting for HTTP requests can 100% be scaled and achieve both high availability and load distribution with the same instances. On the other hand, a stateful component, such as a database, would need a copy of its state across replicas and a master node as well as an active component, such as a message broker or Armory’s agent for Kubernetes who are constantly pushing events need an orchestration of some kind. These are a few examples:
- RabbitMQ: needs an additional message exchange with the sharding plugin. This means having one of the nodes perform additional tasks to maintain this.
- Armory Agent: requires configuring each instance differently or having account segmentation upon registration by the client (such as in the Clouddriver service).
- Apache Kafka: requires having different partitions in each broker, and there is usually a partition leader who resolves to which partition each message goes to based on its key. This means that you have additional work made by one broker and also one broker could be storing more data than others. And then all of this state is kept in Zookeeper.
What should come first, scalability or high availability?
High Availability should come first due to the risk of not having it, especially if there is loss of data involved. Usually scalability comes afterwards to support the first. There is no sense in having a highly performant cluster of services if it means that one of them crashing will crash the whole system.
Replicating for performance and fault tolerance are two different things, and you can’t expect to be able to tackle both with the same group of replicas.
When trying to achieve robustness in a software environment, it is very important to be aware of the nature of the components to scale, and scale accordingly depending on your use case.
You may find that a balance between the both is the way to go but never overlook the high availability of your services.