
How do you grow and scale?

Nov 4, 2021 by Christos Arvanitis

Unlocking innovation by making software delivery reliable, scalable, and safe is complex. It becomes even more complex when your delivery platform cannot scale properly and fails to meet the needs of your development teams when they need it most.

Spinnaker is awesome, no question. We know Spinnaker well, and we love Spinnaker, which is why we know that operating Spinnaker is complex. Between configuring Spinnaker, managing the underlying infrastructure, monitoring and meeting the SLAs/SLOs, your team will have their hands full. This (without a doubt) will lead to operational inefficiencies that will eventually slow down your deployment frequency.

Scale Spinnaker for faster growth

Operating at scale comes with a lot of pain points for the team operating Spinnaker. Armory’s Managed team operates Armory Enterprise for Spinnaker environments at scale every day, so we’re very aware of these pain points. Given that, we’ve compiled a few tips to help you scale:

  • Start with the platform itself. Kubernetes is the preferred choice for deploying Spinnaker, so make sure that your Kubernetes cluster is provisioned with enough disk IOPS. Insufficient capacity can dramatically increase Clouddriver’s caching agent latency and the startup times for all Spinnaker services.
  • Plan ahead for your scaling needs and ensure that your Kubernetes cluster node pools are ready for both vertical and horizontal scaling of Spinnaker services.
  • Moving on to the caching layer. All services can use Redis, but is this the right choice? For Clouddriver, Orca, Front50, and Echo, it’s possible to use SQL for all the persistence use cases instead of Redis. Using SQL gives Spinnaker a performance boost, but there are a couple of things to consider (and always monitor):
    • SQL throughput and write latency: As you scale from a couple of dozen Clouddriver accounts to hundreds, your SQL database must not become a bottleneck for account caching. If you are using AWS RDS, consider pre-provisioning the storage capacity according to your desired disk IOPS.
    • Increase the JDBC connection pool size for Spinnaker services that use SQL so that it can support your load. Clouddriver is heavy on both reads and writes since it is caching your cloud providers, and Orca’s execution repository will do a huge amount of reads.
    • All Spinnaker services can be horizontally and vertically scaled. Rely on JVM/Memory/CPU as well as other application metrics for each service to determine the appropriate scaling needs.
  • Always set PodDisruptionBudgets for the Spinnaker services. They matter most during operations like upgrading your Kubernetes version or updating/upgrading your node pools.
  • Monitor Redis to ensure the availability and performance of your environment. When you lose Redis, you lose Orca’s work queue, Fiat’s authorization of your users, and Rosco’s AMI baking.
  • Clouddriver keeps track of your cloud world, but it needs caching agents to do so. Increase the number of caching agents Clouddriver can run, since agents are scheduled per distinct cloudProvider, account, region, and resourceType. As a general principle:
    • Each EC2 account schedules 15-20 caching agents per region.
    • Each ECS account schedules 10 caching agents per region.
    • Each Kubernetes account schedules 2 caching agents per cacheThread.
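
As an illustration of the PodDisruptionBudget tip above, here is a minimal Python sketch that generates PodDisruptionBudget manifests as JSON. The function name, the "spinnaker" namespace, the "spin-<service>" naming, and the pod labels are assumptions made for this example, not something this post prescribes. Check the labels on your own pods before using a selector like this.

```python
import json


def pod_disruption_budget(service, min_available=1, namespace="spinnaker"):
    """Build a policy/v1 PodDisruptionBudget manifest for one Spinnaker service.

    The namespace, resource name, and pod labels are assumptions for this
    sketch; verify them with `kubectl get pods --show-labels`.
    """
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": f"spin-{service}-pdb", "namespace": namespace},
        "spec": {
            # Keep at least this many pods running during voluntary
            # disruptions such as node drains.
            "minAvailable": min_available,
            "selector": {
                "matchLabels": {"app": "spin", "cluster": f"spin-{service}"}
            },
        },
    }


def pdb_list(services, min_available=1):
    """Wrap one budget per service in a v1 List so it can be applied at once."""
    return {
        "apiVersion": "v1",
        "kind": "List",
        "items": [pod_disruption_budget(s, min_available) for s in services],
    }


if __name__ == "__main__":
    print(json.dumps(pdb_list(["clouddriver", "orca", "gate"]), indent=2))
```

The printed JSON can be piped to kubectl apply -f - if the labels match your deployment. Note that minAvailable: 1 on a service running a single replica blocks node drains for that pod, so pair the budget with at least two replicas.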

Let’s put all of the above into an example:

Consider a very simple use case where we have 30 AWS accounts, 10 ECS accounts, 100 Kubernetes clusters, and we need to onboard another 50 Kubernetes clusters with 2 cacheThreads.

A rough estimate for the current running caching agents, assuming one region per account and the upper bound of 20 agents per EC2 account, would be:

30 x 20 + 10 x 10 + 100 x 2 x 2 = 1100 caching agents

Clouddriver will need an additional 200 caching agents (50 x 2 x 2) for the 50 new Kubernetes clusters, for a total of 1300. The startup time of Clouddriver on every deployment could be up to 8-10 minutes, and we would need at least 13 Clouddriver replicas with the default concurrent caching agents config.

Applying some of the tips discussed above lets this scenario scale much further. For instance, by tripling the concurrent caching agents and vertically scaling Clouddriver, you would need only 5 Clouddriver replicas. Additionally, by switching to 3000 baseline IOPS disks, you will drastically reduce the start-up time of your Spinnaker services.
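
The arithmetic in this example can be reproduced with a short Python sketch. The function names are hypothetical, and the figure of 100 concurrent caching agents per replica is an assumption inferred from the 13-replica estimate above rather than a documented default. The sketch also assumes one region per account and uses the upper bound of 20 agents per EC2 account.

```python
import math

# Rules of thumb from the text (upper bound used for EC2).
EC2_AGENTS_PER_REGION = 20
ECS_AGENTS_PER_REGION = 10
K8S_AGENTS_PER_CACHE_THREAD = 2


def estimate_caching_agents(ec2_accounts, ecs_accounts, k8s_accounts,
                            cache_threads, regions_per_account=1):
    """Rough count of caching agents Clouddriver has to schedule."""
    ec2 = ec2_accounts * regions_per_account * EC2_AGENTS_PER_REGION
    ecs = ecs_accounts * regions_per_account * ECS_AGENTS_PER_REGION
    k8s = k8s_accounts * cache_threads * K8S_AGENTS_PER_CACHE_THREAD
    return ec2 + ecs + k8s


def replicas_needed(total_agents, agents_per_replica):
    """Minimum replicas so that every agent fits into a concurrency slot."""
    return math.ceil(total_agents / agents_per_replica)


if __name__ == "__main__":
    # Assumed per-replica concurrency, inferred from the example's numbers.
    assumed_default = 100

    current = estimate_caching_agents(30, 10, 100, cache_threads=2)
    after = estimate_caching_agents(30, 10, 150, cache_threads=2)

    print("current agents:", current)
    print("agents after onboarding:", after)
    print("replicas at assumed default:", replicas_needed(after, assumed_default))
    print("replicas at tripled concurrency:",
          replicas_needed(after, 3 * assumed_default))
```

Treat the output as a starting point for capacity planning. Real agent counts vary with the resource types and regions enabled per account, so confirm them against Clouddriver’s own metrics.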

Are these tips enough?

At Armory, we have open-sourced the Spinnaker Observability Plugin, and it is a great fit for monitoring when Spinnaker is operating at scale. Combined with the spinnaker-mixin dashboards, the plugin is a great solution for gaining insight into the operational performance of your Spinnaker platform and (by extension) your software delivery.

Your Spinnaker environment will perform faster, scale better, and be more resilient, but some operational burden will still be there. For instance, every new Kubernetes account will require your team to set up the networking, proper permissions, and configuration, and then redeploy Clouddriver.

One option for addressing these ongoing operational issues is Armory Agent for Kubernetes. The Armory Agent for Kubernetes can help you reach massive scale by enabling account management at the team level and by onboarding Kubernetes clusters on the fly, without the need to redeploy Clouddriver.

For more information about either Armory’s Managed offering or the Armory Agent for Kubernetes, contact us. We’d love to hear from you!
