A Scripting Nightmare That You Can’t Wake Up From: How to Develop a Stable Home-Grown CI/CD Platform
Developing a stable home-grown CI/CD platform can cost a lot of engineering effort, and the benefits for business and products are not always clear. In this article, we explore how engineering effort is wasted developing home-grown CI/CD solutions with disjointed tooling and ad-hoc scripting (glue code), or even building tools with a feature set similar to existing commercial offerings.
While it’s important to have a reliable delivery system in place, improving business-related metrics, such as cycle time, requires changing the actual delivery process that takes changes from commit into production.
Let’s travel back in time to 2010. Jenkins is the undisputed leader in CI tools. Back then, many organizations had not even adopted automated builds, let alone CI. Travis CI will enter the scene in 2011 and will gain focused adoption in the open source sector by 2012 with its “love.travis-ci.org” crowdfunding campaign.
Also in 2010, the Continuous Delivery book is published, winning the Jolt award for top book of the year. A core idea introduced by Dave Farley and Jez Humble was, of course, the deployment pipeline.
As word spread, pipeline plugins for Jenkins popped up. These were, in essence, visualization tools: wrappers for the Jenkins concepts of “jobs” and “upstream and downstream connections.”
At the time, this was an advancement in visualizing progress and raising awareness of build, test, and deploy failures across development teams. However, there were no abstractions in Jenkins regarding pipeline definition, pipeline dependencies, grouping, role-based approvals, much less what we now know as “pipeline-as-code.”
First, the engineer would add new jobs and specify dependencies on other jobs via the UI, then run a build and see if the pipeline visualization corresponded to what they had in mind. All these manual configurations were stored internally by Jenkins and had to be reproduced manually for every new pipeline. Changing an existing pipeline’s flow required a trial-and-error process, switching between job definitions and the latest pipeline view until you got it right.
This lack of native pipeline design, coupled with a lack of transparent source control for the job definitions, made pipeline configurations extremely error-prone, time-consuming, and unreliable.
A Scripting Nightmare That You Can’t Wake Up From
Fast forward to 2018. We now have our pipeline definitions in code (or config), version controlled in git, along with all the scripts we need to execute tasks in our pipeline. Changes are picked up and automatically applied by our CI/CD tool.
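As a sketch of what pipeline-as-code looks like in practice, a version-controlled pipeline definition might resemble the following. The syntax and names here are illustrative, loosely modeled on common YAML-based CI configs, and do not correspond to any specific tool:

```yaml
# .ci/pipeline.yml -- hypothetical pipeline-as-code definition, versioned in git
stages:
  - build:
      script: ./gradlew assemble
  - test:
      script: ./gradlew check
      depends_on: [build]
  - deploy-staging:
      script: ./scripts/deploy.sh staging
      depends_on: [test]
  - deploy-production:
      script: ./scripts/deploy.sh production
      depends_on: [deploy-staging]
      requires_approval: true   # role-based manual gate, absent from early Jenkins
```

Because the definition lives next to the application code, pipeline changes are reviewed, diffed, and rolled back like any other change, instead of being hand-reproduced through a UI.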
It’s all good, right? Well, not exactly. The good news is that we are now able to codify our delivery processes and automate them where possible. The bad news is that the resulting code (scripts) is often a fragile amalgamation of tiny changes made by many different people over a long period.
Refactoring the scripts into parameterized common functions, and replacing them with third-party tools that can do the job better, is a good way forward. However, whether you tackle a large number of scripts over a long period of time (following the “boy scout rule” of leaving code better than you found it) or dedicate multiple people (or teams) to fixing all the code as quickly as possible, the transition is always costly. Do you rip off the band-aid quickly and painfully, or slowly, with less pain, over a more extended period?
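As a minimal sketch of such parameterization (all names here are hypothetical), a deploy snippet copy-pasted across dozens of job scripts can be collapsed into one shared function that every pipeline calls:

```python
# shared_ci.py -- hypothetical shared library extracted from copy-pasted glue scripts
import subprocess


def deploy(service: str, environment: str, version: str, dry_run: bool = False) -> list:
    """Build the deploy command in one place, instead of re-scripting it per pipeline.

    With dry_run=True the command is returned without executing, which makes the
    glue code unit-testable without touching real environments.
    """
    cmd = ["./scripts/deploy.sh", service, environment, version]
    if dry_run:
        return cmd
    subprocess.run(cmd, check=True)
    return cmd
```

A pipeline step then reduces to a one-liner such as `deploy("billing", "staging", "1.4.2")`, and a fix to the deploy logic lands in one file rather than in every job.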
The pain is aggravated by the lack of a realistic test environment for pipeline changes (most organizations don’t have one), causing more rollbacks and failing pipelines while changes to the scripts are underway.
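Short of a full test environment, even a cheap static check over pipeline definitions catches many errors before they break real builds. A minimal sketch, assuming a hypothetical config schema with named stages and `depends_on` references:

```python
# validate_pipeline.py -- hypothetical pre-merge sanity check for pipeline definitions
def validate(pipeline: dict) -> list:
    """Return a list of human-readable errors; an empty list means the definition looks sane."""
    errors = []
    stages = pipeline.get("stages", {})
    for name, stage in stages.items():
        if "script" not in stage:
            errors.append(f"stage '{name}' has no script")
        for dep in stage.get("depends_on", []):
            if dep not in stages:
                errors.append(f"stage '{name}' depends on unknown stage '{dep}'")
    return errors
```

Running a check like this in the pipeline repository’s own CI turns “push and see if it breaks” into a reviewable failure before merge.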
After all is said and done, the code may soon begin to decay until it’s screaming for help again. Without a change in ownership of the pipeline orchestration code and reasonable quality safeguards around it, it’s unlikely that the refactoring work will pay off in the long run.
Early adopters who understood the importance of CD for keeping their software delivery fast and visible often went the route of creating homegrown tools to fit their exact requirements and delivery processes.
This approach can deliver a significant number of benefits, such as:
- Reducing unnecessary steps and configurations–the tool has sensible, context-aware defaults
- Speeding up the delivery process by optimizing known common steps
- Providing a familiar user experience for developers and other stakeholders
- Reducing boilerplate code (in job/stage definitions) and the scripting nightmare
However, over time the delivery process naturally gets more and more entrenched in the homegrown tools. The reality is that the more a tool is fit for the business by embedding process knowledge, the less it is fit for engineering. This is particularly true for new, groundbreaking practices such as canary releasing, feature toggles, or chaos engineering, to name a few.
Tooling Team Gone Wrong
Besides the homegrown tooling becoming an impediment to adopting game-changing practices, there’s also the social aspect around the team tasked with maintaining and improving the homegrown tools.
Naturally, this team feels emotionally invested in the tools. Even if the tools have reached their peak of maturity and extensibility, the team feels the need to keep going, which can lead to one or more of the following pernicious effects:
- Allowing feature creep without adequate vetting of new requests
- Pursuing interesting but unnecessary technical advancements, for example, rewriting parts of the tooling in a newer, cooler programming language
- Building features similar to tools already available in the market (which are likely of better quality, since they are battle-tested by the mass market)
Meanwhile, senior management who initially decided to invest in the internal tooling team will probably not entertain much discussion on moving to third-party tools, even when the latter provide the features that the tooling team plans to build out. The sunk cost fallacy kicks in, with an emotional response of “we can’t let go after all this investment” prevailing over a more rational “cost of delay” approach.
The State of DevOps report is clear: high performers deploy 200x more often than low performers, and their lead time is 2,555x shorter! Reaching these impressive figures requires, understandably, a long journey of continuous improvement and sound engineering practices. The recent book “Accelerate: Building and Scaling High Performing Technology Organizations” by Nicole Forsgren, Jez Humble, and Gene Kim provides data-based insights on what those practices are as of today.
Two approaches to CI/CD limit the ability of organizations to reach peak performance. One approach (perhaps better stated as a lack of approach) is the unplanned, ad-hoc adoption of a toolchain, made up of pieces proposed by enthusiastic individuals or teams. The repercussions of this approach linger for many years and compound as delivery requirements evolve, since the toolchain originated with little concern for fitness for purpose.
A disjointed toolchain also makes it challenging to consolidate data to analyze the flow of work. Given that up to 85% of cycle time is often spent in wait states, it is critical to understand the time-consuming parts of the delivery cycle, but more importantly, where and for how long changes are waiting for someone to take action or approve them. Once we understand that, we can accelerate and optimize cycle time by reducing waits, handovers, unnecessary steps and approvals, and so on.
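To make the wait-state share concrete, cycle time can be decomposed from timestamped state transitions recorded by the toolchain. A minimal sketch with made-up state names and durations:

```python
# wait_time.py -- hypothetical decomposition of cycle time into active vs. wait states

# States where a change is idle, waiting for a human to act (names are illustrative).
WAIT_STATES = {"awaiting_review", "awaiting_approval", "in_queue"}


def wait_fraction(events) -> float:
    """events: list of (state, duration_hours) tuples for one change's journey.

    Returns the share of total cycle time spent in wait states (0.0 if no data).
    """
    total = sum(duration for _, duration in events)
    waiting = sum(duration for state, duration in events if state in WAIT_STATES)
    return waiting / total if total else 0.0
```

For example, a change that took 4h of coding, 20h awaiting review, 1h in review, 30h awaiting approval, and 0.5h deploying spent roughly 90% of its 55.5h cycle time just waiting, which is where the optimization effort should go.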
The second approach that can hinder a faster cycle time is homegrown tools. While these might be a better fit for the organization initially, they tend not to age well, preventing the timely adoption of emerging engineering best practices.
Adopting battle-tested third-party tools and dedicating engineering effort to build pipeline orchestration code that is modular and sufficiently abstracted is the first step towards fast and evolvable delivery processes. This will also ease any future transitions to new tools, as it becomes necessary to support changes to the delivery process and new engineering initiatives for the organization.
Dedicating effort to capture current cycle time is critical to identifying bottlenecks and wait states which in turn allows for optimization of cycle times.
Removing potential blockers in the delivery process–manual tests, approvals, handovers, dependencies between teams, and historical ceremonies–requires considerable effort from multiple individuals, and likely multiple teams, but ultimately it provides the most value to the organization. Remember, the more fragmented and ad-hoc the tooling, the higher the initial effort.
The concepts identified in this article have been around for a while, but only recently crossed the chasm into mainstream adoption. However, the time to adopt is running out, and organizations lagging behind are increasingly at risk of losing their edge to competitors with higher deployment frequencies and state-of-the-art engineering practices.