We’ve all heard prod horror stories of a failed deployment. Chances are you’ve even experienced one firsthand! The cautionary tales below, gathered from across the internet, show how non-existent processes, poor tooling, and talent drain can lead to huge headaches…and sometimes catastrophe.
How One Failed Deployment Bankrupted an Entire Company
This is the DevOps nightmare to end all DevOps nightmares.
Imagine a system firing rapid, automated orders into the stock market, with nothing tracking whether those orders had already been executed.
In 2012, a company called Knight Capital lost $440 million in 45 minutes and disrupted the stock market, all due to one failed deployment.
Technical debt, zombie code, and a botched deployment process meant that the new automated trading code made it to only seven of their eight production servers. The eighth server still ran the old, defective code. When that server was activated, the zombie code began placing automated trades within seconds.
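Partial rollouts like this are exactly what a post-deploy verification gate is meant to catch. Below is a minimal sketch, in Python, of a check that refuses to activate new functionality until every production host reports the expected release. The host names, the /version endpoint, and the release tag are illustrative assumptions, not Knight Capital’s actual infrastructure.

```python
# Minimal post-deploy check: confirm every production host reports the
# expected release before the new code path is switched on.
# The host list, /version endpoint, and release tag are hypothetical.
import sys
import urllib.request

EXPECTED_VERSION = "2012.08.01-rlp"                              # hypothetical release tag
HOSTS = [f"trade-{n:02d}.prod.internal" for n in range(1, 9)]    # eight production servers

def deployed_version(host: str) -> str:
    with urllib.request.urlopen(f"http://{host}/version", timeout=5) as resp:
        return resp.read().decode().strip()

def main() -> int:
    stale = []
    for host in HOSTS:
        try:
            version = deployed_version(host)
        except OSError as exc:                  # unreachable host counts as a failure
            stale.append((host, f"unreachable: {exc}"))
            continue
        if version != EXPECTED_VERSION:
            stale.append((host, version))
    if stale:
        for host, found in stale:
            print(f"BLOCK ROLLOUT: {host} is running {found}, expected {EXPECTED_VERSION}")
        return 1                                # non-zero exit fails the pipeline
    print("All hosts on the expected version; safe to activate.")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```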
That one deployment disrupted the prices of hundreds of stocks and moved millions of shares, all in a matter of minutes.
“When the market opened at 9:30 AM people quickly knew something was wrong. By 9:31 AM it was evident to many people on Wall Street that something serious was happening. The market was being flooded with orders out of the ordinary for regular trading volumes on certain stocks. By 9:32 AM many people on Wall Street were wondering why it hadn’t stopped. This was an eternity in high-speed trading terms. Why hadn’t someone hit the kill-switch on whatever system was doing this? As it turns out there was no kill switch.”
The kicker? Emails about the bad trades were sent to staff, but they weren’t marked as urgent alerts, allowing orders to continue until the problem server was discovered and manually shut down.
By then, it was too late.
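The quote above notes that there was no kill switch, and the only alerts were routine emails. A kill switch doesn’t need to be sophisticated; here’s a rough Python sketch of an order gate that halts trading the moment an operator drops a halt file or order volume blows past a sanity ceiling. The file path, the rate limit, and the send_order hook are hypothetical, not anything from Knight Capital’s system.

```python
# Sketch of a kill switch for an automated order loop: a manual halt file
# an operator can create, plus an automatic trip when order volume exceeds
# a sanity threshold. Paths, limits, and send_order() are made up.
import os
import time

KILL_FILE = "/var/run/trading/HALT"      # operator creates this file to stop everything
MAX_ORDERS_PER_MINUTE = 1_000            # hypothetical sanity ceiling

class KillSwitchTripped(RuntimeError):
    pass

class OrderGate:
    def __init__(self) -> None:
        self._window_start = time.monotonic()
        self._orders_in_window = 0

    def check(self) -> None:
        if os.path.exists(KILL_FILE):
            raise KillSwitchTripped("manual halt file present")
        now = time.monotonic()
        if now - self._window_start >= 60:             # reset the one-minute window
            self._window_start, self._orders_in_window = now, 0
        self._orders_in_window += 1
        if self._orders_in_window > MAX_ORDERS_PER_MINUTE:
            raise KillSwitchTripped("order rate above sanity threshold")

def run(orders, send_order, gate: OrderGate) -> None:
    for order in orders:
        gate.check()          # every order passes the gate before it leaves the building
        send_order(order)
```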
Read more about this deployment catastrophe here:
Nepotism Strikes Again!
“Earlier that week, we gave a status update presentation to the board showing that we’d reached a milestone for core functionality and we were progressing nicely. In the CEO’s mind he thought, “looks good to me, let’s launch it”. So he told his son to launch the new site and never bothered to tell anybody in the IT department.
Nobody outside of the owner and his son knew this was going on and being abhorrently technically incompetent the son did a straight copy/paste of the development code into production and nuked what was live.”
Source: https://www.reddit.com/r/webdev/comments/1vfnr7/deployment_horror_stories/
Cut and Paste
“They … had a deployment script set up to update code to web servers. Around 120 web servers, code was copied via a robocopy command – pulling from one server that was manually updated and copying to the other 119.”
Source: https://www.reddit.com/r/devops/comments/qhqyhl/share_your_devops_horror_stories/
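Pulling code from one hand-patched “golden” server onto 119 others means whatever drift exists on that box ships everywhere. A safer pattern is to deploy a versioned, checksummed artifact produced by CI, so every target installs exactly the same bytes. The sketch below assumes a hypothetical artifact URL, checksum, and install path published by the build pipeline.

```python
# Sketch of artifact-based deployment: instead of copying files from a
# hand-edited server, every target pulls the same versioned, checksummed
# build from CI. URL, checksum, and destination path are illustrative only.
import hashlib
import pathlib
import urllib.request

ARTIFACT_URL = "https://ci.example.internal/builds/webapp-1.42.0.tar.gz"
EXPECTED_SHA256 = "<sha256 published by the CI pipeline>"   # placeholder
DEST = pathlib.Path("/opt/webapp/releases/webapp-1.42.0.tar.gz")

def fetch_artifact() -> bytes:
    with urllib.request.urlopen(ARTIFACT_URL, timeout=30) as resp:
        return resp.read()

def main() -> None:
    blob = fetch_artifact()
    digest = hashlib.sha256(blob).hexdigest()
    if digest != EXPECTED_SHA256:
        raise SystemExit(f"checksum mismatch: got {digest}")   # refuse to install drifted bits
    DEST.parent.mkdir(parents=True, exist_ok=True)
    DEST.write_bytes(blob)
    print(f"staged {DEST} ({len(blob)} bytes), checksum verified")

if __name__ == "__main__":
    main()
```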
Name Spacing <<< Clusters?
“I worked at a start-up where the DevOps team was extremely inexperienced and ended up creating a kubernetes cluster per client they had instead of name spacing. They had over 100 clients. Each client was pretty much a snowflake in it’s code base as the development team was too inexperienced to implement multi tenant architecture. To add salt to the wound the DevOps team quickly went from 6 people to 2.
Needless to say maintaining production support for 100+ kubernetes clusters with +100 slightly varied application code bases was a bad time. Deploying was even worse.”
Source: https://www.reddit.com/r/devops/comments/qhqyhl/share_your_devops_horror_stories/
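For comparison, tenant isolation inside a single shared cluster usually starts with a namespace per client, which can then carry resource quotas and network policies. Here’s a small sketch using the official kubernetes Python client; the tenant names and label scheme are invented for illustration, not the startup’s real setup.

```python
# Sketch of namespace-per-tenant on one shared cluster, using the official
# `kubernetes` Python client (pip install kubernetes). Tenant names and
# labels are made-up examples.
from kubernetes import client, config
from kubernetes.client.rest import ApiException

TENANTS = ["acme", "globex", "initech"]   # hypothetical client list

def ensure_tenant_namespace(core: client.CoreV1Api, tenant: str) -> None:
    ns = client.V1Namespace(
        metadata=client.V1ObjectMeta(
            name=f"tenant-{tenant}",
            labels={"tenant": tenant},    # lets quotas and NetworkPolicies select per tenant
        )
    )
    try:
        core.create_namespace(ns)
        print(f"created namespace tenant-{tenant}")
    except ApiException as exc:
        if exc.status == 409:             # already exists; safe to re-run
            print(f"namespace tenant-{tenant} already present")
        else:
            raise

if __name__ == "__main__":
    config.load_kube_config()             # uses your local kubeconfig
    core = client.CoreV1Api()
    for tenant in TENANTS:
        ensure_tenant_namespace(core, tenant)
```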
You’re Not Alone.
Remember, however, that a single developer is rarely responsible for breaking prod entirely on their own. In every one of the examples above, systemic problems within the organization allowed the failure to happen.
While it’s true that you can’t fix stupid, you can prevent a lot of problems with a strong, secure, resilient deployment process backed by the right tools. Organizations that identify and fix these systemic problems reduce their risk and reap the benefits.