Tales from the Trenches: Deployed-to-Prod Horror Stories

Jul 14, 2023 by Adam Frank

We’ve all heard horror stories about failed production deployments. Chances are you’ve even experienced one firsthand! The cautionary tales below, gathered from across the internet, show how nonexistent processes, poor tooling, and talent drain can lead to huge headaches…and sometimes catastrophe.

How One Failed Deployment Bankrupted an Entire Company

This is the DevOps nightmare to end all DevOps nightmares.

Imagine a system sending rapid, automated orders into the stock market without tracking whether those orders had already been executed.

In 2012, a company called Knight Capital lost $440 million in 45 minutes and disrupted the stock market, all due to one failed deployment. 

Technical debt, zombie code, and a botched deployment process meant that new automated trading code made it to only seven of their eight production servers. The eighth server still had old, broken code. When the server with the zombie code was activated, it began making automated trades within seconds.
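A partial rollout like this one is the kind of thing a pre-activation check can catch. Here is a minimal Python sketch, assuming each server can report the version it is running. The host names, version strings, and helper functions are hypothetical illustrations, not anything from Knight Capital's actual systems.

```python
def find_version_drift(reported: dict[str, str], expected: str) -> list[str]:
    """Return the hosts whose reported version differs from the expected one."""
    return sorted(host for host, version in reported.items() if version != expected)


def safe_to_activate(reported: dict[str, str], expected: str) -> bool:
    """Allow activation only when every host reports the expected version."""
    return not find_version_drift(reported, expected)


# Hypothetical fleet: seven servers updated, one left behind.
fleet = {f"prod-{i}": "2.0.0" for i in range(1, 8)}
fleet["prod-8"] = "1.4.2"

drifted = find_version_drift(fleet, expected="2.0.0")
if drifted:
    print("Refusing to activate; out-of-date hosts:", drifted)
```

The point of the design is that activation is blocked by default: the new code path only switches on once every host has confirmed it is running the same release.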

One deployment caused a disruption in the prices of hundreds of stocks and moved millions of shares, all in minutes. 

“When the market opened at 9:30 AM people quickly knew something was wrong. By 9:31 AM it was evident to many people on Wall Street that something serious was happening. The market was being flooded with orders out of the ordinary for regular trading volumes on certain stocks. By 9:32 AM many people on Wall Street were wondering why it hadn’t stopped. This was an eternity in high-speed trading terms. Why hadn’t someone hit the kill-switch on whatever system was doing this? As it turns out there was no kill switch.”

The kicker? Emails about the bad trades were sent to staff, but they weren’t marked as urgent alerts, allowing orders to continue until the problem server was discovered and manually shut down.

By then, it was too late.
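For a sense of what the missing kill switch might have looked like, here is a minimal Python sketch of a rate-based circuit breaker. Everything in it is an assumption made for illustration: the `KillSwitch` class, the thresholds, and the idea of tripping on order volume alone. A real trading system would need far more than this.

```python
from collections import deque


class KillSwitch:
    """Trips when more than max_orders arrive within window_seconds.

    Once tripped, it stays tripped until a human calls reset().
    """

    def __init__(self, max_orders: int, window_seconds: float) -> None:
        self.max_orders = max_orders
        self.window_seconds = window_seconds
        self.tripped = False
        self._timestamps: deque[float] = deque()

    def allow(self, now: float) -> bool:
        """Return True if an order at time `now` may be sent."""
        if self.tripped:
            return False
        # Forget orders that have fallen out of the sliding window.
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.max_orders:
            self.tripped = True  # Halt everything; do not auto-recover.
            return False
        self._timestamps.append(now)
        return True

    def reset(self) -> None:
        """Manual re-arm after someone has investigated."""
        self.tripped = False
        self._timestamps.clear()


# Hypothetical limit: at most 3 orders per second.
switch = KillSwitch(max_orders=3, window_seconds=1.0)
for t in (0.0, 0.1, 0.2, 0.3):
    print(t, switch.allow(t))
```

Requiring a manual `reset()` is deliberate in this sketch: the switch fails closed, so a flood of unexpected orders stops on its own instead of waiting for someone to read an email.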

Read more about this deployment catastrophe here:

Nepotism Strikes Again!

“Earlier that week, we gave a status update presentation to the board showing that we’d reached a milestone for core functionality and we were progressing nicely. In the CEO’s mind he thought, “looks good to me, let’s launch it”. So he told his son to launch the new site and never bothered to tell anybody in the IT department. 

Nobody outside of the owner and his son knew this was going on and being abhorrently technically incompetent the son did a straight copy/paste of the development code into production and nuked what was live.”

Source: https://www.reddit.com/r/webdev/comments/1vfnr7/deployment_horror_stories/

Cut and Paste

“They … had a deployment script set up to update code to web servers. Around 120 web servers, code was copied via a robocopy command – pulling from one server that was manually updated and copying to the other 119.”

Source: https://www.reddit.com/r/devops/comments/qhqyhl/share_your_devops_horror_stories/

Name Spacing <<< Clusters?

“I worked at a start-up where the DevOps team was extremely inexperienced and ended up creating a Kubernetes cluster per client they had instead of namespacing. They had over 100 clients. Each client was pretty much a snowflake in its code base as the development team was too inexperienced to implement multi-tenant architecture. To add salt to the wound the DevOps team quickly went from 6 people to 2.

Needless to say maintaining production support for 100+ kubernetes clusters with +100 slightly varied application code bases was a bad time. Deploying was even worse.”

Source: https://www.reddit.com/r/devops/comments/qhqyhl/share_your_devops_horror_stories/
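For comparison, the namespacing approach the poster mentions keeps every client in one cluster and gives each its own Kubernetes Namespace. The Python sketch below generates those Namespace manifests. The client names, the `tenant-` prefix, and the helper functions are hypothetical, and a real multi-tenant setup would also need quotas, network policies, and access controls that this sketch leaves out.

```python
import json
import re

MAX_NAME_LENGTH = 63  # Namespace names must be valid DNS labels.


def namespace_name(client: str, prefix: str = "tenant-") -> str:
    """Turn a client name into a lowercase, hyphen-separated namespace name."""
    slug = re.sub(r"[^a-z0-9]+", "-", client.lower()).strip("-")
    if not slug:
        raise ValueError(f"cannot derive a namespace name from {client!r}")
    return (prefix + slug)[:MAX_NAME_LENGTH].rstrip("-")


def namespace_manifest(client: str) -> dict:
    """Build a minimal Namespace manifest for one client."""
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {
            "name": namespace_name(client),
            "labels": {"client": namespace_name(client, prefix="")},
        },
    }


# Hypothetical clients sharing a single cluster.
clients = ["Acme Corp", "Globex", "Initech LLC"]
manifests = [namespace_manifest(c) for c in clients]
print(json.dumps(manifests, indent=2))
```

With this layout, onboarding client number 101 means adding one more namespace to an existing cluster instead of standing up and maintaining another cluster.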

You’re Not Alone.

Remember, however, that a single developer is rarely responsible for breaking prod entirely on their own. As in all of the examples above, there are usually systemic problems in an organization that allow it to happen.

While it’s true that you can’t fix stupid, you can prevent a lot of problems with a strong, secure, resilient deployment process, backed by the right tools. Organizations that focus on the systemic problems, and fix them, realize benefits and reduce risks.

Learn how Continuous Deployment-as-a-Service improves reliability and cuts down on the risks of failed deployments: https://www.armory.io/products/progressive-deployment-strategies/ 
