When you bundle everything in one deploy, and something fails.

[DEPLOY] Fail fast. Recover faster.

Daniel Moldovan
2 min read · Apr 1, 2021

A new feature or bug fix must be deployed to production before it can bring business value or improve the client experience. This means that a feature is complete only after it is running in production. Of course, how and when to deploy changes to production depends on many things, such as organizational culture, domain, or application architecture.

We all assume that a deployment never fails, that a bug never makes it into production, and, in general, that releases complete without problems. But what if they don’t? What if we should prepare for failure? What if we should prepare to minimize client impact in case anything goes wrong? What if we should make client experience the top priority of any deployment? How would that impact our deployment strategy? What could we do?

Strategy 1: Whatever, death march, LOL

  • Maximize customer impact: Bundle all changes into one deploy. Deploy all components at once. Deploy to all customers at once. Deploying the change everywhere makes it more likely you will notice any problems with it. If not, your clients will notice.
  • Maximize problem detection time: Manually validate that the change was successful. Wait for clients to execute certain flows to see if they work as expected. Have a validation cycle of 24 hours or more, to give scheduled jobs time to execute. Deploy on a Friday, and have clients start reporting problems on Monday.
  • Maximize time to recovery: Do not prepare a rollback plan. Rely on roll-forward policies: detect problem, implement fix, deploy fix. Leave the system unusable or broken until you deploy the fix. Ask clients to be patient while you are busy working on the fix. Work late nights and weekends to complete the fix faster.

Strategy 2: Prepare for failure. Maximize client experience.

  • Minimize blast radius: Deploy as few changes as possible to as few components as possible. Deploy often. Deploy to a subset of production instances or clients if possible. Minimize the set of clients affected when a problem occurs.
  • Minimize detection time: Implement end-to-end tests and other automated mechanisms to validate changes. Automatically validate both common and less common flows. Instrument your system with failure rates and alert on abnormal behavior. Detect problems before your clients do.
  • Minimize time to recovery: Have quick rollback procedures in place. Instantly roll back a problematic change. Fix and test outside production. Redeploy. Minimize the time your systems remain in a bad state.
  • Keep the client happy: Deploy small. Deploy often. Validate fast. Roll back fast.
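
To make Strategy 2 concrete, here is a minimal Python sketch of a staged rollout: deploy to a small subset of instances first, validate automatically after each stage, and roll back every touched instance as soon as a check fails. Everything in it is an assumption made for illustration: the names `Instance`, `staged_rollout`, `is_healthy`, and the stage fractions are hypothetical and do not come from any specific deployment tool.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Instance:
    """Hypothetical stand-in for one production instance."""
    name: str
    version: str


def staged_rollout(
    instances: List[Instance],
    new_version: str,
    is_healthy: Callable[[Instance], bool],
    stages: Sequence[float] = (0.1, 0.5, 1.0),
) -> bool:
    """Deploy new_version in growing stages; roll back on the first failed check.

    Returns True if every stage passed validation, False if a rollback happened.
    """
    # Remember what each instance was running so rollback is a simple restore.
    previous = {inst.name: inst.version for inst in instances}
    updated: List[Instance] = []

    for fraction in stages:
        # Minimize blast radius: only grow the updated set up to this stage's share.
        target = max(1, int(len(instances) * fraction))
        for inst in instances[len(updated):target]:
            inst.version = new_version
            updated.append(inst)

        # Minimize detection time: validate automatically after every stage.
        if not all(is_healthy(inst) for inst in updated):
            # Minimize time to recovery: restore every instance touched so far.
            for inst in updated:
                inst.version = previous[inst.name]
            return False

    return True


if __name__ == "__main__":
    fleet = [Instance(f"i{n}", "v1") for n in range(10)]
    # Assumed health check: pretend the new version breaks on instance "i3".
    ok = staged_rollout(fleet, "v2", lambda inst: not (inst.name == "i3" and inst.version == "v2"))
    print(ok, sorted({inst.version for inst in fleet}))
```

In this sketch the failing instance is reached in the second stage, so half of the fleet never receives the bad version, and the other half is restored immediately. A real health check would call end-to-end tests or query failure-rate metrics rather than a lambda.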


Daniel Moldovan

Wearing the Site Reliability Engineer and Software Development Engineer hats. Having fun with very large systems.