[AUTOMATION] Scale operations with people or with code?

The degree of automation required to run a production system is influenced by many factors, such as its scale, domain, or client requirements.

As the system’s scale or complexity grows, operating it becomes more challenging. Things that used to almost never happen become a daily occurrence. If a transaction fails 1 in 10,000 times, and you have 100,000 transactions per day, you have 10 failures a day. This was not a problem when you had 1,000 transactions per day and saw roughly one failure every ten days.
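The arithmetic above can be checked with a short sketch. The function name `expected_daily_failures` and the constant `FAILURE_RATE` are illustrative only, and the model assumes every transaction fails independently with the same probability.

```python
from fractions import Fraction


def expected_daily_failures(transactions_per_day: int, failure_rate: Fraction) -> Fraction:
    """Expected number of failed transactions per day at a fixed failure rate."""
    return transactions_per_day * failure_rate


# Assumption taken from the text: 1 failure in 10,000 transactions.
FAILURE_RATE = Fraction(1, 10_000)

for volume in (1_000, 100_000):
    failures = expected_daily_failures(volume, FAILURE_RATE)
    print(f"{volume} transactions/day: {float(failures)} expected failures/day")
```

Exact fractions are used instead of floats so the result is not blurred by rounding. The failure rate never changes; only the volume does, and that alone turns a rare event into a daily one.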

How do you scale your operations team to handle the increase in system scale and problems? What strategies ensure you maintain quality of service and client experience as the scale grows?

Strategy 1: Scaling operations with people

  • Human-based scaling: You have a traffic increase? Alert the on-call person to eyeball the situation and scale up/out by hand. Forget or be afraid to scale down. Wonder at your infrastructure costs.
  • Human-based recovery: Have detailed wikis with manual operations and big warnings for potential problems. Expect a stressed on-call person trying to recover from an incident to follow them without making a mistake. Wonder why mistakes get made.
  • Linear relationship between system size and operations team size: Assign one person per 100 customers, one person per 1,000 component instances, etc. Wonder at your team size and the length of your team meetings.
  • All hands on deck for each incident: If it takes 2 minutes to manually scale one of your clusters, and you need to scale 180, you have 6 hours of work to distribute. Call all team members to help you recover faster. Wonder why team members feel burned out.
  • Adopt and love toil: Ensure each team member does the same thing manually every day. Grow a team of people who cannot improve because they are always busy babysitting your system. Wonder at your low team morale.
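To make the "all hands on deck" arithmetic concrete, here is a rough sketch of how manual recovery time depends on the number of responders. The function `manual_recovery_minutes` is hypothetical, and the model assumes each responder works through clusters one at a time with no coordination overhead, which is optimistic.

```python
import math


def manual_recovery_minutes(clusters: int, minutes_per_cluster: int, responders: int) -> int:
    """Wall-clock minutes to scale all clusters by hand.

    Work is split as evenly as possible; the responder with the most
    clusters determines when the incident is over.
    """
    if responders < 1:
        raise ValueError("at least one responder is required")
    clusters_per_responder = math.ceil(clusters / responders)
    return clusters_per_responder * minutes_per_cluster


# Figures from the text: 180 clusters at 2 minutes each.
for team_size in (1, 6, 12):
    minutes = manual_recovery_minutes(180, 2, team_size)
    print(f"{team_size} responder(s): {minutes} minutes")
```

Even under these generous assumptions, cutting the recovery time to one hour takes six people working in parallel, which is exactly the pattern that wears a team out.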

Strategy 2: Scaling operations with code

  • Automatic load balancing and scaling: Build automation to scale infrastructure up/out and back down/in, within reasonable limits. Notify the on-call person only in exceptional cases, such as when scaling limits are reached or scaling fails. Wonder at how fast you can adjust to traffic variations.
  • Auto-recover: Build automation to handle recovery and remediation scenarios. Treat the on-call person as a supervisor who handles only the cases not covered by automation and recovers from automation failures. Wonder at your low time to recovery from incidents.
  • Scale automation with your system: As your scale grows, automation needs become more complex, and so does the development skill required of the team writing the automation. Grow an engineering team focused on operational excellence. Wonder at your high quality of service.
  • Use people to bring business value, not babysit the system: Understand the business and the domain. Determine where client experience or service quality needs improvement. Use your team to implement those improvements. Wonder at your high client satisfaction.
  • Treat automation code as application code: Design automation. Develop automation. Test automation. Trust automation.
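As one possible illustration of scaling with limits and paging only in exceptional cases, here is a minimal sketch. All names (`ScalingPolicy`, `ScalingDecision`, `decide_scaling`, `apply_scaling`) and the `resize` callback are hypothetical and not tied to any real cloud API. The sizing rule, load divided by a per-instance target, is an assumption chosen for simplicity.

```python
import math
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ScalingPolicy:
    min_instances: int
    max_instances: int
    target_load_per_instance: float  # e.g. requests/second one instance should serve


@dataclass(frozen=True)
class ScalingDecision:
    desired_instances: int
    page_on_call: bool
    reason: str


def decide_scaling(current_load: float, policy: ScalingPolicy) -> ScalingDecision:
    """Pick an instance count for the current load, clamped to the policy limits."""
    needed = math.ceil(current_load / policy.target_load_per_instance)
    if needed > policy.max_instances:
        # Exceptional case: the load exceeds what automation is allowed to provision.
        return ScalingDecision(policy.max_instances, True, "scaling limit reached")
    # Scaling down is automatic too, but never below the configured floor.
    return ScalingDecision(max(needed, policy.min_instances), False, "within limits")


def apply_scaling(decision: ScalingDecision, resize: Callable[[int], None]) -> bool:
    """Apply the decision via `resize`; return True if the on-call person should be paged."""
    try:
        resize(decision.desired_instances)
    except Exception:
        # Exceptional case: the automation itself failed, so a human supervises.
        return True
    return decision.page_on_call
```

Because the decision logic is a pure function separate from the side-effecting `resize` call, it can be designed, reviewed, and unit tested like any other application code, in line with the last point above.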