[AUTO-REMEDIATION] How to automate operations in large-scale systems?

Daniel Moldovan
4 min read · Jun 11, 2021
[Image: Car taking a sharp turn to exit a highway. Text in the lower part of the image reads "Automating operations". A traffic sign at the top reads "IF THEN ELSE" for going forward, and "ML go brrr" for the exit.]

You build a great product. You offer it as a service. Your business grows. Your IT systems grow with it. Recurrent problems start appearing. Things that cannot be solved easily or cost-effectively from application code. Corner cases. Programming language limitations. Life happens.

What do you do? You deploy alerting. You call humans to fix things. To restart stuck processes. To scale infrastructure. It works great for a while. Alert fires, human fixes problem. And your infrastructure keeps growing. Things start to break constantly. Humans can no longer react in time to all the issues. Client experience starts to degrade. Clients start to notice. What do you do?

You consider automating the repetitive manual work needed to keep your IT systems running. But how to automate? What approach to use?

Strategy 1: Yeah, machine learning, cool

  • Pick the hottest ML framework: If you don’t know what you are looking for, the best bet is to use machine learning to search for it. You don’t know when to trigger a particular action? The latest ML framework can find out. It’s not called unsupervised learning for nothing. Wonder why you can’t seem to trust your automation.
  • Neural networks are cool: They can predict everything. Feed all your metrics into neural networks. Run predictions. You just need a machine with 196 GB of RAM to analyze a single metric in 30 minutes. Wonder why you can’t get analysis results in time for all system components.
  • Ignore context: Focus on single signals or metrics when deciding on auto-remediation actions. When one metric indicates a problem, automatically execute remediation actions. If the metric is spiking, execute those actions again and again. Wonder why you get a flood of auto-remediation actions when a widespread issue impacts your system.
  • Build auto-remediation as complex as application code: Automation should cover everything. Implement very complex auto-remediation flows. Use complex machine learning frameworks for determining the proper auto-remediation actions. Build complex dependency graphs between your auto-remediation actions. Wonder why your automation is hard to debug. Realize you might need some automation for your automation.

Strategy 2: Explainable. Context-aware. Reliable. Safe.

  • Explainable: Pick a machine learning approach that is explainable. It should make it easy to debug and validate things, which increases trust in your system. Rule-based expert systems can be a great choice. You have the “expert knowledge”. From all documented standard operating procedures. You know the rules. You know what to do and when. Automate that. In this way, when automation does something, you know exactly why. Bugs are easy to track down and fix. Wonder how fast your team is iterating and automating more things.
  • Context-aware: Build your automation context-aware. Metrics from upstream/downstream or third-party components provide great context. Context is crucial in ensuring complex events don’t get overlooked. Such as widespread issues, or issues caused by upstream/downstream components. Do you have automation to scale your system based on throughput? Then capture as context if the underlying storage is slow or not. There is no point in scaling up/out your processing components if you have slow storage. Wonder how well your automation reacts to problems.
  • Simple: Anyone in your operations team should be able to extend and improve the automation. Any new automation feature should be easy to implement without advanced programming knowledge. It should be simple in terms of not only code, but also the accompanying technology stack. Do you need persistence? SQLite can work just fine. Wonder how easy it is to test, improve, and debug your automation.
  • Reliable: Software that automates operations needs to gracefully handle any type of failure. It needs to recover from failures and continue executing. Choose fault-tolerant execution environments. Avoid long-lived processes; they bring extra problems: stuck processes, memory leaks, etc. A simple scheduled job can provide excellent fault tolerance: if the job fails, it restarts at the next scheduled run. Record each action whose execution has started, so it is not re-executed after an automation crash. Wonder how maintenance-free your automation is.
  • Guardrails: You want to control when and which automated actions are executed. Provide guardrails limiting the automation to tested scenarios. Ensure the guardrails stop the automation if there are too many problems, or if the issue is too severe. Automation that covers 80% of the normal cases, leaving 20% to a human, is still extremely useful. And guardrails ensure the automation does not mess up your system even more. Provide controls to disable the automation. Allow automation to be enabled/disabled for particular customers or critical components. Allow a human to disable your automation during an emergency or unforeseen event. Wonder how much your team trusts your automation.
  • Cooldown: You might not want certain actions executed too often, like scaling up/down. You might not want to run certain expensive analyses too often, like complex predictions. Implement a storage mechanism recording when each remediation action or analysis process last ran. Use it to enforce a “cooldown” period in which an action or analysis is not repeated. Allow users to configure the cooldown individually for each action and analysis process. This lets you safely deploy expensive auto-remediation actions, such as infrastructure scaling. Wonder how stable your automation is.
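To make the “explainable, context-aware, guarded” idea concrete, here is a minimal sketch of a rule-based remediation planner. All names, metrics, and thresholds (queue_depth, storage_latency_ms, the action identifiers, the guardrail limit) are illustrative assumptions, not part of the original article:

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical metric snapshot; field names and thresholds are illustrative.
@dataclass
class Context:
    queue_depth: int
    storage_latency_ms: float

@dataclass
class Rule:
    name: str
    condition: Callable[[Context], bool]
    action: str  # identifier of the remediation to run

MAX_ACTIONS_PER_CYCLE = 3  # guardrail: stop automating if too many rules fire at once

RULES = [
    # Scale out only when processing lags AND storage is healthy (context check):
    # no point scaling processing components on top of slow storage.
    Rule("scale_out_workers",
         lambda c: c.queue_depth > 1000 and c.storage_latency_ms < 50,
         "scale_out"),
    # If storage itself is slow, alert a human instead of scaling blindly.
    Rule("storage_slow_alert",
         lambda c: c.storage_latency_ms >= 50,
         "alert_human"),
]

def plan_actions(ctx: Context) -> list[str]:
    fired = [r.action for r in RULES if r.condition(ctx)]
    if len(fired) > MAX_ACTIONS_PER_CYCLE:
        # Too many rules firing suggests a widespread issue: defer to a human.
        return ["alert_human"]
    return fired

# A backlog with healthy storage triggers scaling;
# the same backlog with slow storage pages a human instead.
print(plan_actions(Context(queue_depth=5000, storage_latency_ms=10)))   # ['scale_out']
print(plan_actions(Context(queue_depth=5000, storage_latency_ms=120)))  # ['alert_human']
```

Because each rule is a named condition/action pair taken straight from a standard operating procedure, the answer to “why did the automation do that?” is always a specific rule name.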
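The cooldown bullet can be sketched with the SQLite persistence mentioned above. The table schema, column names, and helper functions below are assumptions for illustration:

```python
import sqlite3
import time

def make_db(path: str = ":memory:") -> sqlite3.Connection:
    # One row per action, remembering when it last ran.
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS action_log (
                    action   TEXT PRIMARY KEY,
                    last_run REAL NOT NULL)""")
    return db

def should_run(db: sqlite3.Connection, action: str,
               cooldown_s: float, now: float | None = None) -> bool:
    """Return True (and record the run) if the action is outside its cooldown."""
    now = time.time() if now is None else now
    row = db.execute("SELECT last_run FROM action_log WHERE action = ?",
                     (action,)).fetchone()
    if row is not None and now - row[0] < cooldown_s:
        return False  # still cooling down: skip this execution
    # Upsert the new execution timestamp (SQLite >= 3.24 syntax).
    db.execute("INSERT INTO action_log(action, last_run) VALUES(?, ?) "
               "ON CONFLICT(action) DO UPDATE SET last_run = excluded.last_run",
               (action, now))
    db.commit()
    return True

db = make_db()
print(should_run(db, "scale_out", cooldown_s=600, now=1000.0))  # True: first run
print(should_run(db, "scale_out", cooldown_s=600, now=1300.0))  # False: within cooldown
print(should_run(db, "scale_out", cooldown_s=600, now=1700.0))  # True: cooldown elapsed
```

The same table doubles as the “record each started action” mechanism from the Reliable bullet: a crashed run leaves its timestamp behind, so the next scheduled job will not immediately re-fire the same action.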


Daniel Moldovan

Wearing the Site Reliability Engineer and Software Development Engineer hats. Having fun with very large systems.