[ALERTING] When to notify people
You build a great product. You offer it as a service. You define quality and performance Service Level Agreements (SLAs) for your clients. You define Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to ensure you fulfill your SLAs. You deploy monitoring. You deploy automation.
And now comes the moment to define your alerting strategy. When do you need to call an engineer to intervene manually? What really constitutes a problem for your system? Which metrics and values should you alert on? As always, one can take several approaches.
Strategy 1: Alert on all the things and optimize for busy work
- Alert on all metrics: Alert when CPU usage > 80%. Alert when error count > 10. Alert when your queue size > 100. Alert when disk latency > 1s. Alert when there are no requests in the last 5 minutes. Alert that your process or container restarted. Alert that an autoscaling event took place. Wonder why important alerts get missed.
- Alert on instant values for spiky metrics: Alert and page the on-call engineer at 7 AM on a Sunday that CPU usage is at 99%. CPU usage drops back to 50% by the time the on-call team reacts. Alert again after 30 minutes when CPU is back at 99%. Wonder why people hate your scheduled jobs.
- Use fixed alerting thresholds: Do you have 1 TB of storage? Alert when 900 GB are used. Add extra storage and forget to update the alerting. Alert the on-call engineer at 7 AM on a Sunday that 900 GB out of 5 TB of storage are used (see the disk sketch after this list). Wonder why people hate on-call.
- No metric aggregation: Do you have 10 VMs? Get 10 separate CPU usage alerts every time load on your system increases. Do you have a service connecting to a database, horizontally scaled to 10 instances? Get 10 database connection alerts when a database issue occurs, one for each service instance. Wonder why engineers avoid on-call.
- Notify for all alerts: Notify for WARNING alerts, notify for CRITICAL alerts, notify for UNKNOWN alerts. Notify for development and staging environments. Keep the on-call engineers on their toes with constant notifications. Wonder why the on-call engineers get desensitized and stop reacting to alerts.
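To make the fixed-threshold pitfall concrete, here is a minimal Python sketch. The metric values are hypothetical readings for a single VM, not pulled from any real monitoring API; the point is only that an absolute threshold written for a 1 TB disk keeps firing after the disk grows to 5 TB, while a relative threshold keeps its intent.

```python
# Hypothetical metric readings for one VM; in a real system these would
# come from your monitoring backend.
disk_used_bytes = 920 * 1024**3   # 920 GB currently used
disk_total_bytes = 5 * 1024**4    # capacity was grown from 1 TB to 5 TB

# Strategy 1: fixed threshold, written back when the disk was 1 TB.
# Still fires at 920 GB even though the disk is now only ~18% full.
if disk_used_bytes > 900 * 1024**3:
    print("CRITICAL: disk almost full")   # pages someone at 7 AM for nothing

# Relative threshold: survives capacity changes without a config update.
if disk_used_bytes / disk_total_bytes > 0.90:
    print("CRITICAL: disk more than 90% full")
```

The same idea applies to the other fixed thresholds above: express them relative to capacity, baseline, or request volume rather than as absolute numbers.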
Strategy 2: Alert on the important things and optimize for silence
- Alert on Service Level Indicators: Identify the minimum set of metrics across all your system layers that captures the system’s quality and performance. Alert when those metrics indicate that you are breaching your SLA. Start with metrics closer to the client and work downwards. Alert on the client API error rate rather than the DB connectivity error rate. Monitor API response time percentiles and alert when the 95th percentile is over your SLA (see the latency sketch after this list). When the on-call engineer gets notified, there is a 90% chance there is a real problem. Wonder at the quick reaction time to pages.
- Aggregate metrics and capture context: Aggregate metrics from different system layers for more accurate alerting. E.g., alert on "response time high and database timings high" rather than just "response time high". For each alert, include additional text-based context in the notification: percentiles, monitoring snapshots, trends. Wonder how your engineers hit the ground running when reacting to alerts.
- Tune alerting levels according to your SLA: Set as CRITICAL or Sev 1 only the alerts that indicate client impact. For Sev 1 alerts it is expected that the on-call engineer gets notified and reacts immediately. If something is off but does not yet negatively impact the customer, set it as WARNING or Sev 2 and expect that someone will look at it within a couple of hours. Use a less disruptive notification channel for Sev 2 alerts, such as email. Align alerting levels with your SLA: if you target 99.99% availability, you probably need to set as CRITICAL alerts that can remain WARNING for a 99.9% system.
- Optimize for silence: An engineer reacting to a false-positive alert still spends time validating and monitoring the alert condition. To reduce false positives, alert on metrics aggregated over time and on rates of increase or decrease. Deploy new alerts as WARNING, monitor their firing rates, and adjust their thresholds. Promote them to CRITICAL when you are 90% certain that they will fire only when a production issue occurs (see the promotion sketch after this list). Wonder at the light on-call shifts and high team satisfaction.
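As an illustration of the percentile-based SLI alerting and the context-rich notifications described above, here is a minimal Python sketch. The SLA target, the window of response-time samples, and the returned alert structure are assumptions made for the example, not part of any particular monitoring tool.

```python
import math
import statistics

SLA_P95_SECONDS = 0.5  # assumed SLA: 95% of requests complete within 500 ms

def p95(samples):
    """Nearest-rank 95th percentile of a non-empty window of timings (seconds)."""
    ordered = sorted(samples)
    return ordered[math.ceil(0.95 * len(ordered)) - 1]

def evaluate_latency_sli(api_timings, db_timings):
    """Alert on the client-facing SLI; attach lower-layer context to the alert."""
    observed = p95(api_timings)
    if observed <= SLA_P95_SECONDS:
        return None  # stay silent: no client impact
    return {
        "severity": "CRITICAL",
        "summary": "API p95 response time over SLA",
        "context": {
            "p95_seconds": round(observed, 3),
            "p50_seconds": round(statistics.median(api_timings), 3),
            "db_p95_seconds": round(p95(db_timings), 3),  # hint: is the DB the culprit?
            "samples": len(api_timings),
        },
    }
```

Called with the last few minutes of API and database timings, the function stays silent while the SLA holds and, when it does not, returns a single CRITICAL alert already carrying the percentiles and database timings the responder needs.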
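The severity routing and the warning-first rollout can be sketched the same way. The channel names, the Sev 1/Sev 2 mapping, and the promotion criterion (at least ten firings with roughly 90% of them matching real issues) are illustrative assumptions, not a prescribed policy.

```python
from dataclasses import dataclass

# Assumed routing: only client-impacting alerts page a human;
# everything else goes to a less disruptive channel.
CHANNEL_BY_SEVERITY = {
    "CRITICAL": "pager",  # Sev 1: immediate reaction expected
    "WARNING": "email",   # Sev 2: look at it within a couple of hours
}

@dataclass
class AlertRule:
    name: str
    severity: str = "WARNING"   # new rules start as WARNING
    firings: int = 0
    confirmed_issues: int = 0   # firings that corresponded to a real production issue

    def record_firing(self, was_real_issue: bool) -> None:
        self.firings += 1
        if was_real_issue:
            self.confirmed_issues += 1

    def maybe_promote(self, min_firings: int = 10, min_precision: float = 0.9) -> None:
        """Promote to CRITICAL once ~90% of observed firings were real issues."""
        if self.firings >= min_firings and self.confirmed_issues / self.firings >= min_precision:
            self.severity = "CRITICAL"

def notification_channel(rule: AlertRule) -> str:
    return CHANNEL_BY_SEVERITY.get(rule.severity, "email")
```

A new rule fires as a low-disruption WARNING for a while; the team records whether each firing matched a real issue, and only when the observed precision clears the bar does the rule get promoted to CRITICAL and start paging the on-call engineer.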