Anatoly Dyatlov in the Chernobyl miniseries: “Response time is one second. Not great, not terrible.”

[MONITORING] What to monitor? Where? When?

You build a great product. You offer it as a service. You define quality and performance Service Level Agreements (SLAs) for your clients.

Now comes the moment to define your monitoring strategy. What metrics should you collect from your system? From where? At what cadence?

Strategy 1: Yeah, monitoring, I don’t know

  • Use client-based monitoring: No real need to monitor anything. If something does not function properly, your clients will surely let you know. Reduces development costs. Increases client heart rate on occasion.
  • Monitor only system-level metrics: It all boils down to resource congestion, no need to monitor anything else. If your system is slow, check if CPU or DISK are congested. If your system returns a lot of errors, check if CPU or DISK are congested. Wonder why your clients catch more problems than your monitoring.
  • Monitor only application-level metrics: Track application response times and error rates. No need to monitor infrastructure or external services. If your system returns a lot of errors or starts to slow down, start digging into error logs, or check local metrics from individual infrastructure components. Wonder why your recovery time is high.
  • Monitor all the things: Collect all metrics you can get your hands on, from any system level, with any granularity. Focus first on collecting everything. Determine metric relevance later. Generate a metric flood. Wonder why your monitoring system is slow or expensive to maintain.
  • Do not aggregate metrics: Each metric is independent. CPU metrics are labeled only with infrastructure information. Response time is labeled with the API endpoint. Let humans piece together the links between CPU and response time for a particular endpoint. Wonder why you can’t increase your system’s automation or implement auto-remediation.

Strategy 2: Monitoring for observability

  • SLA, SLO, SLI: Define the Service Level Objectives (SLOs) needed to fulfill your SLA. From the SLOs, extract your required set of Service Level Indicators (SLIs). Tailor monitoring around these SLIs. Does your SLA require 99.99% uptime? Then 99.99% uptime is your SLO. And your SLIs are the metrics tracking uptime. Not all metrics should be SLIs. Identify which metrics actually matter, and focus on collecting those effectively.
  • Collect business-level metrics: Collect metrics such as transactions, purchases, and end-to-end business flows. They provide an overall view of system behavior and help assess business impact if anything goes wrong.
  • Collect end-to-end tests’ results: Implement end-to-end tests capturing as many business flows as possible. Run them continuously or as often as possible. Monitor their results. Use them as early indicators for any problems with your system.
  • Collect application-level performance and quality metrics: Collect metrics like response time, latency, and error rate. Aggregate them across service instances. Extract percentiles from these metrics, such as the 95th percentile. Rely on them to understand the distribution of performance and quality across requests.
  • Collect metrics about interactions with external services: An external service can be just another system component, or a service offered by a third-party entity. Collect metrics such as response times and error rates for communication with external services. They are important for detecting degradation in the performance and quality those services provide.
  • Collect infrastructure utilization and performance metrics: Collect metrics that help you quickly identify infrastructure congestion, like CPU usage, DISK usage and latency, MEMORY usage, NETWORK usage and packet loss.
  • Aggregate metrics across monitoring layers: Correlate application-level metrics with business-level and infrastructure-level metrics. Label metrics with their source component and layer to enable aggregation. Aggregating metrics across layers can help you understand problems more quickly and alert on more complex events. It’s usually more useful to alert on “response time is high and CPU usage is high on service X”, instead of the simpler “response time is high on service X”, or the noisier and usually less useful “CPU usage is high”.

Wearing the Site Reliability Engineer and Software Development Engineer hats. Having fun with very large systems.
