Stressed person sweating while choosing which button to press from two options: Alert needed, and Alert not needed

You build a great product. You offer it as a service. You define quality and performance Service Level Agreements (SLAs) for your clients. You deploy monitoring.

And now you want to define alerting. To choose relevant metrics. To establish thresholds. Simply, to decide when your system needs manual intervention and you must alert the on-call team.

What metrics to choose for alerting? How many alerts to add? When to alert?

Strategy 1: Alert on everything. We love alerts.


Car taking sharp turn to exit a highway. Text in lower part of image states “Automating operations”. A traffic sign at the top of the image states “IF THEN ELSE” for going forward, and “ML go brrr” if you exit the highway.

You build a great product. You offer it as a service. Your business grows. Your IT systems grow with it. Recurrent problems start appearing. Things that cannot be solved easily or cost-effectively from application code. Corner cases. Programming language limitations. Life happens.

What do you do? You deploy alerting. You call humans to fix things. To restart stuck processes. To scale infrastructure. It works great for a while. Alert fires, human fixes problem. And your infrastructure keeps growing. Things start to break constantly. Humans can no longer react in time to all issues. Client experience starts to degrade. Clients start to…


Morgan Freeman staring at a wall with many small pictures.

You build a great product. You offer it as a service. You define quality and performance Service Level Agreements (SLAs) for your clients. You deploy monitoring to track Service Level Indicators (SLIs) to ensure you fulfill your SLAs.

And now you want to create monitoring dashboards. To visualize the metrics you collect and understand your product’s behavior. How should you do that? What dashboards to create? What metrics should you add in each dashboard?

Strategy 1: Yeah, I do not know. We have metrics, we plot metrics


Anatoly Dyatlov in Chernobyl Miniseries stating: “Response time is one second. Not great not terrible”

You build a great product. You offer it as a service. You define quality and performance Service Level Agreements (SLAs) for your clients.

And now comes the moment for defining your monitoring strategy. What metrics should you collect from your system? From where? At what cadence?

Strategy 1: Yeah, monitoring, I don’t know


Toy Story’s Buzz Lightyear stating: ALERTS: ALERTS EVERYWHERE

You build a great product. You offer it as a service. You define quality and performance Service Level Agreements (SLAs) for your clients. You define Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to ensure you fulfill your SLAs. You deploy monitoring. You deploy automation.

And now comes the moment for defining your alerting strategy. When do you need to call an engineer to manually intervene? What is really a problem for your system? On what metrics and values should you alert? As always, one can take several approaches.

Strategy 1: Alert on all the things and optimize…


The degree of automation required to run a production system is influenced by many factors, such as its scale, domain, or client requirements.

As the system’s scale or complexity grows, operating it becomes more challenging. Things that used to almost never happen become a daily occurrence. If a transaction fails 1 in 10,000 times, and you have 100,000 transactions per day, you have 10 failures a day. This was not a problem when you had 1,000 transactions per day, which works out to roughly one failure every 10 days.
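The arithmetic behind that example can be sketched in a few lines of Python. This is only an illustration, assuming a constant per-transaction failure probability; the function and constant names are hypothetical and not taken from any article listed here.

```python
def expected_daily_failures(failure_rate: float, transactions_per_day: int) -> float:
    """Expected number of failed transactions per day.

    Assumes every transaction fails independently with the same probability.
    """
    return failure_rate * transactions_per_day


# Hypothetical constant: one failure in every 10,000 transactions.
FAILURE_RATE = 1 / 10_000

if __name__ == "__main__":
    for volume in (1_000, 100_000):
        failures = expected_daily_failures(FAILURE_RATE, volume)
        print(f"{volume} transactions/day: {failures:.1f} expected failures/day")
```

The failure rate never changes between the two cases. Only the volume does, which is why a problem that was negligible at small scale turns into daily operational work.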

How do you scale your operations team to handle the increase in system scale and problems? …


When you bundle everything in one deploy, and something fails.

A new feature or bug fix needs to be deployed to production for it to bring business value and/or improve client experience. This means that a feature is complete only after it is running in production. Of course, how and when to deploy changes to production depends on many things, such as organization culture, domain, or application architecture.

We all assume that a deployment never fails, a bug never makes it into production, and in general, that releases complete without problems. But what if they don’t? What if we should prepare for failure? What if we should prepare to minimize client…


When you monitor with logs and have to compute error rate

There are multiple strategies for providing visibility into the health, performance, or quality of your system/application. Visibility needed to quickly detect and fix issues negatively impacting your clients. Visibility obtained through a combination of alerting, monitoring, and logging.

As life is always about compromise, production systems/applications usually choose something between the two strategies below. I prefer the latter, but many times things lean towards the former.

Strategy 1: Whatever, LOL

Daniel Moldovan

Wearing the Site Reliability Engineer and Software Development Engineer hats. Having fun with very large systems.
