You build a great product. You offer it as a service. You define quality and performance Service Level Agreements (SLAs) for your clients. You deploy monitoring.
And now you want to define alerting. To choose relevant metrics. To establish thresholds. In short, to decide when your system needs manual intervention and you must alert the on-call team.
What metrics to choose for alerting? How many alerts to add? When to alert?
You build a great product. You offer it as a service. Your business grows. Your IT systems grow with it. Recurrent problems start appearing. Things that cannot be solved easily or cost-effectively from application code. Corner cases. Programming language limitations. Life happens.
What do you do? You deploy alerting. You call humans to fix things. To restart stuck processes. To scale infrastructure. It works great for a while. Alert fires, human fixes problem. And your infrastructure keeps growing. Things start to break constantly. Humans can no longer react in time to all issues. Client experience starts to degrade. Clients start to…
You build a great product. You offer it as a service. You define quality and performance Service Level Agreements (SLAs) for your clients. You deploy monitoring to track Service Level Indicators to ensure you fulfill your SLAs.
And now you want to create monitoring dashboards. To visualize the metrics you collect and understand your product’s behavior. How should you do that? What dashboards should you create? What metrics should you add to each dashboard?
You build a great product. You offer it as a service. You define quality and performance Service Level Agreements (SLAs) for your clients.
And now comes the moment for defining your monitoring strategy. What metrics should you collect from your system? From where? At what cadence?
Strategy 1: Yeah, monitoring, I don’t know
You build a great product. You offer it as a service. You define quality and performance Service Level Agreements (SLAs) for your clients. You define Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to ensure you fulfill your SLAs. You deploy monitoring. You deploy automation.
And now comes the moment for defining your alerting strategy. When do you need to call an engineer to manually intervene? What is really a problem for your system? What metrics and values should you alert on? As always, one can take several approaches.
Strategy 1: Alert on all the things and optimize…
The degree of automation required to run a production system is influenced by many factors, such as its scale, domain, or client requirements.
As the system’s scale or complexity grows, operating it becomes more challenging. Things which used to almost never happen become a daily occurrence. If a transaction fails 1 in 10,000 times, and you have 100,000 transactions per day, you have 10 failures a day. This was not a problem when you had 1,000 transactions per day.
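To make the arithmetic concrete, here is a minimal sketch using the illustrative numbers above (the failure rate and transaction volumes are taken from the example, not from any real system):

```python
# Expected daily failures = failure rate x daily transaction volume.
# Numbers below are the illustrative values from the text, not real data.
failure_rate = 1 / 10_000  # one failure per 10,000 transactions

for transactions_per_day in (1_000, 100_000):
    expected_failures = failure_rate * transactions_per_day
    print(f"{transactions_per_day:>7} tx/day -> ~{expected_failures:g} expected failures/day")
```

Running it shows roughly 0.1 failures a day at 1,000 transactions, versus 10 a day at 100,000: the same failure rate, but an operational problem only at the larger scale.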
How do you scale your operations team to handle the increase in system scale and problems? …
A new feature or bug-fix needs to be deployed to production for it to bring business value and/or improve client experience. This means that a feature is complete only after it is running in production. Of course, how and when to deploy changes in production depends on many things, such as organization culture, domain, or application architecture.
We all assume that a deployment never fails, a bug never makes it into production, and in general, that releases complete without problems. But what if they don’t? What if we should prepare for failure? What if we should prepare to minimize client…
There are multiple strategies for providing visibility into the health, performance, or quality of your system or application. The visibility needed to quickly detect and fix issues that negatively impact your clients. Visibility obtained through a combination of alerting, monitoring, and logging.
As life is always about compromise, production systems and applications usually end up somewhere between the two strategies below. I prefer the latter, but things often lean towards the former.
Wearing the Site Reliability Engineer and Software Development Engineer hats. Having fun with very large systems.