SpongeBob SquarePants looking at a rainbow. Text states “code is complete. feature is done”

You need to implement a new feature. You research and select suitable technologies and approaches. You start coding. The required functionality is implemented. You can run and demo the feature locally on your computer.

Is that it? Is the feature done? Can you deploy it in production? Can you move to the next feature?

Of course, it depends. Is the feature part of a university homework assignment? Then probably yes. Is the feature part of a larger software-as-a-service product that clients pay for? Then probably no.

Why can’t you just write the code on your computer and…


Image showing Bear Grylls pointing with his finger. Text at the top of the image states “when the operations team is overloaded so you make everyone part of your operations team”. Text at the bottom of the image states “Improvise. Adapt. Overcome.”

You build a great product. You offer it as a service. You define quality and performance Service Level Agreements (SLAs) for your clients.

And now you need to operate it. To ensure that when things break, issues are fixed as quickly as possible. To ensure that your service is properly scaled for expected traffic patterns. Basically, to ensure the service is up and provides the best client experience.

You start looking into how to build a great team to operate your product. You know about operations. You have a team of sysadmins maintaining your current IT infrastructure containing code repositories…


Stressed person sweating while choosing which button to press from two options: “Alert needed” and “Alert not needed”

You build a great product. You offer it as a service. You define quality and performance Service Level Agreements (SLAs) for your clients. You deploy monitoring.

And now you want to define alerting. To choose relevant metrics. To establish thresholds. Simply put, to decide when your system needs manual intervention and you must alert the on-call team.

What metrics to choose for alerting? How many alerts to add? When to alert?

Strategy 1: Alert on everything. We love alerts.

  • Treat all your system’s components/services as pets: Each container or VM instance is important and needs love and gentle care. Alert on each VM restart. Alert when CPU usage on…
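
Even from this excerpt, the shape of the “pets” strategy is clear. Here is a minimal Python sketch of it: static per-instance thresholds where every instance pages a human on its own. The threshold values, metric names, and the `check_instance` helper are hypothetical illustrations, not a real monitoring API.

```python
# Hypothetical sketch: static thresholds on per-instance system metrics.
THRESHOLDS = {
    "cpu_percent": 80.0,
    "memory_percent": 90.0,
    "disk_percent": 85.0,
}

def check_instance(instance_id, metrics):
    """Return one alert for every metric above its static threshold."""
    alerts = []
    for name, limit in THRESHOLDS.items():
        value = metrics.get(name)
        if value is not None and value > limit:
            alerts.append(f"{instance_id}: {name}={value:.1f} > {limit}")
    return alerts

# Every VM or container is a "pet": each instance pages a human on its own.
fleet = {"vm-1": {"cpu_percent": 93.0}, "vm-2": {"disk_percent": 40.0}}
for instance_id, metrics in fleet.items():
    for alert in check_instance(instance_id, metrics):
        print("PAGE ON-CALL:", alert)
```

Multiply this by hundreds of instances and dozens of metrics, and the pager never stays quiet for long.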


Car taking a sharp turn to exit a highway. Text in the lower part of the image states “Automating operations”. A traffic sign at the top of the image states “IF THEN ELSE” for going straight ahead, and “ML go brrr” for exiting the highway.

You build a great product. You offer it as a service. Your business grows. Your IT systems grow with it. Recurrent problems start appearing. Things that cannot be solved easily or cost-effectively from application code. Corner cases. Programming language limitations. Life happens.

What do you do? You deploy alerting. You call humans to fix things. To restart stuck processes. To scale infrastructure. It works great for a while. Alert fires, human fixes problem. And your infrastructure keeps growing. Things start to break constantly. Humans can no longer react in time to all issues. Client experience starts to degrade. Clients start to…
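
A minimal sketch of where this leads, the “IF THEN ELSE” lane from the image above: rule-based auto-remediation that handles the known failure modes automatically and pages a human only for the rest. The helpers (`restart_service`, `add_capacity`, `page_oncall`) are hypothetical stand-ins for orchestrator and paging APIs.

```python
# Hypothetical sketch: rule-based ("IF THEN ELSE") auto-remediation.
def restart_service(service):
    return f"restarted {service}"

def add_capacity(service, instances):
    return f"scaled {service} by +{instances} instances"

def page_oncall(alert):
    return f"paged on-call for {alert['type']}"

def remediate(alert):
    """Map a firing alert to an automated action, falling back to a human."""
    if alert["type"] == "process_stuck":
        return restart_service(alert["service"])
    elif alert["type"] == "high_load":
        return add_capacity(alert["service"], instances=2)
    else:
        return page_oncall(alert)  # unknown problem: humans still needed

print(remediate({"type": "process_stuck", "service": "checkout"}))
```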


Morgan Freeman staring at a wall with many small pictures.

You build a great product. You offer it as a service. You define quality and performance Service Level Agreements (SLAs) for your clients. You deploy monitoring to track Service Level Indicators to ensure you fulfill your SLAs.

And now you want to create monitoring dashboards. To visualize the metrics you collect and understand your product’s behavior. How should you do that? What dashboards should you create? What metrics should you add to each dashboard?

Strategy 1: Yeah, I do not know. We have metrics, we plot metrics

  • All metrics, one dashboard: One image is worth 1 word, so you add 1000 small charts in one dashboard. Wonder why metrics or trends get missed.
  • No…


Anatoly Dyatlov in the Chernobyl miniseries stating: “Response time is one second. Not great, not terrible”

You build a great product. You offer it as a service. You define quality and performance Service Level Agreements (SLAs) for your clients.

And now comes the moment for defining your monitoring strategy. What metrics should you collect from your system? From where? At what cadence?

Strategy 1: Yeah, monitoring, I don’t know

  • Use client-based monitoring: No real need to monitor anything. If something does not function properly, your clients will surely let you know. Reduces development costs. Increases client heart rate on occasion.
  • Monitor only system-level metrics: It all boils down to resource congestion, no need to monitor anything…
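
Even before the excerpt cuts off, the contrast is visible: system-level metrics alone miss what clients actually experience. Here is a minimal sketch of an application-level alternative, recording per-request latency and deriving percentile indicators; the traffic loop and its latency distribution are simulated assumptions, not real instrumentation.

```python
# Hypothetical sketch: application-level latency monitoring.
import random

latencies_ms = []

def handle_request():
    # Simulated handler; a real service would time its actual requests.
    latencies_ms.append(max(random.gauss(200, 80), 0))

for _ in range(10_000):
    handle_request()

def percentile(values, pct):
    ordered = sorted(values)
    index = min(int(len(ordered) * pct / 100), len(ordered) - 1)
    return ordered[index]

print(f"p50: {percentile(latencies_ms, 50):.0f} ms")
print(f"p99: {percentile(latencies_ms, 99):.0f} ms")  # not great, not terrible?
```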


Toy Story’s Buzz Lightyear stating: “ALERTS, ALERTS EVERYWHERE”

You build a great product. You offer it as a service. You define quality and performance Service Level Agreements (SLAs) for your clients. You define Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to ensure you fulfill your SLAs. You deploy monitoring. You deploy automation.

And now comes the moment for defining your alerting strategy. When do you need to call an engineer to manually intervene? What is really a problem for your system? What metrics and values should you alert on? As always, one can take several approaches.

Strategy 1: Alert on all the things and optimize…
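
One common end state of “alert on all the things and optimize” is SLO-based alerting: page only when the error budget is burning too fast to ignore. A minimal sketch, assuming a 99.9% success SLO and a 14x fast-burn threshold; both numbers are illustrative choices, not prescriptions from the article.

```python
# Hypothetical sketch: error-budget burn-rate alerting against an SLO.
SLO_TARGET = 0.999             # 99.9% of requests must succeed
ERROR_BUDGET = 1 - SLO_TARGET  # 0.1% of requests may fail

def burn_rate(errors, requests):
    """How fast the observed error rate consumes the error budget."""
    if requests == 0:
        return 0.0
    return (errors / requests) / ERROR_BUDGET

observed = burn_rate(errors=150, requests=10_000)  # 1.5% errors -> ~15x burn
if observed >= 14:  # fast burn pages a human; slow burn can wait
    print(f"PAGE: burning error budget {observed:.0f}x too fast")
else:
    print(f"OK: burn rate {observed:.1f}")
```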


The degree of automation required to run a production system is influenced by many factors, such as its scale, domain, or client requirements.

As the system’s scale or complexity grows, operating it becomes more challenging. Things which used to almost never happen become a daily occurrence. If a transaction fails 1 in 10,000 times, and you have 100,000 transactions per day, you have 10 failures a day. This was not a problem when you had 1,000 transactions per day.

How do you scale your operations team to handle the increase in system scale and problems? …
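
The arithmetic above is worth making explicit: with a fixed per-transaction failure probability, the absolute number of daily failures grows linearly with traffic. A small illustration:

```python
# Expected daily failures at a fixed 1-in-10,000 failure rate.
FAILURE_RATE = 1 / 10_000

for transactions_per_day in (1_000, 100_000, 10_000_000):
    expected_failures = transactions_per_day * FAILURE_RATE
    print(f"{transactions_per_day:>10,} tx/day -> "
          f"{expected_failures:,.1f} expected failures/day")
```

What was a once-every-ten-days curiosity at 1,000 transactions per day becomes a constant background hum at scale.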


When you bundle everything into one deploy, and something fails.

A new feature or bug-fix needs to be deployed to production for it to bring business value and/or improve client experience. This means that a feature is complete only after it is running in production. Of course, how and when to deploy changes in production depends on many things, such as organizational culture, domain, or application architecture.

We all assume that a deployment never fails, a bug never makes it into production, and in general, that releases complete without problems. But what if they don’t? What if we should prepare for failure? What if we should prepare to minimize client…
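
Preparing for failure can be as simple as making rollback a first-class step of every release. A minimal sketch; `deploy`, `healthy`, and `rollback` are hypothetical placeholders for a real pipeline and real health checks.

```python
# Hypothetical sketch: a release that assumes it can fail.
def deploy(version):
    print(f"deploying {version}")

def healthy(version):
    # In practice: probe endpoints, compare error rates against a baseline.
    return version != "v2-bad"

def rollback(previous):
    print(f"rolling back to {previous}")

def release(new_version, previous):
    deploy(new_version)
    if not healthy(new_version):
        rollback(previous)  # don't assume a clean release; verify it

release("v2-bad", previous="v1")
```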


When you monitor with logs and have to compute the error rate

There are multiple strategies for providing visibility into the health, performance, or quality of your system/application. Visibility needed to quickly detect and fix issues negatively impacting your clients. Visibility obtained through a combination of alerting, monitoring, and logging.

As life is always about compromise, production systems/applications usually choose something between the two strategies below. I prefer the latter, but many times things lean towards the former.

Strategy 1: Whatever, LOL

  • Monitor using logs: Log each request and try to extract metrics using text parsing and regular expressions. Assess application health, performance, or quality from logs. Wonder at your regular expression skills. A sketch of this approach follows the list.
  • Monitor using alerts
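
What the first strategy looks like in practice: a minimal sketch that regex-parses raw access log lines into an error rate after the fact. The log format here (status code as a bare three-digit field) is an assumption; real formats are messier, which is exactly the problem.

```python
# Hypothetical sketch: computing an error rate from raw log lines.
import re

LOG_LINES = [
    "2024-05-01T10:00:00 GET /checkout 200 123ms",
    "2024-05-01T10:00:01 GET /checkout 500 87ms",
    "2024-05-01T10:00:02 GET /cart 200 45ms",
]

STATUS_RE = re.compile(r"\s(\d{3})\s")  # first 3-digit field = status code

total = errors = 0
for line in LOG_LINES:
    match = STATUS_RE.search(line)
    if not match:
        continue  # unparseable line, silently lost: a cost of this strategy
    total += 1
    if match.group(1).startswith("5"):
        errors += 1

print(f"error rate: {errors / total:.1%}")  # 33.3% on this sample
```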

Daniel Moldovan

Wearing the Site Reliability Engineer and Software Development Engineer hats. Having fun with very large systems.
