When you monitor with logs and have to compute error rate

[Monitoring, Alerting, Logging] Where to use what and why?

Daniel Moldovan
2 min read · Apr 1, 2021

There are multiple strategies for providing visibility into the health, performance, or quality of your system/application. You need this visibility to quickly detect and fix issues that negatively impact your clients, and you obtain it through a combination of alerting, monitoring, and logging.

As life is always about compromise, production systems/applications usually choose something between the two strategies below. I prefer the latter, but many times things lean towards the former.

Strategy 1: Whatever, LOL

  • Monitor using logs: Log each request and try to extract metrics using text parsing and regular expressions. Assess application health, performance, or quality from logs. Wonder at your regular expression skills.
  • Monitor using alerts: Alert on raw counts and not on aggregated metrics, e.g. alert if you have > 10 errors per second, and not on error rate. Alert on absolute values instead of relative ones, like when used storage reaches 139 GB, and not when it reaches 80%. Wonder why alerting volume increases as you add clients and data.
  • Log using alerts: Alert on each error. Alert that something might be wrong when a weird exception occurs. Alert to notify that something failed and it will be retried. Treat every transient failure as a potential problem which needs to be debugged by a person. Wonder why the on-call person has no free time.
  • Rely on client-based monitoring: When your system/application performs badly, you will know as your clients will tell you. Wonder why client attrition is going up.
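
To make the "absolute vs. relative" point concrete, here is a minimal Python sketch. The function names, the 10 errors per second limit, and the 1% error-rate limit are illustrative assumptions, not values from any particular alerting tool. It shows the same 0.5% error rate at two traffic levels: the absolute threshold starts firing only because traffic grew, while the rate-based one stays quiet.

```python
def absolute_alert(errors_per_second: float,
                   max_errors_per_second: float = 10.0) -> bool:
    """Strategy 1 style: fire on a raw count (hypothetical threshold)."""
    return errors_per_second > max_errors_per_second


def relative_alert(errors_per_second: float,
                   requests_per_second: float,
                   max_error_rate: float = 0.01) -> bool:
    """Rate-based alternative: fire on errors as a fraction of traffic."""
    if requests_per_second <= 0:
        return False  # no traffic, so no meaningful error rate
    return errors_per_second / requests_per_second > max_error_rate


# The same 0.5% error rate, before and after adding clients.
# Each tuple is (errors per second, requests per second).
small_load = (5.0, 1_000.0)
large_load = (50.0, 10_000.0)

for errors, requests in (small_load, large_load):
    print(
        f"{requests:>8.0f} req/s:",
        "absolute fires" if absolute_alert(errors) else "absolute quiet",
        "|",
        "relative fires" if relative_alert(errors, requests) else "relative quiet",
    )
```

Nothing about service quality changed between the two loads, yet only the absolute alert pages someone at the larger one.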

Strategy 2: Monitor, Alert, Log.

  • Monitor: Instrument your code to expose meaningful metrics which make assessing current behavior easy and fast. Expose metrics about health, performance, load, or quality of service. Use monitoring systems to collect those metrics and capture behavioral trends and patterns over time. Use these trends and patterns as a foundation for system/application improvements.
  • Alert: Use alerting for discovering quickly that your system/application is in a state in which it negatively affects your clients. Attempt to cover the entire system/application health with as few alerts as possible. Alert on aggregated metrics and statistics, such as percentages, error rates over time, percentiles. Tune alerts to minimise false positives and avoid noise.
  • Log: Use logging to aid in tracing problems to their source. When a production problem occurs, it is important to have enough logging in place to quickly trace its root cause. It helps if you can enable info/debug-level logging on demand on production components without needing code changes and/or releases.
  • Use the above together: Alerting notifies about a potential problem. When an alert fires, monitoring should be used to assess the situation quickly and identify the problem. After the problem is identified, logging can be used to trace the problem to its source.
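
The three pieces can be sketched together in a few lines of Python. This is a rough illustration under stated assumptions, not a production design: `RequestMetrics`, `should_alert`, `set_debug_logging`, the 5% threshold, and the `checkout` logger name are all hypothetical, and a real system would export metrics to a monitoring service instead of keeping them in process. The code records request outcomes as a metric, alerts on the aggregated error rate over a sliding window, and raises log verbosity at runtime once the alert fires.

```python
import logging
from collections import deque


class RequestMetrics:
    """Hypothetical in-process metric: outcomes of the last `window` requests."""

    def __init__(self, window: int = 1000) -> None:
        self.outcomes = deque(maxlen=window)  # True = success, False = error

    def record(self, ok: bool) -> None:
        self.outcomes.append(ok)

    def error_rate(self) -> float:
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)


def should_alert(metrics: RequestMetrics,
                 max_error_rate: float = 0.05,
                 min_requests: int = 100) -> bool:
    """Alert on the aggregated rate, and only with enough samples to trust it."""
    if len(metrics.outcomes) < min_requests:
        return False
    return metrics.error_rate() > max_error_rate


def set_debug_logging(logger: logging.Logger, enabled: bool) -> None:
    """Toggle verbosity at runtime, with no code change or release."""
    logger.setLevel(logging.DEBUG if enabled else logging.WARNING)


# Simulate 200 requests where every tenth one fails (10% error rate).
metrics = RequestMetrics(window=200)
for i in range(200):
    metrics.record(ok=(i % 10 != 0))

log = logging.getLogger("checkout")  # assumed component name
set_debug_logging(log, enabled=False)

if should_alert(metrics):
    # Alert fired: the metric shows how bad it is, debug logs help find why.
    set_debug_logging(log, enabled=True)
```

The `min_requests` guard is one simple way to cut false positives: a single failure among three requests is a 33% error rate, but rarely worth waking someone up for.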

Written by Daniel Moldovan

Wearing the Site Reliability Engineer and Software Development Engineer hats. Having fun with very large systems.