When you monitor with logs and have to compute error rate

[Monitoring, Alerting, Logging] Where to use what and why?

2 min read · Apr 1, 2021

There are multiple strategies for providing visibility into the health, performance, or quality of your system/application. This visibility is needed to quickly detect and fix issues that negatively impact your clients, and it is obtained through a combination of alerting, monitoring, and logging.

As life is always about compromise, production systems/applications usually end up somewhere between the two strategies below. I prefer the latter, but in practice things often lean towards the former.

Strategy 1: Whatever, LOL

Monitor using logs: Log each request and try to extract metrics using text parsing and regular expressions. Assess application health, performance, or quality from logs. Wonder at your regular expression skills.
Monitor using alerts: Alert on raw counts and not on aggregated metrics, e.g. alert if you have > 10 errors per second, and not on error rate. Alert on absolute values instead of relative ones, like when used storage is 139 GB, and not when it is 80% of capacity. Wonder why alerting volume increases as you add clients and data.
Log using alerts: Alert on each error. Alert that something might be wrong when a weird exception occurs. Alert to notify that something failed and it will be retried. Treat every transient failure as a potential problem which needs to be debugged by a person. Wonder why the on-call person has no free time.
Rely on client-based monitoring. When your system/application performs badly, you will know as your clients will tell you. Wonder why client attrition is going up.
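To make the absolute-versus-relative point concrete, here is a minimal sketch. The function names, the traffic numbers, and the 1% error-rate threshold are assumptions made for illustration; only the "> 10 errors per second" threshold comes from the text above.

```python
def absolute_alert(errors_per_second: float, threshold: float = 10.0) -> bool:
    """Strategy 1 style: fire when the raw error count exceeds a fixed number."""
    return errors_per_second > threshold


def relative_alert(errors_per_second: float, requests_per_second: float,
                   max_error_rate: float = 0.01) -> bool:
    """Fire when the fraction of failed requests exceeds max_error_rate."""
    if requests_per_second == 0:
        return False  # no traffic, so no error rate to speak of
    return errors_per_second / requests_per_second > max_error_rate


# The same 0.5% error rate at two traffic levels (made-up numbers).
small = {"requests": 1_000.0, "errors": 5.0}
large = {"requests": 10_000.0, "errors": 50.0}

for name, load in (("small", small), ("large", large)):
    print(name,
          "absolute:", absolute_alert(load["errors"]),
          "relative:", relative_alert(load["errors"], load["requests"]))
```

Service quality is identical in both cases, yet the absolute alert starts firing once traffic grows tenfold, while the rate-based alert stays quiet. That is the "alerting volume increases as you add clients" effect.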

Strategy 2: Monitor, Alert, Log.

Monitor: Instrument your code to expose meaningful metrics which make assessing current behavior easy and fast. Expose metrics about health, performance, load, or quality of service. Use monitoring systems to collect those metrics and capture behavioral trends and patterns over time. Use these trends and patterns as a foundation for system/application improvements.
Alert: Use alerting to discover quickly that your system/application is in a state in which it negatively affects your clients. Attempt to cover the entire system/application health with as few alerts as possible. Alert on aggregated metrics and statistics, such as percentages, error rates over time, and percentiles. Tune alerts to minimise false positives and avoid noise.
Log: Use logging to aid in tracing problems to their source. When a production problem occurs, it is important to have enough logging in place to quickly trace its root cause. It helps if you can enable on-demand info/debug-level information on production components without needing code changes and/or releases.
Use the above together: Alerting notifies about a potential problem. When an alert fires, monitoring should be used to assess the situation quickly and identify the problem. After the problem is identified, logging can be used to trace the problem to its source.
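The division of labour above can be sketched with the Python standard library alone. Everything named here (`RequestMetrics`, `should_alert`, `set_log_level`, the `orders` logger, and the 5% and 500 ms thresholds) is hypothetical and chosen for illustration. A real setup would export the metrics to a monitoring system instead of keeping them in process.

```python
import logging
import math

logger = logging.getLogger("orders")  # hypothetical component name


class RequestMetrics:
    """Monitor: in-process counters that make current behaviour easy to assess."""

    def __init__(self) -> None:
        self.total = 0
        self.errors = 0
        self.latencies_ms = []

    def record(self, latency_ms: float, ok: bool) -> None:
        self.total += 1
        self.latencies_ms.append(latency_ms)
        if not ok:
            self.errors += 1
            # Log: detail for tracing the root cause later, not an alert.
            logger.debug("request failed after %.1f ms", latency_ms)

    def error_rate(self) -> float:
        return self.errors / self.total if self.total else 0.0

    def percentile(self, p: float) -> float:
        """Nearest-rank percentile of the recorded latencies (p in 0..100)."""
        if not self.latencies_ms:
            return 0.0
        ordered = sorted(self.latencies_ms)
        rank = max(1, math.ceil(p * len(ordered) / 100))
        return ordered[rank - 1]


def should_alert(metrics: RequestMetrics, max_error_rate: float = 0.05,
                 max_p95_ms: float = 500.0) -> bool:
    """Alert: decide on aggregates (error rate, p95), never on a single failure."""
    return (metrics.error_rate() > max_error_rate
            or metrics.percentile(95) > max_p95_ms)


def set_log_level(level_name: str) -> None:
    """Log: raise or lower verbosity at runtime, without a code change or release."""
    logger.setLevel(level_name.upper())
```

A usage sketch: call `metrics.record(...)` once per request, evaluate `should_alert(metrics)` periodically, and call `set_log_level("debug")` from whatever admin hook your service exposes when you need more detail during an incident.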

Written by Daniel Moldovan

Wearing the Site Reliability Engineer and Software Development Engineer hats. Having fun with very large systems.
