The secret sauce
In one of my previous roles, I was working as a backend engineer for a microservices based system. We used some/all of these technologies for a given HTTP request -
- Route53 for DNS
- Highly available HAProxy
- A customized Python based web framework
- Managed RDS instances
- Managed ElastiCache
- Highly available Rabbitmq
Can you guess which of the above technologies didn't have any errors?
No. Try again; take a moment.
None of them. All of the above services had occasional errors. Yes, even the managed services that were thousands of US dollars every month. DNS lookup failed, we were unable to connect to database, keylookup failed, publishing to Rabbitmq failed, rendering templates from local disk failed, etc. In fact, we will always have some service failing occasionally in a distributed system. Which is why we need error-reporting to be front and center.
I want to clarify what I mean by error-reporting. To me, it means -
- All stakeholders get real-time/instant notifications of errors in the system.
- The errors are actionable. I.e. there is enough information in error reports to
- Identify the severity (how many users, how often in the last 5 minutes, etc)
- Identify potential solution - not just the exception/stacktrace but also capture the context (variables, etc) when the error happens
In simple terms, the error-reporting system should give me the debugging experience of an IDE (dns lookup failed for cache.example.com on line 42 in src/dns.py) but built for scale of a distributed system.
Many of the solutions out there (especially cloud-providers but also some infrastructure SaaS companies) build error-reporting on top of a logging system. However, there are several problems with this approach (I would even go all the way and say solutions built on-top of logging systems are wrong).
- When an error/exception happens in an application, the application has all the context. It is important for this entire context to be captured for it to be actionable. But logging solutions typically don't capture this context.
- Another problem with this approach is where we are solving the problem. A logging based approach is trying to identify the problem at the infrastructure level (which is why we loose context also) - one level higher than where the error happened.
- Imagine a scenario where we are not shipping logs for sometime (errors will happen). During this period, we are also blind to system-performance. Or we took logging which is generally ok to being buffered/not-time-sensitive to being critical for observability.
Given these arguments, my preferred observability story would look something like this -
A monitoring system like prometheus based on key metrics - number of logins, number of database connections, 50x errors, response times etc. All alerting would happen from this level.
One of the error-reporting system above. When there is an alert, I login here to identify and work with the problem.
A logging solution only for debugging - if a user got banned or there is a payment question from support or a card got declined, etc. Logging is useful for answering questions about one request/transaction/user/payment etc, in my book.
In other words, for a primitive observability story, I only need error-reporting to support a distributed sytem!