Tenets of Microservice Monitoring

I was able to attend Monitorama PDX 2017 this past week and had a blast. If you are interested in monitoring, metrics, related data analysis, alerting, and of course logs, then this is the conference for you. It struck me at Monitorama that many of us came of age in the world before microservices (Service Oriented Architecture, or SOA). But services in a SOA environment are different and should be monitored differently from what we may be used to.

Here’s my take on the basic tenets of monitoring infrastructure and best practices for Service Oriented Architectures. I’m an Operations Engineer doing visibility work for a fairly large client so this comes from the viewpoint of the caretaker of monitoring services. If you are a developer and don’t agree, let me know!

Innovation and Experimentation are Paramount

Modern software development practices place a very high value on the ability to quickly experiment or innovate. This may lead to new ways of thinking, better ways to accomplish a task, and even new products to market. It's a core belief of the DevOps movement to allow continual learning.

How does monitoring, or better, “Observability” fit into this? Software engineers require the ability to observe events happening and build alert rules as needed. However, we must not require them to become experts in the monitoring platform of choice. Observability infrastructure is not part of a software engineering team’s service; it is its own service. This infrastructure needs to actively move out of the way of progress and, rather, enable it.

Good UI tools (and tools in general) can guide engineers toward good decisions when they set up their alert rules, and can double as unit tests for those rules.

Alert Fatigue

Voraciously attack the causes of alert fatigue. This is nothing new, but cannot be left unsaid. Alert fatigue stands in the way of experimentation, innovation, and, frankly, your career. A rule of thumb: 10 or more pages a week is a horrific on-call cycle and indicates there is much work to do.

As you scale, you will see that anomalies are actually abundant. If observations more than $4\sigma$ (four standard deviations) from the mean are anomalous, then, for normally distributed data, about 1 observation in 15,787 is an anomaly. At scale, that’s going to be a frequent occurrence. Anomalies do not equate to alerts!
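
To make that arithmetic concrete, here is a minimal Python sketch. It assumes the observations are normally distributed and counts both tails; the 1,000 observations per second rate is an illustrative assumption, not a figure from any real system.

```python
import math


def two_sided_tail_probability(sigmas: float) -> float:
    """Probability that a normally distributed value lies more than
    `sigmas` standard deviations from the mean, in either direction."""
    return math.erfc(sigmas / math.sqrt(2))


def seconds_between_anomalies(sigmas: float, observations_per_second: float) -> float:
    """Average time between anomalous observations at a given ingest rate."""
    return 1.0 / (two_sided_tail_probability(sigmas) * observations_per_second)


p = two_sided_tail_probability(4)
print(f"1 observation in {1 / p:,.0f} falls outside 4 sigma")

# Assumed ingest rate, for illustration only.
rate = 1000
print(f"At {rate} observations/s: one anomaly every "
      f"{seconds_between_anomalies(4, rate):.1f} seconds")
```

At that assumed rate a "one in 15,787" event shows up several times a minute, which is why paging on every anomaly does not scale.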

Use alerts as part of your feedback mechanism. If a team member was woken up at 3:07 AM, that alert should include a link to a short survey or similar that they can fill out at their leisure. Was this alert actionable? Should an alert rule be changed or tuned? What can the monitoring folks do to help?
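
One way to wire that feedback loop in is to attach the survey link when the page is built. The sketch below is only an illustration: the URL, the `build_page` function, and the payload fields are all hypothetical and do not correspond to any particular paging product.

```python
from urllib.parse import urlencode

# Hypothetical survey endpoint run by the monitoring team.
FEEDBACK_BASE_URL = "https://monitoring.example.com/alert-feedback"

FEEDBACK_QUESTIONS = [
    "Was this alert actionable?",
    "Should an alert rule be changed or tuned?",
    "What can the monitoring folks do to help?",
]


def build_page(alert_rule: str, service: str, summary: str, alert_id: str) -> dict:
    """Build a page payload that carries a link to a short feedback survey."""
    query = urlencode({"alert_id": alert_id, "rule": alert_rule, "service": service})
    return {
        "summary": summary,
        "service": service,
        "rule": alert_rule,
        # The responder can follow this later, at their leisure.
        "feedback_url": f"{FEEDBACK_BASE_URL}?{query}",
        "feedback_questions": FEEDBACK_QUESTIONS,
    }


page = build_page("high_latency", "checkout", "p99 latency above threshold", "abc123")
print(page["feedback_url"])
```

Carrying the rule name and alert id in the link means survey answers can be grouped per rule, so the noisiest rules surface first.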

Software Engineers On Call

Software Engineers take part in an on-call rotation and take responsibility for their services. Support them in this by all means. Keep alert fatigue a non-concern. Teach them to practice reliability.

Observability Unification

Metrics? Alerts? Make these standard across all SOA services. There should be a kit or collection of common code each service uses so as not to reinvent the wheel. If one team creates a new metric to measure latency, then every team should use the same metric to measure latency. Log formats should be unified in the same way. Services will have custom metrics and logs, but be sure that common data is measured with the same metric names.

This allows you to build and automate dashboards. This allows the creation of a set of default alerts for each service. Good tools make for versioned dashboards and alerts that deploy with a service.
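
A minimal sketch of what such a shared kit could look like in Python follows. Every name here (the metric names, `AlertRule`, `default_alerts`) is hypothetical, and the thresholds are placeholders rather than recommendations.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical shared metric names that every service imports instead of
# inventing its own spelling for the same measurement.
REQUEST_COUNT = "request.count"
REQUEST_ERROR_RATIO = "request.error_ratio"
REQUEST_LATENCY_MS = "request.latency_ms"


@dataclass(frozen=True)
class AlertRule:
    service: str
    metric: str
    comparison: str
    threshold: float
    description: str


def default_alerts(service: str) -> List[AlertRule]:
    """Return the default alert rules every service gets.

    Thresholds are placeholder values; a real kit would let each
    service override them.
    """
    return [
        AlertRule(service, REQUEST_LATENCY_MS, ">", 500.0, "latency above 500 ms"),
        AlertRule(service, REQUEST_ERROR_RATIO, ">", 0.01, "error ratio above 1%"),
    ]


for rule in default_alerts("checkout"):
    print(f"{rule.service}: {rule.metric} {rule.comparison} {rule.threshold}")
```

Because the rules are plain data generated from shared names, they can live in version control and deploy alongside the service, and the same names can drive generated dashboards.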

Enable Data Analysis

The amount of data analysis routinely done will vary, but at some point data from the monitoring systems will be used to reach or support a set of conclusions. Ensure that the data used to reach decisions is transparent all the way down the stack. If you are building conclusions based on percentile data, know how those percentiles were calculated, and enable analysis of that data and of other data from the same sources. Others should be able to verify your data and conclusions, just as in the broader scientific community.
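
Here is a small Python sketch of why the calculation method matters. It compares averaging per-host 99th percentiles against taking the 99th percentile of the pooled raw data. The two hosts and their latencies are made up for illustration.

```python
import statistics
from typing import Sequence


def p99(samples: Sequence[float]) -> float:
    """99th percentile. quantiles(n=100) returns 99 cut points; the last is p99."""
    return statistics.quantiles(samples, n=100, method="inclusive")[-1]


# Made-up latencies in milliseconds.
host_a = [10.0] * 1000               # busy, healthy host
host_b = [10.0] * 90 + [500.0] * 10  # quiet host where 10% of requests are slow

# What a dashboard shows if it averages pre-computed per-host percentiles.
averaged = (p99(host_a) + p99(host_b)) / 2

# What you get from the raw observations of both hosts together.
pooled = p99(host_a + host_b)

print(f"average of per-host p99s: {averaged} ms")
print(f"p99 of pooled data:       {pooled} ms")
```

The two figures disagree badly because the quiet host gets the same weight as the busy one when percentiles are averaged. Neither number is "wrong", but a conclusion built on one of them cannot be checked unless readers know which one it is and can get at the underlying data.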

Enable monitoring data to be easily fed into R.
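
One low-tech way to do this is to export samples as tidy CSV (one observation per row), which R's `read.csv` can load directly. The sketch below is an assumption about what such an export could look like; the `Sample` fields and the example rows are hypothetical.

```python
import csv
import io
from typing import Iterable, NamedTuple


class Sample(NamedTuple):
    timestamp: str  # ISO 8601 string
    service: str
    metric: str
    value: float


def to_tidy_csv(samples: Iterable[Sample]) -> str:
    """Serialize samples as CSV with a header row, one observation per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(Sample._fields)
    writer.writerows(samples)
    return buffer.getvalue()


# Hypothetical example rows.
rows = [
    Sample("2017-05-25T03:07:00Z", "checkout", "request.latency_ms", 12.5),
    Sample("2017-05-25T03:07:01Z", "checkout", "request.latency_ms", 480.0),
]
print(to_tidy_csv(rows), end="")
```

Written to a file, this output can be loaded in R with `read.csv("metrics.csv")`.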

Observability vs Predicting the Future

Visibility depends on a developer’s ability to predict the future and instrument all possible causes of future bugs and issues. Observability is a framework by which any event or aspect of the code can be traced and measured.

As Operations folks, we ask a lot of the software engineers we work with. Please instrument your code for metrics, be sure to use structured logging, and when you are done please instrument with an OpenTracing compatible library. It turns out predicting the future is quite complex. What can we do to remove the complexity of all this instrumentation while providing observability into these services?
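
As a taste of what just one of those requests involves, here is a minimal structured-logging sketch using only Python's standard library. The `JsonFormatter` class, the `build_logger` helper, and the `fields` convention are illustrative assumptions, not part of any standard or of the Standard Sensor Format.

```python
import io
import json
import logging
import sys
from typing import TextIO


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Assumed convention: callers pass structured data via extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(name: str, stream: TextIO = sys.stderr) -> logging.Logger:
    """Return a logger that writes JSON lines to `stream`."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output via the root logger
    return logger


log = build_logger("checkout")
log.info("order completed", extra={"fields": {"service": "checkout", "latency_ms": 42}})
```

Even this small amount of plumbing is something every service would otherwise have to write and maintain itself, and metrics and tracing come on top of it.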

The Standard Sensor Format was created to start addressing the work we have here. But our goal is to start removing this complexity from software engineers’ day to day lives.

Safe Space Culture

Finally, work to maintain and improve your company’s culture. Creating a good culture is hard, and fixing a poor culture requires even more effort. But in order to work well between teams, you must build trust. This must be a safe space to work, learn, and share any idea. I would even encourage folks to develop a set of rules to stipulate what is acceptable behavior. Gently remind the group or individuals when the line is crossed. Maintaining a safe work space is part of everyone’s effort to build and maintain a good company culture.

LinuxCzar