Logging and Eventing with the SEARCH Method
When working with Software Engineering teams to improve their observability, I often find that working with a method helps tremendously. A method with a short and catchy acronym really drives the lessons home. Soon I’ve got management writing notes about my acronyms and including them in planning meetings. Methods have a winning madness. Now let’s use them for logging too.
Starting Teams with the Four Golden Signals
Usually, the first step I take with teams is getting them to instrument their code with some standard metrics library. The right library is context sensitive to the situation, but how to use it is not. I ask teams to identify the Four Golden Signals and make sure the application has a metric for each, as sketched in the example after the list. This means we have the following:
- Traffic: A Counter type metric that is incremented for each unit of work.
- Errors: A Counter type metric that is incremented for each unit of work that failed, had an exception, etc.
- Duration: A Histogram type metric (or other Timer type) that builds a distribution of the time it took to complete each unit of work.
- Saturation: How full an application is, or the signal that indicates when that application may need to be scaled up or down. This may be CPU used, RAM used, thread pool usage, work units waiting in a queue, etc.
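Here is a minimal sketch of what that instrumentation might look like with Micrometer, assuming a Java stack like the demo application shown later in this post; the metric names and the work-queue saturation signal are illustrative choices, not a standard.

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class GoldenSignals {
    // In a Spring Boot application the MeterRegistry would be injected instead.
    private final MeterRegistry registry = new SimpleMeterRegistry();

    private final BlockingQueue<Runnable> workQueue = new LinkedBlockingQueue<>();

    // Traffic: one increment per unit of work.
    private final Counter traffic = registry.counter("demo.work.total");
    // Errors: one increment per failed unit of work.
    private final Counter errors = registry.counter("demo.work.errors.total");
    // Duration: a Timer builds a distribution of processing time.
    private final Timer duration = registry.timer("demo.work.duration");

    public GoldenSignals() {
        // Saturation: expose queue depth as a gauge so we know when to scale.
        registry.gauge("demo.work.queue.depth", workQueue, BlockingQueue::size);
    }

    public void handle(Runnable unitOfWork) {
        traffic.increment();
        try {
            duration.record(unitOfWork); // times the unit of work
        } catch (RuntimeException e) {
            errors.increment();
            throw e;
        }
    }
}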
By the time a team has instrumented and identified the 4 Golden Signals (4GS), they have learned about any auto-instrumentation available and have likely added any number of additional metrics the team thinks appropriate. From this point the team can work on dashboarding and building Service Level Objectives (SLOs). Suddenly, the team has a measure of how well they have achieved their goals, anomaly detection on their service, and a way to manage expectations upward depending on their SLOs.
There are other methods commonly used with metrics, such as the RED and USE methods. They prove the point that instrumentation methods work, and work well enough that they are continually being adopted, improved, and iterated on.
Next, I work with the team to improve their logging. Or perhaps the team is so log driven that those conversations have already begun while creating the 4GS metrics. Here is where I would wave my hands about Structured Logging and Eventing, tossing in some JSON examples. But this seems much harder for teams to grasp: making the paradigm shift from grepping (come on now, that really is what you are doing with Splunk) to analysis and data warehousing.
Search as one might, there is no handy method for logging and eventing. The Google SRE books do mention logs in a hand-wavy way but don’t cover them in depth. It is as if the goal is to move away from logging entirely. (Wouldn’t that be great?) But all of the businesses and Software Engineers that I’ve worked with still need logs. Business and compliance reporting must come from logs. Indeed, each unit of work handled should leave behind an immutable, structured audit log. This is where the Google SRE books stop.
Introducing the SEARCH Method
When teams show progress in instrumenting code to elucidate the Four Golden Signals, I shift to improving the existing logging. The same questions we covered for metrics apply here. How does the team use Structured Logging? What should the application actively be logging? What is important to find in those logs?
Let’s introduce the SEARCH Method for Structured Logging and Eventing. The name SEARCH is key because that’s the goal we have with log and event data: it must be searchable. We need to be able to search for errors. Search to build powerful analytics. Search to find the exact customer who complains that the Service Level Agreement (SLA) they signed is being broken. Search to build business intelligence.
Let’s break down the acronym.
S: Structured Schema
Each log event is structured with several fields, very likely using JSON as the encoding. These log events follow a loose schema across the company. For example, a customer_id field would have the same field name in every application, allowing one to easily search for all interactions with that customer across all microservices. Teams extend the schema by storing service-specific fields in sub-hashes they own in the JSON structure. The Elastic Common Schema (ECS) is a great place to base a schema on.
{
  "time": "2022-04-11T01:04:35.343Z",
  "srv": "demo-application",
  "hos": "ordinary",
  "thr": "scheduling-1",
  "level": "ERROR",
  "logger": "j.custommetricsdemo.Scheduler",
  "mdc": {
    "uid": "jjneely",
    "jid": "job-9876",
    "tid": "1234",
    "cid": "DB93F282-5559-49B8-9BBB-F24E0086FE14",
    "trace_id": "b38db9c926d2f21cffba8afe9eae3d1b",
    "trace_flags": "01",
    "span_id": "d35e9a2cc441d8af"
  },
  "msg": "Exception",
  "stack_trace": "j.lang.IllegalArgumentException: max must be greater than min\n\tat j.custommetricsdemo.Scheduler.getRandomNumberInRange(Scheduler.java:156)\n\tat j.custommetricsdemo.Scheduler.wait(Scheduler.java:143)\n\tat j.custommetricsdemo.Scheduler.schedulingTask(Scheduler.java:112)\n\tat j.i.r.NativeMethodAccessorImpl.invoke0(NativeMethodAccessorImpl.java)\n\tat j.i.r.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\tat j.i.r.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.lang.reflect.Method.invoke(Method.java:566)\n\tat o.s.s.s.ScheduledMethodRunnable.run(ScheduledMethodRunnable.java:84)\n\tat i.o.j.i.s.s.SpringSchedulingRunnableWrapper.run(SpringSchedulingRunnableWrapper.java:35)\n\tat o.s.s.s.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54)\n\tat j.u.c.Executors$RunnableAdapter.call(Executors.java:515)\n\tat java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)\n\tat j.u.c.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)\n\tat j.u.c.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\n\tat j.u.c.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\n\t... 1 frames excluded\n",
  "stack_hash": "1642679d"
}
This JSON log event from my demo-application is an example of a structured event showing an exception. Using a standard library to format log messages allows common configuration to be shared and basics like the above error handling to be uniform across the application stack, such as all applications using stack_trace for the stack trace and providing a hash of that stack trace in stack_hash. The above is formatted with Logback.
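As a sketch of how an event like the one above might be produced (this is not the actual demo code): the context fields live in SLF4J’s MDC, which Logback renders into the mdc sub-object once a JSON encoder such as logstash-logback-encoder is configured, while the trace_id and span_id fields are injected by the tracing instrumentation visible in the stack trace.

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

public class Scheduler {
    private static final Logger log = LoggerFactory.getLogger(Scheduler.class);

    public void schedulingTask(String userId, String jobId, String customerId) {
        // Context fields for this unit of work; the JSON encoder emits them
        // in the "mdc" sub-object of every event logged on this thread.
        MDC.put("uid", userId);
        MDC.put("jid", jobId);
        MDC.put("cid", customerId);
        try {
            doWork();
        } catch (IllegalArgumentException e) {
            // Passing the throwable gives the encoder the stack_trace field;
            // stack_hash typically comes from an encoder feature or custom code.
            log.error("Exception", e);
        } finally {
            // Always clean up the thread-local context so fields do not leak
            // into unrelated units of work.
            MDC.clear();
        }
    }

    private void doWork() {
        throw new IllegalArgumentException("max must be greater than min");
    }
}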
This also removes parsing from the log pipeline and moves field type or parsing issues into the team’s code, which means a parsing or typing error is a code fix addressable by the local team, rather than the Observability Team having to write rules in Logstash for every way an error might be represented across all applications. However, Logstash or the log pipeline tool of choice will easily be able to annotate each message with additional namespacing information. This encodes, for example, the Kubernetes cluster, Namespace, and Pod the log event came from in a standard way.
With a reasonably thought out schema, users will find that building regular expressions to search log events also becomes much easier and simpler. Or, ideally, a thing of the past.
E: Errors and Status
Each log event should have a standard field holding the status of the event. The classic example would be the HTTP Status Code for the API request. The goal is to classify each event as successful or some sort of error. This allows robust searching combining fields with an error condition: find all events where this specific customer had a failure. A suggestion here is to make this more than a Boolean field. HTTP Status Codes are a great model to build on. However, this could be as simple as customizing log levels to a local standard.
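As a hedged sketch of that suggestion, an outcome can be mapped to an HTTP-style code and attached to each event under a status field; the field name and the mapping below are illustrative, not a standard.

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

public final class EventStatus {
    private static final Logger log = LoggerFactory.getLogger(EventStatus.class);

    private EventStatus() {}

    // Classify the outcome with an HTTP-style code rather than a boolean:
    // 200 success, 400 the caller's fault, 500 our fault.
    public static int classify(Throwable failure) {
        if (failure == null) return 200;
        if (failure instanceof IllegalArgumentException) return 400;
        return 500;
    }

    // Attach the status to the event so searches can combine it with other
    // fields, e.g. "all events where customer X had a status >= 500".
    public static void logOutcome(String summary, Throwable failure) {
        MDC.put("status", Integer.toString(classify(failure)));
        try {
            if (failure != null) {
                log.error(summary, failure);
            } else {
                log.info(summary);
            }
        } finally {
            MDC.remove("status");
        }
    }
}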
A: Audit the Action
What action was performed? What verb? A GET or POST API request? Each event should record the action taken on behalf of the client; the sketch after the next point shows this field alongside the resource fields.
R: Resources and Users
What resource was this action done on behalf of? What resources were retrieved or affected by this action? This builds the audit trail that will be needed to form business intelligence reports.
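Here is a short sketch covering both the action and the resource fields using the same MDC approach; the verbs, resource strings, and field names are hypothetical examples.

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

public class AuditTrail {
    private static final Logger log = LoggerFactory.getLogger(AuditTrail.class);

    // Record what was done (A) and what it was done to, and for whom (R),
    // so each unit of work leaves an audit trail entry behind.
    public void recordAction(String userId, String verb, String resource) {
        MDC.put("uid", userId);        // who asked for it
        MDC.put("action", verb);       // e.g. "POST"
        MDC.put("resource", resource); // e.g. "invoice/inv-1234"
        try {
            log.info("{} {} on behalf of user {}", verb, resource, userId);
        } finally {
            MDC.clear();
        }
    }
}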
C: Chronicle the Event
The logging library used will affix a timestamp to the log event. If it doesn’t, go get a better library. However, the immutable record of this event should also contain the duration, in seconds, that this event took. This provides a full accounting of when and how long it took to process this event or perform some action on behalf of a user.
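A minimal sketch of recording that duration next to the timestamp the library already provides; the duration_s field name is an illustrative choice.

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

public class ChronicledWork {
    private static final Logger log = LoggerFactory.getLogger(ChronicledWork.class);

    public void handle(Runnable unitOfWork) {
        long startNanos = System.nanoTime();
        try {
            unitOfWork.run();
        } finally {
            // Record how long this unit of work took, in seconds, on the
            // event itself; the logging library supplies the timestamp.
            double seconds = (System.nanoTime() - startNanos) / 1_000_000_000.0;
            MDC.put("duration_s", String.format("%.3f", seconds));
            log.info("Unit of work took {} seconds", seconds);
            MDC.clear();
        }
    }
}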
H: Human Readable
There are many purposes these logs will serve. Often overlooked, and incredibly important, is that these logs will help onboard new team members, SREs, and anyone on-call for this service. Each log event should include a human readable message aimed at providing context and helping a new engineer get up to speed on this application. This is what would typically live in the message or msg field.
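To make the point concrete, here are two versions of the same failure message: the terse "Exception" from the earlier example, and a hypothetical human readable alternative that tells an on-call engineer what was attempted, for whom, and what happens next.

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class HumanReadableMessages {
    private static final Logger log = LoggerFactory.getLogger(HumanReadableMessages.class);

    public void reportFailure(String jobId, String customerId, Exception e) {
        // Terse: assumes the reader already knows this code and this failure.
        log.error("Exception", e);

        // Human readable: says what was being attempted, for whom, and what
        // happens next, so a new engineer or on-call SRE has context.
        log.error("Scheduling job {} for customer {} failed validating its retry window; "
                + "the job will be retried on the next scheduler tick", jobId, customerId, e);
    }
}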
Conclusion
This helps the teams I work with start to reason about what log messages should look like in a data driven, data science like world. It provides a memorable acronym and a template of what a log message should contain. How to achieve this is likely context sensitive. It depends on the languages applications are written in, what instrumentation libraries are supported, and agreeing on a flexible schema that handles most of the use cases. Not all companies have the resources to make Node applications log with identical behavior to Java applications, but some do. Find the place in the middle of the spectrum that brings value to your ability to SEARCH.
See something that needs improving or iterating to make this better? Let me know in the comments!