What is a Site Reliability Engineer?

A Site Reliability Engineer (SRE) is a role within your company. Many of the concepts this role embodies existed before Google popularized it. However, Google did write the books and supply the methods we now associate with SRE. Site Reliability Engineers work across many teams and disciplines to bring about the DevOps culture we’ve been talking about for over a decade.

In contrast, DevOps is very much an organizational and cultural shift to break down silos and communication barriers between teams. It says little about the specific methods and techniques of Operations. SRE, in turn, says little about organizational structure. However, SRE can be a method by which an entity implements a DevOps culture. In practice, if a company has DevOps Engineers, there are lots of questions to ask before accepting a job offer from that company, because that title has no agreed-upon definition. While an SRE role can mean many things and be implemented differently to suit the organization, it is, indeed, a known set of practices. It is a codification of our profession.

I describe Site Reliability Engineering as being able to apply the Scientific Method to Operations work.

Make an observation. Often this comes from analyzing your telemetry about a process or application, or from instrumenting that process.
Form a hypothesis. To make an application or process better, theorize about the specific change that should be made.
Note the prediction. What changes in the initial observation are expected from this change?
Experiment. Make those changes and ensure that the build and integration processes complete normally. Deploy the change.
Analyze the results. Confirm how the initial telemetry changed and whether it changed as expected. Use the information gained to repeat the process and seek further improvements.

This analogy works well with those not familiar with the SRE principles as laid out in Google’s books. In general, the point is that we apply scientific and engineering rigor to this profession. For example, a favored method to deploy a new application version into production is an A/B deployment. Group A is unchanged and becomes the control group, while Group B is upgraded and becomes the experimental group. At this point the differences can be actively measured and compared to see if B exhibits the desired results. This is much like how drug trials are run to prove the efficacy of a new medication, and it follows a prescribed procedure.
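To make the A/B comparison concrete, here is a minimal sketch of one way to measure whether Group B really differs from Group A. It assumes the telemetry gives you a request count and an error count per group, and it uses a two-proportion z-test, which is my choice of statistic for illustration and not something Google’s books prescribe. The function name and the sample numbers are hypothetical.

```python
import math


def compare_error_rates(control_errors, control_total,
                        experiment_errors, experiment_total):
    """Compare error rates of a control group (A) and an experimental group (B).

    Returns (rate_a, rate_b, z, p_value) from a two-sided two-proportion z-test.
    This is an illustrative sketch; real telemetry may need a different test.
    """
    if control_total <= 0 or experiment_total <= 0:
        raise ValueError("each group needs at least one request")

    rate_a = control_errors / control_total
    rate_b = experiment_errors / experiment_total

    # Pooled error rate under the null hypothesis that A and B behave the same.
    pooled = (control_errors + experiment_errors) / (control_total + experiment_total)
    std_err = math.sqrt(pooled * (1 - pooled)
                        * (1 / control_total + 1 / experiment_total))

    if std_err == 0:
        # No variation at all (for example, zero errors in both groups).
        return rate_a, rate_b, 0.0, 1.0

    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rate_a, rate_b, z, p_value


if __name__ == "__main__":
    # Hypothetical numbers: 10,000 requests per group.
    rate_a, rate_b, z, p = compare_error_rates(50, 10_000, 120, 10_000)
    print(f"A={rate_a:.4f} B={rate_b:.4f} z={z:.2f} p={p:.6f}")
```

A small p-value suggests the difference between the groups is unlikely to be noise. Whether that difference is the desired result still depends on the prediction you wrote down before the experiment.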

The guiding principles are as follows. They should sound familiar, as they come from Google’s books.

Embrace Risk: Risk is measured and this allows SREs to gain understanding of how much risk is in play. Risk is no longer an unknown.
Service Level Objectives: The Service Level Objective (SLO) is the common yardstick by which we measure. While each application is different and is measured differently, this abstraction enables an SRE to easily compare the reliability of each service in a standard way.
Eliminate Toil: Not only is automation part of this job, it’s foundational to enabling scalability. SREs commonly own and manage Continuous Integration and Continuous Delivery systems, as these automate the workflow of testing, building, and deploying software. Without an automated workflow it is nearly impossible to perform the repeated testing needed for control groups and experimental groups.
Monitoring and Telemetry: Without good data a scientist cannot produce consistent results that can be verified by their peers. This is true of SRE as well. Therefore, it’s common to see SREs owning the Observability stack to enable gathering and automating the rich data collected from applications.
Release Engineering: Building and automating the process by which software is deployed and managed through its lifetime. This includes what seems like a lost art: Configuration Management.
Simplicity: Best explained in the lore of the Slackware Linux distribution: Keep It Simple, Stupid. Modern software and its surrounding systems are especially complex, and more so in a microservices style architecture. When new tools and techniques are discovered that simplify part of the process, they are measured and, once proven, adopted. Cognitive Burden will always be present, but keeping it contained is the goal of an SRE.
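The first two principles reduce to simple arithmetic. An SLO target implies an error budget, which is how much unreliability the service is allowed over a measurement window. The sketch below assumes a request-based availability SLO; the function name, the dictionary keys, and the sample numbers are hypothetical and only meant to show the calculation.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Summarize a request-based SLO for one measurement window.

    slo_target is a fraction such as 0.999 (meaning 99.9% of requests succeed).
    Returns the measured availability, the number of failures the SLO allows,
    and the fraction of the error budget still unspent (negative if overspent).
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    availability = (total_requests - failed_requests) / total_requests
    allowed_failures = (1 - slo_target) * total_requests
    budget_remaining = 1 - (failed_requests / allowed_failures)

    return {
        "availability": availability,
        "allowed_failures": allowed_failures,
        "budget_remaining": budget_remaining,
        "slo_met": availability >= slo_target,
    }


if __name__ == "__main__":
    # Hypothetical window: one million requests against a 99.9% SLO.
    report = error_budget(0.999, 1_000_000, 250)
    for key, value in report.items():
        print(f"{key}: {value}")
```

Because every service is reduced to the same few numbers, two very different applications can be compared directly: how much budget each has left says how much risk each can still take on.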

One particular note about SRE is that it was developed to support and help scale a microservices architecture. Such an architecture is expressly not a requirement for, or a product of, SRE. Microservices and Service Oriented Architectures are a social technique for scaling large infrastructures. There is much to praise about them, but they also introduce complexities that must be planned and staffed for. Distributed applications aren’t easy. However, the principles and methods of Site Reliability Engineering bring a definition to the field of Operations that has been lacking for many years. SRE can and will enable higher velocities with measured continuous improvement. It also requires communication across teams to create Service Level Objectives, and this works to break down silos, which is a primary goal of the DevOps culture.

This is what a Site Reliability Engineer does. We are engineers and we do real engineering. Real engineering requires mathematical models and scientific rigor to show verifiable results. If you haven’t done any math today in your tech job, are you really working to bring about improvement, success, and a DevOps culture?

This is the expertise I bring to a team. Are you interested in seeing what consulting can do for your business? Email me.

LinuxCzar
