LinuxCzar

Engineering Software, Linux, and Observability. The website of Jack Neely.    

A Site Reliability Engineer Series

  June 23, 2020   Operations   sre

Are you a Site Reliability Engineer? Perhaps your team is looking to start an SRE journey or expand its resources on hand. Or maybe you are looking to improve your skill set. We're all engineers, and it's about time we actually used mathematics to prove our designs and practices. We didn't get to the Moon by winging it after mashing together some components into a workable system. Rather, we used math to ensure our success.

This is very much what I like about the SRE method popularized by Google. Its very first concepts involve building mathematical models to prove the reliability of services and to ensure that continuous improvement is, well, improvement. This is why I find myself oddly enamored with metrics and Observability. They are incredibly powerful tools that allow us engineers to automate real-time mathematical models of our systems. No paper and pencil required... although some of us still like it that way.

This is the beginning of a new series of posts about using the SRE model with your systems and services to ensure their success. Keep the Google books handy for reference. However, I hope these posts will be useful as a cheat sheet for working through the math and statistics involved. Grab some Observability tools (like Prometheus) and buckle up. If you can embrace risk, you can handle working out the math.

Homework: Start reading through the Google SRE Workbook, and refer to the original SRE book as needed.
