LinuxCzar

Engineering Software, Linux, and Observability. The website of Jack Neely.    

Rules of a System Administrator

  May 30, 2021

Over the years, I’ve developed some rules of thumb for systems administration and operations. They serve me well and might serve others too. Here they are in no particular order:

1. Invest In Openness

Most of my solutions involve Open Source Software or Open Standards. They build the most scalable and rigorous infrastructure. Participating in the Open Source community elevates the tools all of us use and is a far more productive way to solve problems. Using Open Standards attracts the best talent to your team: people who are already engaged with the community and experienced with these solutions.

2. Backups Are Sacred

Data is our most valuable asset. Make sure it’s recoverable. See rule #4 and rule #5. A backup kept in a single location isn’t always recoverable. Backups also cover Configuration Management and the ability to reproduce an existing system.
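One way to check that more than one backup location holds a usable copy is to compare checksums against the source. The following is a minimal sketch, not a full restore test: the function names and the idea of passing a list of copy paths are assumptions made for illustration, and a matching checksum only shows the bytes arrived intact, not that the system can be rebuilt from them.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_copies(source: Path, copies: list[Path]) -> dict[str, bool]:
    """Map each backup copy to True if it exists and matches the source.

    `copies` is a hypothetical list of backup locations, for example one
    path per storage site.
    """
    expected = sha256_of(source)
    return {
        str(copy): copy.is_file() and sha256_of(copy) == expected
        for copy in copies
    }
```

Running a check like this on a schedule, and alerting when any location reports False, turns "we have backups" into something that is tested rather than assumed.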

3. Use a Versioning Tool (Git)

Use a versioning tool with all documentation, code, scripts, configuration, packages, everything. It makes things easier to back up (although versioning tools alone are not backups). It gives the power to locate exactly what changed to cause a bug, who made the change, and why. It gives the ability to travel in time, which is always handy.

4. Failure Will Happen – Plan For It

From hard drives to cloud providers, failure will happen. The crucial question is not if, but when. Know the technologies and tools well enough to plan for failure and be able to handle it gracefully. Use SLAs and SLOs to judge how and when to react to failure.
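As an illustration of using an SLO to decide when to react, the sketch below computes how much of an error budget is left for a request-based SLO. The function names, the request-count model, and the 50% paging threshold are all assumptions chosen for the example, not a prescribed policy.

```python
def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent).

    slo_target is the success-rate objective, e.g. 0.999 for "three nines".
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    if total <= 0:
        return 1.0  # no traffic in the window, so no budget was spent
    allowed_failures = (1.0 - slo_target) * total
    return 1.0 - failed / allowed_failures


def should_page(slo_target: float, total: int, failed: int,
                threshold: float = 0.5) -> bool:
    """Page a human once less than `threshold` of the budget remains.

    The 0.5 default is an arbitrary value for illustration.
    """
    return error_budget_remaining(slo_target, total, failed) < threshold
```

With a 99.9% target and 100,000 requests, roughly 100 failures are allowed, so 25 failures leave about three quarters of the budget and would not page under the assumed threshold.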

There are three main failure domains to plan for.

  1. A dependent API or Service fails.
  2. An availability zone becomes inoperable.
  3. A regional outage requires servicing customers from a different global location.

Understand and record what failure scenarios should be anticipated and planned for. This needs to be much more specific than “loss of Data Center 1” or “us-east-1 fails.” If an asteroid falls out of the sky, we are going to have larger problems to solve first.

5. Automate Everything

If you might do a task again, it’s worth automating. Repeating the task becomes significantly more efficient, and the process will not be lost or forgotten for rarely performed work.

6. Testing Is a Ritual

Test, test, and automate testing. Don’t touch production with an untested process. Don’t write untested documentation either.

7. Never Be Without a Pen/Pencil

Some things are best kept on paper, and few software solutions can adequately hold and represent what is in the brain. Keep a notebook and writing instrument handy. The best system designs are sketched on paper.

8. The Scotty Factor

Multiply time estimates by 4. The task will usually take longer than first thought. Sometimes the reputation of a miracle worker follows.

9. Network With Peers

There is nothing more valuable than your own network of IT folks. Maintain friendships and find peers inside and outside the company to bounce ideas off of.

10. Read Only Friday

Use the last day of the week for documentation, coding, or anything other than changing systems that are even remotely production. Time is needed to catch up on these tasks, and no one likes weekend pages due to a Friday change, or evening pages due to a late-day change.

The documentation written is invaluable for those who handle pages, including yourself.
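A Read Only Friday policy can also be enforced mechanically, for example as a guard at the start of a deployment script. This is a minimal sketch: the function name and the 16:00 late-day cutoff are assumptions for illustration, and it uses whatever local time it is given without handling time zones or holidays.

```python
from datetime import datetime

FRIDAY = 4  # datetime.weekday() counts Monday as 0, so Friday is 4
LATE_DAY_HOUR = 16  # assumed cutoff; pick whatever suits the on-call rota


def change_allowed(now: datetime, late_day_hour: int = LATE_DAY_HOUR) -> bool:
    """Return True only when a production change fits the rule.

    Changes are refused all day Friday and from `late_day_hour` onward
    on any other day.
    """
    return now.weekday() != FRIDAY and now.hour < late_day_hour
```

A deployment script could call `change_allowed(datetime.now())` and exit early with a message when it returns False, leaving an explicit override for genuine emergencies.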

Other Quotes to Live By

  1. “NO system should EVER rely on user behavior to remain stable.”
  2. “You either do your job well, or you do your job continuously.” – King’s Law
  3. “Every time I fix a problem by rebooting (rather than knowing the real cause and fixing it) I feel a little bit of me dies inside. It hurts our industry and our profession when we develop bad habits like guessing instead of knowing.” – Tom Limoncelli
  4. “Fools ignore complexity. Pragmatists suffer it. Some can avoid it. Geniuses remove it.” – Alan Perlis