Writing Documentation

Documentation is every IT professional’s job. Yet it is one of the most overlooked areas of the work, whether your title is IT, DevOps, System Administrator, Developer, or something else. To make writing (and reading) documentation successful, make it easy to do. Create a set of stock templates that cover most of your writing needs. These may be design documents describing how a specific bit of infrastructure works, how the continuous integration pipeline works, or even the architecture of a large coding project. They also include smaller, easier things such as pager playbooks and common processes.

I used to keep a 3x5 note card with the generic layout I use to write documentation. Much of my IT-related documentation follows this pattern. I grew tired of keeping up with the note card and moved it here for reference.

I lifted and simplified this layout from the documentation section of Tom Limoncelli’s Ops Report Card. Over the years I’ve made some changes to better fit my practices. A new service or bit of infrastructure can end up generating quite a lot of documentation. Part of the process is deciding what goes on a single wiki page versus what belongs in a collection of supporting pages.

The TL;DR (Too Long; Didn’t Read) factor is a problem. It’s important that the very first text on the wiki page clearly states what the reader will learn and who the intended audience is. This may even be a bullet list or, for a Pager Playbook, the exact name of the alert it resolves and a short description.

Using these notes, build common templates. Build common documentation patterns in your wiki. Make documentation about services easy to find and standard.

Overview or Summary

A summary of what this is.
Where does this service live?
Why do we need it?
Upstream documentation
Other moving parts that make the whole
Subject Matter Expert contacts
Locations of following documentation if multiple pages are used

Design

Visual representation of how this works
Logic or Data Flow
Alternative solutions and why they were not chosen
How high availability is achieved

Common Tasks or Process

FAQ
Provisioning
Common tasks needed for care and feeding.
If this is a documented process, that process goes here, step by step.

Deployment or Building

Do we build the software locally? If so, how do we do it?
How do we deploy more of these machines or replace busted ones?
Where is the configuration (Puppet/Chef/Ansible)?
Hardware Requirements

How might the system fail?
What does failure mean?
What risks do we run?
What to do to restore each service or part
What side effects happen when specific parts are down or malfunctioning
A Playbook for each kind of alert?

Disaster Recovery Plans

How is (or isn’t) this system recoverable from a disaster situation?
What disasters have we planned for?
HA plans can fit here too
Steps that need to happen to recover
Risks

Key Performance Indicators

Service Level Agreement (either real/legal or social)
Service Level Objectives, for example: 95% or more of requests in a 24-hour period will be serviced in 500 ms or less.
See Circonus’s Documentation – please try to get the math right
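
To make the example objective above concrete (95% of requests within 500 ms over a 24-hour window), here is a minimal Python sketch of the arithmetic. It assumes you already have the raw latency of every request in the window as a list of milliseconds. The function name `slo_met`, its parameters, and the choice to reject an empty window are illustrative assumptions, not part of any monitoring product.

```python
def slo_met(latencies_ms, threshold_ms=500.0, target=0.95):
    """Check one window of request latencies against a latency objective.

    Returns (fraction_within_threshold, objective_met).
    Assumption: a window with no requests is treated as an error, since
    "95% of nothing" is a policy decision rather than a math question.
    """
    if not latencies_ms:
        raise ValueError("no requests in window")
    # Count requests serviced in less than or equal to the threshold.
    within = sum(1 for ms in latencies_ms if ms <= threshold_ms)
    fraction = within / len(latencies_ms)
    return fraction, fraction >= target


# 96 fast requests and 4 slow ones: 96% are within 500 ms, so the objective is met.
print(slo_met([120] * 96 + [900] * 4))

# 94 fast requests and 6 slow ones: 94% are within 500 ms, so the objective is missed.
print(slo_met([120] * 94 + [900] * 6))
```

Note that this counts individual requests. Averaging per-minute percentiles over the day is a different calculation and generally does not give the same answer, which is one way the math goes wrong.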

Notes

Any notes about the service
Things that don’t fit well above
Future to-dos or improvements
Uncommonly needed tasks
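
One way to turn these notes into a stock template is a small script that prints a page skeleton. The Python sketch below is only an illustration: `SECTIONS`, `render_template`, and the service name are hypothetical, and the dictionary holds just a few of the headings and prompts from the notes above, so extend it with the rest. It puts the audience and what the reader will learn at the very top of the page.

```python
# Hypothetical outline: a few headings and prompts copied from the notes above.
SECTIONS = {
    "Overview or Summary": [
        "A summary of what this is.",
        "Where does this service live?",
        "Why do we need it?",
    ],
    "Design": [
        "Visual representation of how this works",
        "Logic or Data Flow",
    ],
    "Notes": [
        "Any notes about the service",
    ],
}


def render_template(service, audience, learn, sections=SECTIONS):
    """Return a plain-text page skeleton that leads with audience and purpose."""
    lines = [
        service,
        "",
        f"Audience: {audience}",
        f"What you will learn: {learn}",
        "",
    ]
    for heading, prompts in sections.items():
        lines.append(heading)
        lines.append("")
        lines.extend(prompts)  # each prompt is a question the writer replaces
        lines.append("")
    return "\n".join(lines)


# Example usage with made-up values; paste the output into a new wiki page.
page = render_template(
    "example-service",
    "on-call engineers",
    "how the service is built, deployed, and restored",
)
print(page)
```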

LinuxCzar