Documentation is every IT professional’s job. Yet it is one of the most overlooked parts of that job, whether your title says IT, DevOps, System Administrator, Developer, or something else. To make writing (and reading) documentation successful, make it easy to do. Create a set of stock templates that cover most of your writing needs. These may be design documents for how a specific bit of infrastructure or the continuous integration pipeline works, or even the architecture of a large coding project. They also include smaller, easier things such as pager playbooks and common processes.
I used to keep a 3x5 note card with the generic layout I use to write documentation. Much of my IT-related documentation follows this pattern. I grew tired of keeping up with the note card and moved it here for reference.
I lifted and simplified this layout from Tom Limoncelli’s Ops Report Card section on documentation. Over the years I’ve made some changes to better fit my practices. A new service or bit of infrastructure can end up generating quite a lot of documentation. Part of the process is defining what goes on a single wiki page versus what might be a collection of supporting pages.
The TL;DR (Too Long; Didn’t Read) factor is a problem. It’s important that the very first text on the wiki page clearly states what will be learned by reading it and who the intended audience is. This may even be a bullet list, or the exact name of the alert a Pager Playbook will resolve along with a short description.
Using these notes, build common templates. Build common documentation patterns in your wiki. Make documentation about services easy to find and standard.
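One way to make a template easy to reuse is to keep the outline as data and print a blank page from it. The sketch below is only an illustration under a few assumptions: the section headings and prompts are copied (and abbreviated) from the notes in this post, while the `render_template` function, the `Audience:` line, and the plain-text output format are hypothetical and would need adapting to whatever markup your wiki uses.

```python
# Section headings and prompts are taken from the notes in this post.
# The list is abbreviated, and the output format is an assumption.
TEMPLATE_SECTIONS = {
    "Overview or Summary": [
        "A summary of what this is.",
        "Where does this service live?",
        "Why do we need it?",
    ],
    "Common Tasks or Process": [
        "Common tasks needed for care and feeding.",
    ],
    "Deployment or Building": [
        "How do we deploy more of these machines or replace busted ones?",
    ],
    "Disaster Recovery Plans": [
        "What disasters have we planned for?",
    ],
    "Key Performance Indicators": [
        "Service Level Agreement (either real/legal or social)",
    ],
}


def render_template(service_name, audience):
    """Return a plain-text page skeleton for one service (hypothetical helper)."""
    # Put the title and intended audience first so readers see them immediately.
    lines = [service_name, f"Audience: {audience}", ""]
    for heading, prompts in TEMPLATE_SECTIONS.items():
        lines.append(heading)
        for prompt in prompts:
            lines.append(f"- {prompt}")
        lines.append("")  # blank line between sections
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_template("Example Service", "on-call engineers"))
```

Keeping the outline in one place means every new page starts with the same headings in the same order, which is most of what makes documentation easy to find.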
Overview or Summary
- A summary of what this is.
- Where does this service live?
- Why do we need it?
- Upstream documentation
- Other moving parts that make the whole
- Subject Matter Expert contacts
- Locations of following documentation if multiple pages are used
- Visual representation of how this works
- Logic or Data Flow
- Alternative solutions and why they were not chosen
- How high availability is achieved
Common Tasks or Process
- Common tasks needed for care and feeding.
- If this is a documented process, that process, step by step, goes here.
Deployment or Building
- Do we build the software locally? If so, how do we do it?
- How do we deploy more of these machines or replace busted ones?
- Where is the configuration in Puppet/Chef/Ansible?
- Hardware Requirements
- How might the system fail?
- What does failure mean?
- What risks do we run?
- What to do to restore each service or part
- What side effects happen when specific parts are down or malfunctioning
- A Playbook for each kind of alert?
Disaster Recovery Plans
- How is (or isn’t) this system recoverable from a disaster situation?
- What disasters have we planned for?
- HA plans can fit here too
- Steps that need to happen to recover
Key Performance Indicators
- Service Level Agreement (either real/legal or social)
- Service Level Objectives, for example: 95% or more of requests in a 24-hour period will be serviced in 500 ms or less.
- See Circonus’s Documentation – please try to get the math right
- Any notes about the service
- Things that don’t fit well above
- Future to do or improvements
- Uncommonly needed tasks
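
The Service Level Objective example under Key Performance Indicators (95% or more of requests serviced in 500 ms or less) is a statement about the share of requests, not about the average latency. A minimal sketch of that arithmetic follows. The function name `slo_met`, its parameters, and the sample latencies are all made up for illustration, and treating an empty window as meeting the objective is an assumption you may want to change.

```python
def slo_met(latencies_ms, threshold_ms=500.0, target_percent=95):
    """Check a latency SLO over one window of requests.

    Returns (met, within, total), where `within` is the number of
    requests serviced in `threshold_ms` or less.
    """
    total = len(latencies_ms)
    # "Less than or equal to" the threshold counts as within the objective.
    within = sum(1 for ms in latencies_ms if ms <= threshold_ms)
    # Compare counts with integer arithmetic to avoid rounding surprises.
    # An empty window evaluates to True here (assumption).
    met = within * 100 >= target_percent * total
    return met, within, total


if __name__ == "__main__":
    # Hypothetical sample: 18 fast requests, one exactly at 500 ms, one slow.
    sample = [120.0] * 18 + [500.0, 900.0]
    print(slo_met(sample))            # 19 of 20 requests are within 500 ms
    print(slo_met(sample + [900.0]))  # 19 of 21 requests are within 500 ms
```

Counting requests against the threshold, rather than averaging latencies, matters because a handful of very slow requests can hide behind a healthy-looking mean.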