A set of Site Reliability Engineering notes & challenges
updated at Jan. 12, 2024, 8:33 a.m.
A lifecycle model for describing incident management
updated at March 10, 2024, 12:03 p.m.
Run Book / Operations Manual template for modern software systems
updated at March 19, 2024, 3:31 p.m.
Tips and tricks for getting through on-call
updated at March 23, 2024, 5:56 a.m.
Compilation of public failure/horror stories related to Kubernetes
updated at April 16, 2024, 3:47 p.m.
A collection of postmortem templates
updated at April 19, 2024, 12:42 a.m.
A collection of postmortems. Sorry for the delay in merging PRs!
updated at April 19, 2024, 3:30 p.m.
A curated list of Chaos Engineering resources
updated at April 20, 2024, 8:24 a.m.
A curated list of Site Reliability and Production Engineering Tools
updated at April 20, 2024, 2:18 p.m.
📙 Amazon Web Services — a practical guide
updated at April 21, 2024, 5:48 a.m.