A set of Site Reliability Engineering notes & challenges
updated at Jan. 12, 2024, 8:33 a.m.
A lifecycle model for describing incident management
updated at March 10, 2024, 12:03 p.m.
Run Book / Operations Manual template for modern software systems
updated at March 19, 2024, 3:31 p.m.
Tips and tricks for getting through on-call
updated at March 23, 2024, 5:56 a.m.
A collection of postmortems. Sorry for the delay in merging PRs!
updated at May 1, 2024, 9:10 p.m.
A collection of postmortem templates
updated at May 2, 2024, 12:09 a.m.
Compilation of public failure/horror stories related to Kubernetes
updated at May 3, 2024, 12:26 p.m.
A curated list of Site Reliability and Production Engineering Tools
updated at May 4, 2024, 12:46 a.m.
A curated list of Chaos Engineering resources
updated at May 4, 2024, 11:55 a.m.
📙 Amazon Web Services — a practical guide
updated at May 5, 2024, 2:45 a.m.