A set of Site Reliability Engineering notes & challenges
updated at Jan. 12, 2024, 8:33 a.m.
A lifecycle model for describing incident management
updated at March 10, 2024, 12:03 p.m.
Run Book / Operations Manual template for modern software systems
updated at March 19, 2024, 3:31 p.m.
Tips and tricks for getting through on-call
updated at March 23, 2024, 5:56 a.m.
A collection of postmortem templates
updated at April 26, 2024, 4:46 p.m.
Compilation of public failure/horror stories related to Kubernetes
updated at April 27, 2024, 9:16 a.m.
A curated list of Site Reliability and Production Engineering Tools
updated at April 27, 2024, 2:11 p.m.
📙 Amazon Web Services — a practical guide
updated at April 27, 2024, 5:03 p.m.
A collection of postmortems. Sorry for the delay in merging PRs!
updated at April 27, 2024, 8:13 p.m.
A curated list of Chaos Engineering resources
updated at April 28, 2024, 7:29 a.m.