# Postmortems

Emergencies and outages happen! It's inevitable. A postmortem attempts to determine what we can do in the future to help
prevent similar outages from occurring. Postmortems do *not* aim to answer the question "whose fault is this?", and the
outcome of a postmortem should *never* be "be more careful". There are pieces of our process and infrastructure that
explicitly allow these emergencies to happen, and we've all contributed to them being built this way. Let's fix them.

## How to wind down your emergency

If your emergency is still happening right now, go solve it. This page is for what to do after the emergency has been
resolved!

### Send an initial email

The first thing to happen after an emergency should be a short email to some combination of tech@twitch.tv and
emergency@twitch.tv that lists the following attributes of the incident:

1. **Summary**: Short description of what actually happened!
2. **Impact**: What effect did the emergency have on users?
3. **Root cause**: Why did this issue occur? It's entirely possible that you don't know this yet, as you haven't had the
   postmortem! It's okay to leave it out.
4. **Timeline**: Provide ordered timestamps of what occurred when; include everything from the deploy that caused the
   issue, to the first time one of us noticed it, to its resolution.
5. **Code changes**: If applicable, link to the code reviews, commits, or just plain diffs that caused and/or resolved
   the issue.
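
If you'd rather not assemble these fields by hand at the end of a long incident, the list above can be sketched as a small template. This is only an illustration, not an existing tool: the `IncidentEmail` class, its field names, and the plain-text layout are all assumptions made for this example. The root cause is optional, matching the note above that you may not know it yet.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class IncidentEmail:
    """Hypothetical container for the five attributes of the initial email."""

    subject: str
    summary: str
    impact: str
    # (timestamp, event) pairs, already in chronological order
    timeline: List[Tuple[str, str]]
    # May be unknown until the postmortem has happened
    root_cause: Optional[str] = None
    # Links to code reviews, commits, or diffs; may be empty
    code_changes: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Return the email as plain text, skipping sections with no content."""
        lines = [f"Subject: {self.subject}", ""]
        lines += ["Summary:", self.summary, ""]
        lines += ["Impact:", self.impact, ""]
        if self.root_cause:
            lines += ["Root cause:", self.root_cause, ""]
        lines.append("Timeline:")
        lines += [f"{ts} - {event}" for ts, event in self.timeline]
        if self.code_changes:
            lines += ["", "Code changes:"]
            lines += self.code_changes
        return "\n".join(lines)
```

Calling `render()` on an instance with no `root_cause` simply omits that section, so the same template works both before and after the cause is understood.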

#### Example email

This is a real email sent out after a real emergency and it's great!

> Subject: 10/14 - Increased 5XX Responses From NGINX this Afternoon  
> To: TwitchTV Emergency \<emergency@justin.tv\>, tech \<tech@justin.tv\>  
> 
> **Summary**:  
> Starting at 12:10pm we saw an increased amount of 5xx responses from NGINX, peaking at 12:20pm and resolved by 12:30pm. SFO Rails boxes were incorrectly directing read traffic to AWS SiteDB. The reduced capacity of AWS read-slaves caused Rails utilization to pin at 100% and NGINX to return 5XX responses.
>
> **Impact**:  
> During this time users will have experienced increased request/response times across the site and various APIs.
>
> **Root Cause**:  
> A few weeks ago, I merged a change to configure AWS Rails boxes to point to AWS SiteDB. The change included a conditional on a Facter variable to determine which SiteDB backend to load. On SFO boxes the aws_sitedb Fact is not defined, but the empty string returned by puppet resolves to true. Due to insufficient testing in all environments, this was not caught.
>
> **Timeline**:  
> 11:50AM - In an effort to resolve issues with consul-template Matt Bollier begins a puppet run across the fleet of rails-app boxes in SFO.  
> 1:00PM - Puppet run completes and Doug notices that Rails utilization is diverging and on an upwards trajectory. Matt Bollier notices that rails-postgres-replica-0 is experiencing heavy load.  
> 1:03PM - Doug identifies a broken HAProxy config.  
> 1:05PM - Doug manually pushes a known working HAProxy config to Rails boxes.  
> 1:08PM - Dan and Aaron Brashears identify the root cause, the SiteDB HAProxy configuration for AWS boxes has been pushed to SFO boxes.  
> 1:14PM - Doug reports utilization and CCU beginning to return to normal.  
> 1:43PM - Aaron Brashears merges a puppet fix that addresses the issue for SFO boxes.  
>
> There are two changes to make rails-app HAProxy configs cleaner and puppet safe to run again:  
> [https://git.xarth.tv/systems/puppet/pull/2515][pull1] - Include only the necessary backends  
> [https://git.xarth.tv/systems/puppet/pull/2517][pull2] - Use str2bool to enable using facter variables for boolean comparison  
>
> Full 5-whys postmortem incoming...

[pull1]: https://git.xarth.tv/systems/puppet/pull/2515
[pull2]: https://git.xarth.tv/systems/puppet/pull/2517

### Perform a postmortem
Next, schedule a postmortem! Try to include everyone directly involved with the event without including too many people. Generally you want to cap it at around 6 people -- many more and you risk derailing the postmortem with comments and questions not directly related to preventing the problem in the future.

Your postmortem should happen 1-2 days after the emergency. You want people to sleep on it before the postmortem, but you don't want them to forget everything that happened. Postmortems typically hold a pretty high meeting priority; you will often be scheduling them on top of people's other meetings in order to get them to happen during the 1-2 day sweet spot.

### Send a follow-up email
After the postmortem, reply to your first email with all the information that's been discovered since. Link to your postmortem doc and, if necessary, include any short takeaways you think recipients should definitely see.
