# EML Oncall Runbook

E2 Messaging Layer is composed of multiple micro-services that run on ECS Fargate.

## On Call Preparation

 * General Preparation: https://wiki.twitch.com/display/DS/Oncall+Preparation
 * AWS accounts: `twitch-eml-dev` and `twitch-eml-prod`
 * Slack channel: `#eml`

## Monitoring

 * CloudWatch: https://isengard.amazon.com/federate?account=342135511598&role=admin&destination=%2Fcloudwatch%2Fhome%3Fregion%3Dus-west-2%23dashboards%3Aname%3DE2ML-Operational
 * Amazon Profiler: https://profiler.amazon.com/profile?profileName=TwitchAutoprof%2Fv1%2Fcode.justin.tv%2Fdevhub%2Fe2ml%2Fservices%2Fthreshold%2Fcmd%40%2Fus-west-2%2FPrimary%2FProd&type=live (or, go to profiler.amazon.com and search for "TwitchAutoprof e2ml").


## AWS ECS Services

EML services run in AWS ECS as Fargate service tasks. Direct isengard links to production services in AWS:

 * Threshold: https://isengard.amazon.com/federate?account=342135511598&role=admin&destination=%2Fecs%2Fhome%3Fregion%3Dus-west-2%23%2Fclusters%2FProdThresholdService-ClusterEB0386A7-28PXMIVABFVD%2Fservices
 * Greeter: https://isengard.amazon.com/federate?account=342135511598&role=admin&destination=%2Fecs%2Fhome%3Fregion%3Dus-west-2%23%2Fclusters%2FProdGreeterService-ClusterEB0386A7-14F243CYT8VPN%2Fservices
 * Pathfinder: https://isengard.amazon.com/federate?account=342135511598&role=admin&destination=%2Fecs%2Fhome%3Fregion%3Dus-west-2%23%2Fclusters%2FProdPathfinderService-ClusterEB0386A7-1I254F53RW2S0%2Fservices
 * Source: https://isengard.amazon.com/federate?account=342135511598&role=admin&destination=%2Fecs%2Fhome%3Fregion%3Dus-west-2%23%2Fclusters%2FProdSourceService-ClusterEB0386A7-M3EP3Q356332%2Fservices

See the [architecture.md](./architecture.md) doc for more info about services.

## Recipes

#### After Deploys, Check for Collisions of Address History

During the initial alpha tests, we identified a small chance for multiple Sources to handle a connection to the same address. This is called a "Collision" and is not supposed to happen.

Pathfinders talk to each other to decide which Source should get a new connection. However, if a hardware failure happens during high load and the Sources are restarted at the same time (e.g. during deploys), one Pathfinder may get delayed availability information and decide to assign the new connection to another Source. This causes a Collision.

There's planned work to fix this, but for now, we have to do the following manual steps to prevent and fix it after each deploy:

 * Check the number of collisions in the CloudWatch dashboard.
 * If there are any collisions, that means multiple Sources have the same address history.
 * Go to [Pathfinder in AWS ECS](https://isengard.amazon.com/federate?account=342135511598&role=admin&destination=%2Fecs%2Fhome%3Fregion%3Dus-west-2%23%2Fclusters%2FProdPathfinderService-ClusterEB0386A7-1I254F53RW2S0%2Fservices), open the logs, and search for "Collision" to see WARN level logs with the collision errors.

Collision errors look like this:

```
warning Collision detected  ^loadtest@1?n=685 [ws://10.0.118.193:3003 ws://10.0.173.210:3003]
```

The error contains the addresses (`ws://<IP>:<port>`) of the Sources that have collisions. To stop one of the Sources with the collisions:

 * Go to [Source in AWS ECS](https://isengard.amazon.com/federate?account=342135511598&role=admin&destination=%2Fecs%2Fhome%3Fregion%3Dus-west-2%23%2Fclusters%2FProdSourceService-ClusterEB0386A7-M3EP3Q356332%2Fservices)
 * Find one of the Sources by IP, either by opening each Source one by one or by searching the logs: go to Logs, select the `source` container, and search for the IP address. You should see the line "Reporting hostname ...", and a link to the Task.
 * Stop one of the Source tasks with one of the IPs from the log error message. Stop only one of the Sources; that is enough to resolve the collision. After that, go to CloudWatch and verify that the Source task restarted properly and the collisions were resolved (the collisions graph should no longer show any collisions).
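
Pulling the Source IPs out of a collision log line by hand is error-prone during an incident. The helper below is a minimal sketch that assumes the log format shown above (`ws://<IP>:<port>` entries inside brackets); the function name `collision_ips` is hypothetical and not part of the repo scripts.

```shell
# Print one Source IP per line from a Pathfinder collision log line.
# Assumes addresses appear as ws://<IPv4>:<port>, as in the sample above.
collision_ips() {
  printf '%s\n' "$1" \
    | grep -oE 'ws://[0-9.]+:[0-9]+' \
    | sed -E 's#^ws://([0-9.]+):[0-9]+$#\1#'
}

collision_ips 'warning Collision detected  ^loadtest@1?n=685 [ws://10.0.118.193:3003 ws://10.0.173.210:3003]'
# prints:
# 10.0.118.193
# 10.0.173.210
```

Either of the printed IPs can then be used to search the Source logs as described in the steps above.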


#### Emergency Restarts (quick)

The regular deploy/restart process is designed to be zero-downtime and to preserve as much capacity as possible, but it is very slow. In an emergency, the important part is that services come back online in order, so use the restart script when possible:

```
# Stop all tasks in Pathfinder, Threshold and Source (in order)
make hard-restart-prod
```

#### Restart E2ML Services (full slow restart)

##### Restart from the AWS Console:

Update the service with "Force new deployment" from the AWS console:

 * Repeat for each service in order: Pathfinder, Threshold, Source. (Note: Greeter is stable and almost never needs to be restarted)
 * Go to the AWS console ECS > Clusters > Select cluster > Select service > Update Service
 * Check "Force new deployment" and save. This will add new tasks and stop the old ones.

The restart will take a while, especially on Threshold (about 1 hour). To check whether the deploy is done, look at the start time of each task and make sure it is recent. The Service Lifecycle graph in the CloudWatch dashboard also helps (it shows when new services start).

##### Restart with a script:

For convenience, there are shell scripts that use the AWS CLI to call `ecs update-service --force-new-deployment` from the terminal and then check the task timestamps.

Make sure to `mwinit` first, and then use the script in this order (see README - Deploy Order for details):

```
./scripts/ecs-restart.sh prod pathfinder
./scripts/ecs-restart.sh prod pathfinder --status

./scripts/ecs-restart.sh prod threshold
./scripts/ecs-restart.sh prod threshold --status

./scripts/ecs-restart.sh prod source
./scripts/ecs-restart.sh prod source --status
```

#### Restart a Single Task (kill one instance)

Sometimes a single task may have a problem, such as excessive CPU usage or causing collisions. To restart it, go to the AWS console ECS, select the cluster and service, open Tasks, and stop the task that needs to be restarted. When a task is stopped, ECS starts a new one to meet the desired number of instances.
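
The same stop can be done from the terminal with the AWS CLI `ecs stop-task` command. The sketch below only prints the command so it can be reviewed before running; the function name `stop_task_cmd`, the reason text, and the example cluster and task values are hypothetical placeholders, and it assumes `mwinit` and AWS credentials are already set up.

```shell
# Print (do not run) the AWS CLI command that stops one ECS task.
# $1 = cluster name or ARN, $2 = task ID or ARN (both supplied by the caller).
stop_task_cmd() {
  cluster="$1"
  task="$2"
  printf 'aws ecs stop-task --cluster %s --task %s --reason %s\n' \
    "$cluster" "$task" "oncall-manual-restart"
}

# Example with placeholder values; copy the printed line to execute it.
stop_task_cmd EXAMPLE_CLUSTER EXAMPLE_TASK_ID
```

Printing instead of executing is a deliberate choice here: it leaves a chance to double-check the task before ECS replaces it.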

#### Deploy/Restart Threshold

Threshold is the service with the highest task count, and because of that it is very slow to deploy.

There are many instances because it scales with the number of listeners (more users need more Thresholds). We currently provision the Threshold cluster with 10% more instances than needed. ECS starts new Threshold tasks on those extra instances, waits until they are healthy, and then kills the same number of old tasks (to keep at least 100% of desired tasks running). For some reason, killing instances takes a long time.

One way to speed up a Threshold deploy when load is low is to manually kill tasks; ECS will start new tasks to replace them. This lets you go below 100% of desired tasks, which may be fine if load is low. If load is high, first increase the number of extra EC2 instances in the autoscaling group in CDK, so that each deploy batch is bigger and the full deploy completes in fewer steps.
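
To get a feel for how the number of extra instances affects deploy time, the batch count can be estimated with a ceiling division. This is a rough sketch that assumes each batch replaces exactly as many tasks as there are extra instances, which simplifies what ECS actually does; `deploy_batches` is a hypothetical helper name.

```shell
# Estimate the number of deploy batches: ceil(desired_tasks / extra_instances).
# $1 = desired task count, $2 = extra instances available per batch (> 0).
deploy_batches() {
  desired="$1"
  extra="$2"
  if [ "$extra" -le 0 ]; then
    echo "extra instances must be greater than 0" >&2
    return 1
  fi
  echo $(( (desired + extra - 1) / extra ))
}

deploy_batches 100 10   # prints 10 (10% extra capacity)
deploy_batches 100 20   # prints 5  (20% extra capacity)
```

Under this assumption, doubling the extra capacity roughly halves the number of batches.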

#### Deploy/Restart Pathfinder

Pathfinder is designed to work with an odd number of instances (usually 3) so election votes (to pick a Source for a new address) can be resolved.

All other services are configured to keep between 100% and 200% of the desired number of tasks. This means that deploys add new tasks first, and then start stopping old tasks. Pathfinder is different: it is configured for 66% to 100% (2 or 3 instances).

Election votes (to decide which Source instance keeps a new address) are resolved by majority: a cluster of 2 requires 2 votes, a cluster of 3 requires 2 votes, a cluster of 4 requires 3 votes, etc. However, the system is currently configured to require exactly 2 votes, so for now we can only have 2 or 3 instances. No more, no less.
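
The vote counts above follow the usual majority rule, `floor(n / 2) + 1`. A small sketch to check which cluster sizes are compatible with the fixed 2-vote setting (the helper name `majority_votes` is hypothetical):

```shell
# Votes needed for a majority in a cluster of $1 instances: floor(n / 2) + 1.
majority_votes() {
  echo $(( $1 / 2 + 1 ))
}

for n in 2 3 4 5; do
  echo "cluster of $n requires $(majority_votes "$n") votes"
done
# prints:
# cluster of 2 requires 2 votes
# cluster of 3 requires 2 votes
# cluster of 4 requires 3 votes
# cluster of 5 requires 3 votes
```

Only cluster sizes 2 and 3 yield a majority of 2, which matches the constraint described above.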

During a deploy or restart, the Pathfinder ECS service kills one instance first (down to 2), and then adds a new one (back to 3). It will do this 3 times until all 3 instances have been replaced.

#### Fargate: Add more capacity

To add more instances to a Fargate service, update the desired number of tasks for the service and Fargate will provision the new instances as needed. Please do this from the CDK definition so the CloudFormation stacks stay in sync.

#### ECS/Ec2: Add more capacity

To add more EC2 tasks, you first need to provision more EC2 instances. If you are updating capacity from CDK, this is already implemented in the stack logic, but if you are updating capacity from the AWS console, you have to go to EC2 first:

 1. AWS Ec2 > AutoScaling Groups > Threshold Service Cluster.
 2. Actions > Edit, and update Desired Capacity, Min, and Max to the new desired number of tasks + 1. One extra instance is required to allow deploy rollouts.
 3. AWS ECS > Clusters > Threshold Service Cluster > Threshold Ec2 Service
 4. Update > Number of tasks, and set it to the desired number (number of EC2 instances - 1).

Please do this from the CDK definition (it is also a lot simpler), so the CloudFormation stacks stay in sync.
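
For reference, the arithmetic and ordering of the console steps can be sketched as the equivalent AWS CLI calls. The function below only prints the commands; `plan_threshold_capacity` and the `EXAMPLE_*` names are hypothetical placeholders (the real autoscaling group, cluster, and service names must be looked up), and the CDK route remains the preferred one.

```shell
# Print (do not run) the AWS CLI calls for a target number of Threshold tasks.
# $1 = desired number of tasks. All EXAMPLE_* names are placeholders.
plan_threshold_capacity() {
  tasks="$1"
  instances=$(( tasks + 1 ))   # one spare EC2 instance for deploy rollouts
  # Step 1: grow the autoscaling group first.
  printf 'aws autoscaling update-auto-scaling-group --auto-scaling-group-name %s --min-size %d --max-size %d --desired-capacity %d\n' \
    "EXAMPLE_THRESHOLD_ASG" "$instances" "$instances" "$instances"
  # Step 2: then raise the ECS service task count.
  printf 'aws ecs update-service --cluster %s --service %s --desired-count %d\n' \
    "EXAMPLE_THRESHOLD_CLUSTER" "EXAMPLE_THRESHOLD_SERVICE" "$tasks"
}

plan_threshold_capacity 20
```

For 20 tasks this prints an autoscaling update to 21 instances followed by a service update to 20 tasks.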

#### Threshold is close to 100% CPU

Go to AWS Fargate and manually stop the overloaded task. The load is then redistributed evenly across all other Thresholds. If most Thresholds are close to 100%, scale the service up first (e.g. by 2x) before stopping the overloaded task, so that the others can handle the redistributed load.
