## Roster Runbook
This runbook describes the components of Roster and how to operate them.

[twitch-cb-aws account](https://twitch-cb-aws.signin.aws.amazon.com/console)

[PagerDuty](https://twitchoncall.pagerduty.com/services/PINHF56)

[Service Catalog](https://status.internal.justin.tv/services/297)

[SLA](./service_level_agreement.md)

[Github](https://git.xarth.tv/cb/roster)

[clean-deploy](https://clean-deploy.internal.justin.tv/#/cb/roster)

### Deploying and Rollbacks
See the deployment [docs](https://git.xarth.tv/cb/roster/tree/master/docs/deployment.md) for help with deploying and rolling back.

### Monitoring and Errors
Errors are logged to [Rollbar](https://rollbar.com/Twitch/CB_Roster/items/).

There is a Grafana dashboard [here](https://grafana.internal.justin.tv/d/000001760/cb-roster).

Beanstalk apps can also be monitored from the AWS console.
1. [API](https://us-west-2.console.aws.amazon.com/elasticbeanstalk/home?region=us-west-2#/environment/dashboard?applicationName=cb-roster-production&environmentId=e-7ksqjmvq2m)
2. [Worker](https://us-west-2.console.aws.amazon.com/elasticbeanstalk/home?region=us-west-2#/environment/dashboard?applicationName=cb-roster-production&environmentId=e-dwsar9r923)

### PagerDuty
[PagerDuty](https://twitchoncall.pagerduty.com/services/PINHF56)

PagerDuty is integrated with both CloudWatch and Rollbar. Pages are sent based on error rates and application health.

### Terraform
Infrastructure is configured with Terraform. The configuration files live in the [git repo](https://git.xarth.tv/cb/roster/tree/master/terraform). Changes should be applied to both staging and production, and committed to the repo through a pull request.

### Elastic Beanstalk
Roster runs on Elastic Beanstalk, with separate staging and production environments.
1. [production API](https://us-west-2.console.aws.amazon.com/elasticbeanstalk/home?region=us-west-2#/environment/dashboard?applicationName=cb-roster-production&environmentId=e-7ksqjmvq2m)
2. [staging API](https://us-west-2.console.aws.amazon.com/elasticbeanstalk/home?region=us-west-2#/environment/dashboard?applicationName=cb-roster-staging&environmentId=e-pt3mmb8vmx)

Roster also has a worker tier environment in Beanstalk, which listens for SNS messages and processes them. This lets Roster react to other events on the site that require changes to a channel's editors.
1. [production worker](https://us-west-2.console.aws.amazon.com/elasticbeanstalk/home?region=us-west-2#/environment/dashboard?applicationName=cb-roster-production&environmentId=e-dwsar9r923)
2. [staging worker](https://us-west-2.console.aws.amazon.com/elasticbeanstalk/home?region=us-west-2#/environment/dashboard?applicationName=cb-roster-staging&environmentId=e-xvwbmeric3)
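
To illustrate the kind of message the worker tier handles, here is a minimal sketch of unwrapping an SNS notification. It assumes the standard SNS envelope (`Type`, `Message`) and that the publisher put JSON in `Message`. The inner payload fields (`event`, `channel_id`) and the topic ARN are hypothetical placeholders, not Roster's actual schema.

```python
import json


def parse_sns_notification(body: str) -> dict:
    """Unwrap an SNS notification envelope and return the decoded inner message.

    "Type" and "Message" are standard SNS envelope fields. The inner payload
    shape is a hypothetical example, not Roster's actual schema.
    """
    envelope = json.loads(body)
    if envelope.get("Type") != "Notification":
        raise ValueError(f"unexpected SNS message type: {envelope.get('Type')!r}")
    # "Message" is always a string; this sketch assumes it contains JSON.
    return json.loads(envelope["Message"])


if __name__ == "__main__":
    # Hypothetical sample message for local experimentation.
    sample = json.dumps({
        "Type": "Notification",
        "TopicArn": "arn:aws:sns:us-west-2:000000000000:example-topic",
        "Message": json.dumps({"event": "example_event", "channel_id": "12345"}),
    })
    print(parse_sns_notification(sample))
```

Rejecting anything other than `Notification` keeps subscription-confirmation messages from being treated as events.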

### RDS
Roster is backed by a PostgreSQL database with a single replica instance.
1. [production](https://us-west-2.console.aws.amazon.com/rds/home?region=us-west-2#dbinstance:id=cb-roster-production) and [production replica](https://us-west-2.console.aws.amazon.com/rds/home?region=us-west-2#dbinstance:id=cb-roster-production-replica).
2. [staging](https://us-west-2.console.aws.amazon.com/rds/home?region=us-west-2#dbinstance:id=cb-roster-staging) and [staging replica](https://us-west-2.console.aws.amazon.com/rds/home?region=us-west-2#dbinstance:id=cb-roster-staging-replica).

Configuration for these databases can be found in the Terraform files mentioned in the Terraform section above.

### ElastiCache
Roster makes heavy use of caching, particularly for calls to get a channel's team membership. It uses Redis on ElastiCache.
1. [production cache](https://us-west-2.console.aws.amazon.com/elasticache/home?region=us-west-2#redis-nodes:id=cb-roster-production;clusters=cb-roster-production)
2. [staging cache](https://us-west-2.console.aws.amazon.com/elasticache/home?region=us-west-2#redis-nodes:id=cb-roster-staging;clusters=cb-roster-staging)

### CloudWatch
API application health, worker application health, and primary RDS disk space are monitored through CloudWatch.
1. [API application health alarm](https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2#alarm:alarmFilter=ANY;name=cb-roster-production-api-health)
2. [Worker application health alarm](https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2#alarm:alarmFilter=ANY;name=cb-roster-production-worker-health)
3. [Primary RDS disk space alarm](https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2#alarm:alarmFilter=ANY;name=cb-roster-production-primary-rds-disk-space)
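
The same alarms can be checked from a script instead of the console. The following is a minimal sketch, assuming boto3 is installed and credentials for the twitch-cb-aws account are available. The `cb-roster-production` prefix is taken from the alarm names in the links above, and both function names are illustrative.

```python
from typing import List


def alarms_in_alarm_state(response: dict) -> List[str]:
    """Return the names of alarms in the ALARM state from a describe_alarms response."""
    return [
        alarm["AlarmName"]
        for alarm in response.get("MetricAlarms", [])
        if alarm.get("StateValue") == "ALARM"
    ]


def fetch_roster_alarms(prefix: str = "cb-roster-production") -> dict:
    """Query CloudWatch for alarms whose names start with the given prefix."""
    import boto3  # imported lazily so the helper above works without boto3

    client = boto3.client("cloudwatch", region_name="us-west-2")
    return client.describe_alarms(AlarmNamePrefix=prefix)


if __name__ == "__main__":
    print(alarms_in_alarm_state(fetch_roster_alarms()))
```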

### Past Alarms
`Alert: Maximum EnvironmentHealth GreaterThanOrEqualToThreshold 20.0`

This alarm indicates that the Roster API environment is in a degraded or severe state (an EnvironmentHealth value of 20 or higher). The quickest way to diagnose the problem is to check the Beanstalk [health](https://us-west-2.console.aws.amazon.com/elasticbeanstalk/home?region=us-west-2#/environment/health?applicationName=cb-roster-production&environmentId=e-7ksqjmvq2m) page. Look for any unhealthy instances and the causes reported for them. Unhealthy instances can be terminated manually, and the auto scaler will spin up replacements.

If the health page does not contain enough information, logs can be downloaded from the roster instances [here](https://us-west-2.console.aws.amazon.com/elasticbeanstalk/home?region=us-west-2#/environment/logs?applicationName=cb-roster-production&environmentId=e-7ksqjmvq2m).
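
The health page check can also be scripted. The sketch below assumes boto3, valid credentials, and that enhanced health reporting is enabled on the environment. The environment ID is the production API ID from the links above, and the function names are illustrative.

```python
from typing import Dict, List

# Statuses treated as unhealthy in this sketch; compared case-insensitively.
UNHEALTHY_STATUSES = {"warning", "degraded", "severe"}


def unhealthy_instances(response: dict) -> Dict[str, List[str]]:
    """Map each unhealthy instance ID to the causes Beanstalk reports for it."""
    result = {}
    for instance in response.get("InstanceHealthList", []):
        status = str(instance.get("HealthStatus", "")).lower()
        if status in UNHEALTHY_STATUSES:
            result[instance["InstanceId"]] = instance.get("Causes", [])
    return result


def fetch_instance_health(environment_id: str = "e-7ksqjmvq2m") -> dict:
    """Fetch per-instance health for the production API environment."""
    import boto3  # imported lazily so the helper above works without boto3

    client = boto3.client("elasticbeanstalk", region_name="us-west-2")
    return client.describe_instances_health(
        EnvironmentId=environment_id,
        AttributeNames=["HealthStatus", "Color", "Causes"],
    )


if __name__ == "__main__":
    for instance_id, causes in unhealthy_instances(fetch_instance_health()).items():
        print(instance_id, causes)
```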
 
### Scaling Quickly

In an emergency, you can use the console link below to manually edit parameters to scale up the API.

**Elastic Beanstalk**: [IsenLink](https://tiny.amazon.com/17w73jv7p/IsenLink).
- In Configuration, under Capacity, hit "Modify".
- Update Instances "Max" to the desired value.
- Hit "Apply" at the bottom of the page.

**Make sure you update the [Terraform](https://git.xarth.tv/cb/roster/blob/master/terraform/production/main.tf#L24) parameters for the above ASAP!**

Otherwise the manual changes are liable to get blown away the next time Terraform is run.
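
The console steps above can also be approximated from a script. This is a hedged sketch rather than a supported procedure: it assumes boto3 and valid credentials, uses the production API environment ID from the links in this runbook, and the function names are illustrative. The Terraform parameters still need updating afterwards.

```python
from typing import Dict, List


def max_size_option(max_instances: int) -> List[Dict[str, str]]:
    """Build the Beanstalk option setting that raises the Auto Scaling group maximum."""
    if max_instances < 1:
        raise ValueError("max_instances must be at least 1")
    return [{
        "Namespace": "aws:autoscaling:asg",
        "OptionName": "MaxSize",
        "Value": str(max_instances),  # option values are passed as strings
    }]


def scale_api(max_instances: int, environment_id: str = "e-7ksqjmvq2m") -> dict:
    """Apply the new maximum to the environment (the equivalent of hitting "Apply")."""
    import boto3  # imported lazily so the helper above works without boto3

    client = boto3.client("elasticbeanstalk", region_name="us-west-2")
    return client.update_environment(
        EnvironmentId=environment_id,
        OptionSettings=max_size_option(max_instances),
    )
```

For example, `scale_api(20)` would raise the instance maximum to 20, a value chosen here purely for illustration.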

### Dependencies
Roster calls the User service, for example to fetch a channel to verify that it exists or to set the PrimaryTeamID for a channel. If the User service is down or experiencing failures, Roster endpoints may return more errors, which reduces Roster's availability.