# MDaaS On Call Runbook

## On Call Preparation

 * General Preparation: https://wiki.twitch.com/display/DS/Oncall+Preparation
 * AWS accounts: `twitch-mdaas-ingest-dev` and `twitch-mdaas-ingest-prod`
 * Slack channel: `#mdaas-alerts`

## Steps
- Check the dashboards: in the AWS account, open the CloudWatch (CW) dashboard "MDaaS-Overview" and look for outliers.
- Check [Rollbar](https://rollbar.com/Twitch/MDaaS-Ingest/) for new errors.
- Check the CW logs and search for errors.
- Pin down the error source. If it is an AWS infra failure, navigate to the affected resource and check its detailed monitoring dashboard and logs. If it is a code defect, run MDaaS locally and find the faulty line.
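
As a rough aid for the last two steps, the sketch below counts error lines from an exported chunk of CW logs and guesses whether each one looks like an infra failure or a code defect. The patterns, the hint strings, and the `triage` helper are illustrative assumptions, not what MDaaS actually logs; adjust them to the real log format before relying on the result.

```python
import re
from collections import Counter

# Assumed markers of an error line; MDaaS may log errors differently.
ERROR_PATTERN = re.compile(r"\b(ERROR|FATAL|panic)\b")

# Assumed substrings that hint at an AWS-side problem rather than our code.
INFRA_HINTS = ("throttl", "provisionedthroughputexceeded", "serviceunavailable", "timeout")


def triage(lines):
    """Count error lines, split into a guessed 'infra' or 'code' bucket."""
    counts = Counter()
    for line in lines:
        if not ERROR_PATTERN.search(line):
            continue  # not an error line, skip it
        lowered = line.lower()
        kind = "infra" if any(hint in lowered for hint in INFRA_HINTS) else "code"
        counts[kind] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "INFO connection accepted",
        "ERROR kinesis: ProvisionedThroughputExceededException",
        "ERROR nil pointer dereference in handler",
    ]
    print(dict(triage(sample)))
```

A high "infra" count suggests starting with the AWS resource dashboards; a high "code" count suggests starting with Rollbar and a local run.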

## Recipes
More recipes should be added after oncall meetings whenever actual MDaaS issues are encountered.

### Kinesis throughput throttled
If Kinesis is throttled, there is nothing we can do about it immediately. Create a ticket, notify users, and reshard the stream (scale up). Resharding Kinesis disconnects all users, and they have to reconnect to the updated shard for their specific broadcaster info.
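
When filing the resharding ticket, it helps to propose a target shard count. The sketch below estimates one from peak write traffic, assuming a provisioned-mode stream and the per-shard write limits AWS documents (about 1 MB/s and 1,000 records/s). It only covers write-side throttling; the `required_shards` helper and the 25% headroom are illustrative assumptions, and the peak numbers should come from the stream's CW metrics.

```python
import math

# Per-shard write limits for provisioned-mode streams, per AWS documentation.
# 1 MB/s is used here as a conservative figure.
SHARD_WRITE_BYTES_PER_SEC = 1_000_000
SHARD_WRITE_RECORDS_PER_SEC = 1_000


def required_shards(peak_bytes_per_sec, peak_records_per_sec, headroom=1.25):
    """Estimate the shard count needed to absorb peak write traffic.

    headroom is an assumed safety factor (1.25 = 25% spare capacity).
    """
    by_bytes = peak_bytes_per_sec * headroom / SHARD_WRITE_BYTES_PER_SEC
    by_records = peak_records_per_sec * headroom / SHARD_WRITE_RECORDS_PER_SEC
    # Whichever limit is hit first determines the shard count.
    return max(1, math.ceil(max(by_bytes, by_records)))


if __name__ == "__main__":
    # 3 MB/s and 500 records/s at peak: byte throughput is the bottleneck.
    print(required_shards(3_000_000, 500))  # 4
```

If throttling shows up on the read side instead, the same approach applies with the read limits of the consumers in place of the write limits.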

### CPU too high
High CPU is usually caused by a traffic burst that arrives faster than our application can scale up. In this case, manually over-scale to redistribute traffic. A deployment might be needed so that clients reconnect from the overheated old boxes to the new ones.
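
To pick a number for the manual over-scale, a back-of-the-envelope estimate like the one below can help. It assumes CPU load is roughly proportional to traffic and spreads evenly across boxes once clients reconnect. The `desired_instances` helper, the 50% target, and the 1.5x over-scale factor are all illustrative assumptions, not MDaaS settings.

```python
import math


def desired_instances(current_instances, current_cpu_pct,
                      target_cpu_pct=50.0, overscale=1.5, max_instances=None):
    """Estimate how many instances to scale to during a CPU spike.

    target_cpu_pct, overscale, and max_instances are assumed knobs;
    set them from the real scaling policy and account limits.
    """
    # Instances needed to bring average CPU down to the target.
    needed = current_instances * current_cpu_pct / target_cpu_pct
    # Deliberately overshoot so the burst does not outrun us again.
    desired = math.ceil(needed * overscale)
    # Never suggest scaling in during an incident.
    desired = max(desired, current_instances)
    if max_instances is not None:
        desired = min(desired, max_instances)
    return desired


if __name__ == "__main__":
    # 10 boxes at 90% CPU, aiming for 50%: 18 needed, 27 with over-scale.
    print(desired_instances(10, 90))  # 27
```

Because existing connections stay on the old boxes, the new capacity only helps once clients reconnect, which is why a deployment may still be needed afterwards.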

### SNS publish failures
Check Rollbar for the specific lines that are erroring. SNS publish failures may be due to either a code issue or a temporary issue on the SNS side. If it is a temporary SNS issue, there is not much we can do; wait a bit to see if it calms down.

### More to be added from oncall meetings

## Alerts
TODO
