# Alarms and Possible Causes

Alarm configurations are defined in Sauron's Terraform [here](https://git.xarth.tv/cb/sauron/tree/master/terraform/modules/cloudwatch).

All alarms in CloudWatch can be found [here](https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2#alarmsV2:?search=cb-sauron&alarmFilter=ALL).
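
To check alarm state from a terminal instead of the console, the following is a minimal sketch, assuming `boto3` is installed and AWS credentials for the account are configured. The helper names `alarms_in_state` and `print_firing_alarms` are hypothetical; the `cb-sauron` prefix and `us-west-2` region are taken from the console link above.

```python
def alarms_in_state(client, prefix="cb-sauron", state="ALARM"):
    """Return sorted names of metric alarms matching a name prefix and state.

    `client` is expected to be a boto3 CloudWatch client (or anything that
    exposes the same `get_paginator("describe_alarms")` interface).
    """
    paginator = client.get_paginator("describe_alarms")
    names = []
    for page in paginator.paginate(AlarmNamePrefix=prefix, StateValue=state):
        names.extend(alarm["AlarmName"] for alarm in page.get("MetricAlarms", []))
    return sorted(names)


def print_firing_alarms():
    """Example entry point; requires boto3 and valid AWS credentials."""
    import boto3  # assumed to be installed; not part of the standard library

    cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")
    for name in alarms_in_state(cloudwatch):
        print(name)
```

Calling `print_firing_alarms()` prints one alarm name per line for every `cb-sauron` alarm currently in the `ALARM` state.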

## Table of Contents
1. [Elastic Beanstalk](#elastic-beanstalk)
2. [Load Balancer](#load-balancer)
3. [Lambda](#lambda)

## Elastic Beanstalk

#### cb-sauron-production-api-health

|               |   |
|---------------|----------------------------------------------------------------------|
| **Trigger**   | Beanstalk application health is `degraded` or `severe` for 2 minutes |
| **Causes**    | Instances are down or returning a high rate of 5xx responses |
| **Resolution**| Check beanstalk health page, rollbar logs, and instance logs. Possibly add more instances or reboot instances |

#### cb-sauron-production-api-avg-latency

|               |   |
|---------------|----------------------------------------------------------------------|
| **Trigger**   | Average latency from instances is higher than 1s for 20 minutes |
| **Causes**    | Instances are down, overworked, or bottlenecked |
| **Resolution**| Check beanstalk health page, rollbar logs, and instance logs. Check grafana for endpoint timings |

#### cb-sauron-production-api-avg-cpu

|               |   |
|---------------|----------------------------------------------------------------------|
| **Trigger**   | Average CPU usage is higher than 80% for 20 minutes |
| **Causes**    | Instances are overworked or undersized |
| **Resolution**| Check instance health, add more instances or size up current instances |
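
The latency and CPU triggers in this section share one shape: a metric stays above a threshold for a sustained window. As a rough illustration of that logic (not the actual CloudWatch evaluation, whose period and datapoint settings live in the Terraform module), the sketch below assumes one datapoint per minute. `sustained_breach` is a hypothetical helper, and the sample values are made up.

```python
def sustained_breach(datapoints, threshold, minutes):
    """True if the most recent `minutes` datapoints are all above `threshold`.

    Assumes `datapoints` is ordered oldest to newest, one value per minute.
    """
    if minutes <= 0 or len(datapoints) < minutes:
        return False  # not enough data to cover the window
    return all(value > threshold for value in datapoints[-minutes:])


# Average CPU trigger: higher than 80% for 20 minutes.
cpu = [75.0] * 5 + [91.0] * 20
print(sustained_breach(cpu, threshold=80.0, minutes=20))  # True

# Average latency trigger: higher than 1s for 20 minutes.
# A single datapoint at or below the threshold breaks the streak.
latency = [1.4] * 10 + [0.6] + [1.4] * 9
print(sustained_breach(latency, threshold=1.0, minutes=20))  # False
```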

## Load Balancer

#### cb-sauron-production-api-elb-latency

|               |   |
|---------------|----------------------------------------------------------------------|
| **Trigger**   | Average latency from ELB is longer than 500ms for 20 minutes |
| **Causes**    | High request load or not enough instances |
| **Resolution**| Determine cause of request load. Add more instances if needed |

#### cb-sauron-production-api-elb-5xx

|               |   |
|---------------|----------------------------------------------------------------------|
| **Trigger**   | More than 500 5XX errors are returned from the ELB in 2 minutes |
| **Causes**    | ELB is down or cannot handle request load |
| **Resolution**| Determine cause of request load. Potentially reboot ELB instance or add more instances |

#### cb-sauron-production-api-spillover

|               |   |
|---------------|----------------------------------------------------------------------|
| **Trigger**   | ELB spillover queue has length over 2500 items for 2 minutes |
| **Causes**    | Request rate has spiked and ELB cannot route incoming requests to instances |
| **Resolution**| Add more instances until the request rate has lowered |

#### cb-sauron-production-api-backend-5XX

|               |   |
|---------------|----------------------------------------------------------------------|
| **Trigger**   | The ELB is reporting that instances are returning more than 500 5XX errors in 2 minutes |
| **Causes**    | Instances are down or returning a high rate of 5xx responses |
| **Resolution**| Check beanstalk health page, rollbar logs, and instance logs. Possibly add more instances or reboot instances |

## Lambda

Each activity type's lambda has its own set of alarms. The alarms follow the same structure for every activity type, which is documented here; `$activityType` stands in for the activity type name.

#### cb-sauron-production-$activityType-lambda-duration

|               |   |
|---------------|----------------------------------------------------------------------|
| **Trigger**   | Average lambda execution time is greater than 5s over 10 minutes |
| **Causes**    | Lambdas are bottlenecked, dependencies are down or hanging, db writes are taking too long |
| **Resolution**| Check dependency health, check dynamodb health, check number of retries and concurrent executions |

#### cb-sauron-production-$activityType-lambda-errors

|               |   |
|---------------|----------------------------------------------------------------------|
| **Trigger**   | 10 or more lambda errors have occurred over the last 2 minutes |
| **Causes**    | Lambdas cannot write to dynamo or cannot properly read a message  |
| **Resolution**| Check lambda logs, check rollbar logs, check lambda health |

#### cb-sauron-production-$activityType-lambda-invocations

|               |   |
|---------------|----------------------------------------------------------------------|
| **Trigger**   | Lambda has not been invoked for 10 minutes |
| **Causes**    | SNS topic subscription no longer works or the SQS queue isn't working |
| **Resolution**| Check permissions and existence of the SNS subscription, check SQS queue health |
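
Because every activity type gets the same three alarms, the alarm names to search for can be derived from the activity type. A minimal sketch of that naming pattern follows; `lambda_alarm_names` is a hypothetical helper, and `follow` is only a placeholder activity type, not a name taken from the Terraform module.

```python
# Suffixes of the three per-lambda alarms documented in this section.
LAMBDA_ALARM_SUFFIXES = ("duration", "errors", "invocations")


def lambda_alarm_names(activity_type, environment="production"):
    """Expand the cb-sauron-production-$activityType-lambda-* pattern."""
    return [
        f"cb-sauron-{environment}-{activity_type}-lambda-{suffix}"
        for suffix in LAMBDA_ALARM_SUFFIXES
    ]


for name in lambda_alarm_names("follow"):  # "follow" is a placeholder
    print(name)
# cb-sauron-production-follow-lambda-duration
# cb-sauron-production-follow-lambda-errors
# cb-sauron-production-follow-lambda-invocations
```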
