# Past Alarms

Current CloudWatch alarms are defined in the [CloudWatch Terraform module](https://git.xarth.tv/cb/achievements/blob/master/terraform/modules/pagerduty/alarm.tf).

## Beanstalk
`Maximum EnvironmentHealth GreaterThanOrEqualToThreshold 20.0`

This alarm indicates the Beanstalk environment health is Degraded or Severe (an `EnvironmentHealth` value of 20 or higher). The quickest way to diagnose a problem is to check the Beanstalk [health](https://us-west-2.console.aws.amazon.com/elasticbeanstalk/home?region=us-west-2#/application/overview?applicationName=cb-achievements-production) page. Look for any unhealthy instances and the messages attached to them. Unhealthy instances can be terminated manually, and the auto scaler will spin up replacements.
If the health page does not contain enough information, download the logs from the instances in each application. Rollbar logs can also be checked.
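
For reading the threshold in the alarm above, the sketch below maps `EnvironmentHealth` metric values to health statuses, using the values listed in the AWS enhanced health documentation. The helper name `statuses_at_or_above` is hypothetical and only illustrates which statuses a given threshold covers.

```python
# Numeric values Elastic Beanstalk publishes for the EnvironmentHealth metric.
ENVIRONMENT_HEALTH = {
    0: "Ok",
    1: "Info",
    5: "Unknown",
    10: "No data",
    15: "Warning",
    20: "Degraded",
    25: "Severe",
}


def statuses_at_or_above(threshold):
    """Return the statuses that trip a GreaterThanOrEqualToThreshold alarm."""
    return [
        name
        for value, name in sorted(ENVIRONMENT_HEALTH.items())
        if value >= threshold
    ]


print(statuses_at_or_above(20.0))  # ['Degraded', 'Severe']
```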

## Redshift
`Average QueriesCompletedPerSecond LessThanOrEqualToThreshold 0.0`

This indicates that no queries are completing in Redshift. Either no new queries are being submitted, or a single query is choking the cluster and preventing all other queries from finishing.

Terminating long-running Redshift queries from the console or [from the CLI](https://docs.aws.amazon.com/redshift/latest/dg/cancel_query.html) can free up resources. As a last resort, rebooting the cluster should cancel all running queries and refresh the state of the cluster.
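
As a rough illustration of the manual cancellation flow, the sketch below builds the two SQL statements involved: one that lists running queries from the `STV_RECENTS` system table and one that cancels a query by process ID. The helper names and the 30-minute cutoff are assumptions made for this example; the statements would be run through whatever SQL client is already used against the cluster.

```python
def running_queries_sql(min_minutes):
    """SQL listing queries that have been running longer than min_minutes.

    STV_RECENTS reports duration in microseconds.
    """
    min_microseconds = int(min_minutes * 60 * 1_000_000)
    return (
        "SELECT pid, user_name, starttime, duration, query "
        "FROM stv_recents "
        f"WHERE status = 'Running' AND duration > {min_microseconds} "
        "ORDER BY duration DESC;"
    )


def cancel_sql(pid):
    """SQL cancelling the query owned by the given process ID."""
    return f"CANCEL {int(pid)};"


# Hypothetical usage: list queries older than 30 minutes, then cancel one.
print(running_queries_sql(30))
print(cancel_sql(18764))  # 18764 is a made-up pid
```

Casting `pid` to `int` keeps arbitrary text out of the generated statement.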

`Average PercentageDiskSpaceUsed GreaterThanOrEqualToThreshold 95.0`

The Redshift cluster is running out of disk space. The cause is not necessarily insufficient allocated storage; it could also be temporary storage used while running larger queries (such as the quest metrics query). Restarting the cluster can help remove temporary files and free up disk space.

Existing queries can be terminated to free up temporary resources. If the cluster is truly out of allocated storage, upgrade it to a bigger size from the console. This can put the cluster into read-only mode for several hours. Update Terraform to match after the fact, so that the cluster isn't downgraded by accident. Terraform probably can't be applied to the cluster while it is not in a ready state, which is why the resize itself should be done manually.

`Maximum QueryDuration GreaterThanOrEqualToThreshold 1800000000.0 for ClusterIdentifier cb-achievements-production-tahoe-replica`

Queries are taking too long to run on this cluster. This is almost certainly the quest_metrics query.

Queries may become zombies or spin for very long periods on Redshift. If this alarm does not auto-resolve, terminate long-running queries manually from the console or CLI. The query will be picked up again by the next scheduled cron job.
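
The `QueryDuration` metric is reported in microseconds, so the threshold in the alarm above is easier to read once converted. A minimal sketch of that arithmetic follows; the `exceeds_threshold` helper is hypothetical and only shows how a query's duration compares against the alarm value.

```python
from datetime import timedelta

# Threshold taken from the alarm definition, in microseconds.
THRESHOLD_MICROSECONDS = 1_800_000_000


def exceeds_threshold(duration_microseconds, threshold=THRESHOLD_MICROSECONDS):
    """True when a query duration would trip the alarm."""
    return duration_microseconds >= threshold


print(timedelta(microseconds=THRESHOLD_MICROSECONDS))  # 0:30:00
print(exceeds_threshold(45 * 60 * 1_000_000))  # True: a 45-minute query
```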

## RDS
### Storage Space
This alarm indicates that the RDS cluster is running out of storage space.

The quickest fix is to allocate more storage from the AWS console. We are still able to read from the cluster during this stage. The allocated space may be consumed by temp space from large or long-running queries. If space continues to run out after re-allocating, check for long-running queries and terminate them manually.

### CPU
If the RDS CPU is spiking or remains high, the cluster should be resized to accommodate higher traffic. In this case the resizing can be done from the AWS console or from Terraform.

### Read/Write IOPS
If the read or write IOPS of the cluster stay too high for a sustained period, it could mean that jobs are overlapping with each other or have grown too large for the cluster to handle.

Provisioned IOPS should be increased through the AWS console or through Terraform. This can be a temporary change if IOPS drop back to normal later on. The root cause is likely the size of a batch job. Investigate metrics through the Beanstalk logs and the achievements dashboard.

## DynamoDB

### cb-achievements-production-quest_progress_low_writes

This alarm summary is "Sum ConsumedWriteCapacityUnits LessThanOrEqualToThreshold 0.0 for TableName cb-achievements-production-quest_progress". It fires when nothing is being written to the table. Writes to the quest_progress table depend on the cb-achievements-production-tahoe-replica Redshift cluster.

AWS periodically puts Redshift in maintenance mode. Check the [Redshift Dashboard](https://tiny.amazon.com/1hy788o8f/IsenLink) to see if cb-achievements-production-tahoe-replica is in maintenance mode.

## SQS

### cb-achievements-production-worker-queue-delayed

The cb-achievements-production-worker Beanstalk Worker environment is not keeping up with the volume of messages. You can try increasing the minimum and maximum instance counts for the environment, but be careful: more workers put more load on RDS, and too large an increase will trigger the RDS IOPS alarm.
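
One way to change the instance counts outside the console is through the environment's option settings. The sketch below builds the `aws:autoscaling:asg` settings for `MinSize` and `MaxSize`; the helper name and the example counts are assumptions, and the commented-out `boto3` call is shown only to indicate where the settings would go. If the environment's scaling is managed in Terraform, the same change would need to be mirrored there.

```python
def scaling_option_settings(min_instances, max_instances):
    """Build Beanstalk option settings for the Auto Scaling group size."""
    if min_instances > max_instances:
        raise ValueError("min_instances must not exceed max_instances")
    namespace = "aws:autoscaling:asg"
    return [
        {"Namespace": namespace, "OptionName": "MinSize", "Value": str(min_instances)},
        {"Namespace": namespace, "OptionName": "MaxSize", "Value": str(max_instances)},
    ]


# Hypothetical counts; choose values the RDS cluster can absorb.
settings = scaling_option_settings(4, 8)
print(settings)

# Applying the settings requires boto3 and AWS credentials (not run here):
# import boto3
# client = boto3.client("elasticbeanstalk", region_name="us-west-2")
# client.update_environment(
#     EnvironmentName="cb-achievements-production-worker",
#     OptionSettings=settings,
# )
```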
