# Runbook


[Following Service Grafana](https://grafana.xarth.tv/dashboard/db/following-service)

[Cloudwatch Logs](https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2#logs-insights:queryDetail=~\(end~0~start~-3600~timeType~'RELATIVE~unit~'seconds~editorString~'fields*20*40timestamp*2c*20*40message*0a*7c*20sort*20*40timestamp*20desc*0a*7c*20limit*2020~isLiveTail~false~queryId~'eadd5271-f945-4aca-a58d-5ff8db62fbad~source~\(~'*2ffollowing-service*2fproduction*2fapplication\)\))

**Note: For any unexpected issue, check #site-production to see whether there is already an ongoing issue there.**

### following-service-production-app_surge
* The ELB is receiving more requests than the backend servers can handle (CPU too high, too many connections, etc.)
* Resolution: Scale up the backend following-service instances to absorb the load

### following-service-production-app_spillover
* So many requests are arriving at the ELB that the surge queue has filled and further requests are being dropped. The dropped requests receive 503 errors
* Resolution: Scale up the backend following-service instances to absorb the load
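
The scale-up step for the surge and spillover alarms can be sketched as a small capacity calculation. This is a minimal sketch, not the team's procedure: the function name `scaled_capacity`, the 1.5x growth factor, and the example sizes are assumptions made for illustration.

```python
import math


def scaled_capacity(current: int, max_size: int, factor: float = 1.5) -> int:
    """Return a larger desired instance count, capped at the group's maximum.

    ASSUMPTION: the 1.5x default factor is illustrative, not a team policy.
    """
    if current < 0 or max_size < current:
        raise ValueError("expected 0 <= current <= max_size")
    if factor <= 1.0:
        raise ValueError("factor must be greater than 1 to scale up")
    # Round up, and always ask for at least one additional instance.
    target = max(current + 1, math.ceil(current * factor))
    # Never exceed the maximum size; if already at the maximum, this returns
    # max_size unchanged and the limit itself has to be raised first.
    return min(target, max_size)


# Hypothetical group sizes, for illustration only.
print(scaled_capacity(current=10, max_size=40))  # 15
print(scaled_capacity(current=30, max_size=40))  # 40 (capped at max_size)
```

The result could then be applied as the desired capacity through the Auto Scaling console or the `SetDesiredCapacity` API, assuming the service runs in an Auto Scaling group (the `asg-low-cpu` alarm name suggests it does).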

### following-service-production-app_backend_conn_error
* The ELB and the backend instances are unable to set up connections with each other
* Resolution: Check whether the affected instances are healthy, rotate them if needed, and check whether a recent change broke something

### following-service-production-app_elb_5xx
* Backend instances are returning too many 5xx responses
* Resolution: Investigate the root cause of the 5xx responses using the logs (CloudWatch Logs Insights), and check the Grafana graphs for open circuits and other issues

### following-service-production-app_elb_4xx
* Backend instances are returning too many 4xx responses
* Resolution: Investigate the root cause of the 4xx responses using the logs (CloudWatch Logs Insights), and check the Grafana graphs for issues

### following-service-production-app_http_5xx
* A single backend instance is returning too many 5xx responses
* Resolution: Investigate the root cause of the 5xx responses using the logs (CloudWatch Logs Insights) for that specific instance, and check the Grafana graphs for open circuits and other issues

### following-service-production-app_http_4xx
* A single backend instance is returning too many 4xx responses
* Resolution: Investigate the root cause of the 4xx responses using the logs (CloudWatch Logs Insights), and check the Grafana graphs for issues
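
For the four 4xx/5xx alarms above, the log investigation can start from a CloudWatch Logs Insights query along these lines. This is a sketch under stated assumptions: the helper name `status_query` is hypothetical, the filter assumes the status code appears in the log message as a space-delimited three-digit number, and the optional instance filter assumes log stream names contain the instance ID. Adjust both patterns to the real log format. The Cloudwatch Logs link at the top of this runbook already targets the `/following-service/production/application` log group.

```python
from typing import Optional


def status_query(status_class: str, stream_contains: Optional[str] = None,
                 limit: int = 20) -> str:
    """Build a Logs Insights query for log lines mentioning 4xx or 5xx codes."""
    if status_class not in ("4", "5"):
        raise ValueError("status_class must be '4' or '5'")
    parts = ["fields @timestamp, @logStream, @message"]
    # ASSUMPTION: the status code is logged as a space-delimited number.
    parts.append(f"filter @message like / {status_class}[0-9][0-9] /")
    if stream_contains:
        # ASSUMPTION: the log stream name contains the instance ID.
        parts.append(f"filter @logStream like /{stream_contains}/")
    parts.append("sort @timestamp desc")
    parts.append(f"limit {limit}")
    return "\n| ".join(parts)


# Fleet-wide 5xx query, then one narrowed to a hypothetical instance ID.
print(status_query("5"))
print(status_query("5", stream_contains="i-0123456789abcdef0"))
```

Paste the printed query into the Logs Insights editor and widen the time range if the default window returns nothing.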

### following-service-production-canary-app-status
* The Canary deployment has failed instance checks and the health check is not passing
* Resolution: There might be an issue with a recent change/deployment. Use the instance logs (CloudWatch Logs Insights) to find out why the service is not starting

### following-service-production-app-status
* The Production deployment has failed instance checks and the health check is not passing
* Resolution: There might be an issue with a recent change/deployment. Use the instance logs (CloudWatch Logs Insights) to find out why the service is not starting

### following-service-production-app_unhealthy_hosts
* The health check has failed on an instance too many times in a row. A deployment may have gone wrong, or an intermittent issue may be preventing the service from running.
* Resolution: There might be an issue with a recent change/deployment, or something may have caused the service to terminate. Check in the console which instances are failing the check, then use the logs for those instances (CloudWatch Logs Insights) to find out why the health check is failing.
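
As an alternative to the console, the failing instances can be listed from the load balancer's health data. The sketch below filters a response shaped like the Classic Load Balancer `DescribeInstanceHealth` API result (for example from `boto3.client("elb").describe_instance_health(LoadBalancerName=...)`). It assumes a Classic ELB, which the surge queue and spillover alarms suggest; the helper name `failing_instances` and the sample data are hypothetical, and the load balancer name is not given in this runbook.

```python
from typing import Dict, List


def failing_instances(response: Dict) -> List[Dict[str, str]]:
    """Return the instances the load balancer does not report as InService."""
    failing = []
    for state in response.get("InstanceStates", []):
        if state.get("State") != "InService":
            failing.append({
                "instance_id": state.get("InstanceId", ""),
                "state": state.get("State", ""),
                "reason": state.get("Description", ""),
            })
    return failing


# Hypothetical sample shaped like a DescribeInstanceHealth response.
sample = {
    "InstanceStates": [
        {"InstanceId": "i-0aaa", "State": "InService",
         "ReasonCode": "N/A", "Description": "N/A"},
        {"InstanceId": "i-0bbb", "State": "OutOfService",
         "ReasonCode": "Instance",
         "Description": "Instance has failed health checks."},
    ]
}

for item in failing_instances(sample):
    print(item["instance_id"], item["state"], item["reason"])
```

The instance IDs returned here are the ones to look up in the logs.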

### following-service-production-canary-app-cpu
* CPU utilization is too high on the canary instance.
* Resolution: Compare against the CPU utilization of the other production instances to determine whether the load is related to a deployment or to too much overall traffic. If traffic is the cause, look into scaling up to add resources.
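
The comparison between the canary and the rest of production can be made explicit with a small helper. This is a rough sketch: the function name `classify_cpu_alarm`, the 80% "high" threshold, and the 20-point margin are assumptions chosen for illustration, not the alarm's actual thresholds.

```python
from statistics import median
from typing import Sequence


def classify_cpu_alarm(canary_cpu: float, production_cpu: Sequence[float],
                       high: float = 80.0, margin: float = 20.0) -> str:
    """Suggest whether high canary CPU points at a deployment or at traffic.

    ASSUMPTION: `high` and `margin` are illustrative percentages.
    """
    if not production_cpu:
        raise ValueError("need at least one production CPU sample")
    fleet = median(production_cpu)
    if fleet >= high:
        # The whole fleet is hot, so the canary is not the odd one out.
        return "fleet-wide load: consider scaling up"
    if canary_cpu - fleet >= margin:
        # Only the canary is hot, which points at the change it is running.
        return "canary only: suspect the recent deployment"
    return "inconclusive: keep investigating"


# Hypothetical CPU percentages, for illustration only.
print(classify_cpu_alarm(92.0, [35.0, 40.0, 38.0]))
print(classify_cpu_alarm(92.0, [85.0, 90.0, 88.0]))
```

Using the median keeps a single hot production instance from making the whole fleet look overloaded.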

### following-service-production-app-cpu
* CPU utilization is too high on production instances.
* Resolution: Compare the CPU utilization across production instances to determine whether the load is related to a deployment or to too much overall traffic. If traffic is the cause, look into scaling up to add resources.

### following-service-production-app-high-cpu
* CPU utilization is too high on production instances. (This duplicates the previous alarm.)
* Resolution: Compare the CPU utilization across production instances to determine whether the load is related to a deployment or to too much overall traffic. If traffic is the cause, look into scaling up to add resources.

### following-service-production-asg-low-cpu
* CPU utilization is too low on production instances. This is likely because traffic has dropped sharply while upstream services are having issues.
* Resolution: Check #site-production for any ongoing issues and look at what is affected in the Grafana graphs. The cause is likely a site or traffic issue, which has to be fixed in the affected services.

### following-service-production-app_latency
* Latency between the ELB and the instances is too high. This could be an intermittent networking issue or related to a deployment.
* Resolution: Check whether any recent deployment or networking change could have affected the environment. Check whether the issue is isolated to one instance or affects multiple boxes.
