# Pingdom Alerts

## Background

Pingdom monitors critical paths in both Tomorrow and Starshot. It does so by
requesting certain URLs from external nodes and verifying that each returns the
expected response. Whenever Pingdom fails to receive the expected response, it
immediately retries from a second geographic location, and if that check also
fails, it pages.
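
The two-location confirmation described above can be sketched roughly as
follows. This is only an illustration of the decision logic, not Pingdom's
implementation: `should_page`, `probe`, and the location names are hypothetical.

```python
from typing import Callable, Sequence


def should_page(probe: Callable[[str], bool], locations: Sequence[str]) -> bool:
    """Return True only when checks from two locations both fail.

    `probe` is a hypothetical callable that returns True when the check
    from the given location received the expected response.
    """
    first, second = locations[0], locations[1]
    if probe(first):
        # First check passed, so there is nothing to confirm.
        return False
    # First check failed: confirm from a second geographic location.
    return not probe(second)
```

A single failed check therefore does not page on its own; only a failure
confirmed from the second location does.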

## Tachyon App Error Response Codes

- 500: App caught an unhandled server error
- 502: App received an error response from GQL
- [503](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-troubleshooting.html#http-503-issues):
  ALB returns this when there are no healthy ECS targets
- [504](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-troubleshooting.html#http-504-issues):
  ALB returns this when the connection it creates to a target results in a
  timeout
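
As a quick triage aid, the list above can be expressed as a lookup that says
which layer produced a given status code. This is a minimal sketch; the names
`TACHYON_ERROR_SOURCES` and `describe_status` are made up for illustration and
do not exist in the app.

```python
# Hypothetical lookup built from the status code list in this runbook.
TACHYON_ERROR_SOURCES = {
    500: "App: caught an unhandled server error",
    502: "App: received an error response from GQL",
    503: "ALB: no healthy ECS targets",
    504: "ALB: connection to a target timed out",
}


def describe_status(code: int) -> str:
    """Return the runbook's explanation for a status code, if it has one."""
    return TACHYON_ERROR_SOURCES.get(code, "Not listed in this runbook")
```

For example, `describe_status(503)` points at the ALB rather than the app,
which suggests checking ECS target health first.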

## Resolution

1. Check #emp-deploys for an in-progress or recently completed deploy. If there
   is one, verify that it is safe to roll back, then do so.

1. Check #site-production for ongoing incidents that might be impacting us. If
   an incident looks to be the root cause of our Pingdom failures, use our
   [Logs and Metrics](../../apps/tomorrow/logs-and-metrics.md) to size up and
   communicate how we're being impacted to aid their triaging efforts. Continue
   to monitor the situation and provide updates as the level of impact changes.

1. Check the
   [GraphQL Dependency Overview](https://grafana.xarth.tv/d/qe2X2XAmz/graphql-dependency-overview)
   for any abnormalities in the last 10 minutes. This is our most frequent
   source of Pingdom alerts. If you see any that match the timing of our alerts,
   notify #site-production of the services you suspect are causing issues, the
   approximate % of users being impacted, and how. Continue to monitor the
   situation and provide updates as the level of impact changes.

1. Check [Fastly's Status Page](https://status.fastly.com/) for ongoing issues
   that might be affecting us.

1. Investigate other possible root causes using our
   [Logs and Metrics](../../apps/tomorrow/logs-and-metrics.md).

1. Ping Matt (@follem) for the error message reported in Pingdom.
