# Oncall For Search and Discovery

We use [PagerDuty](https://twitchoncall.pagerduty.com/schedules#P46G6YW) for our alerts. Ask the team lead (Ian) for PagerDuty access.
We have two schedules, primary and secondary. Each schedule has one user assigned to it, and the assignment rotates weekly.

All incoming alerts go to the primary oncall schedule.
If the primary oncall does not ack the alert, it escalates to the secondary oncall, who is the primary from the previous week.
If the secondary doesn't ack either, it escalates to the entire team, with the paged user chosen at random.
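
The rotation and escalation chain above can be sketched in a few lines of Python. This is only an illustration of the rules as described, not how PagerDuty implements them: the `rotation` list, the week numbering, and both function names are hypothetical, and it assumes the final random pick is drawn from the whole team (including the primary and secondary).

```python
import random
from typing import List, Optional, Sequence, Tuple


def oncall_for_week(rotation: Sequence[str], week: int) -> Tuple[str, str]:
    """Return (primary, secondary) for a given week number.

    The secondary is whoever was primary the week before.
    """
    primary = rotation[week % len(rotation)]
    secondary = rotation[(week - 1) % len(rotation)]
    return primary, secondary


def escalation_order(
    rotation: Sequence[str],
    week: int,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Return who gets paged, in order, if nobody acks."""
    rng = rng or random.Random()
    primary, secondary = oncall_for_week(rotation, week)
    # Assumption: the last step picks uniformly from the entire team.
    fallback = rng.choice(list(rotation))
    return [primary, secondary, fallback]
```

For example, with a three-person rotation `["alice", "bob", "carol"]`, week 1 has `bob` as primary and `alice` (primary in week 0) as secondary.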

Alerts
-------------------

We have four main sources of oncall alerts:

* CloudWatch - Monitoring for our AWS systems
* Cabot - Monitoring for service Graphite statistics
* Bosun - Monitoring for service Graphite statistics
* Runscope - Monitoring APIs for our services

[Cloudwatch](https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2#)
-------------------
Our CloudWatch alerts are configured via Terraform for our Terraform-managed services (discovery, searchindexer, similar-channels).

For our Beanstalk services and other AWS resources, they are configured via the AWS console.

Most of our CloudWatch (CW) alerts go to Search-Discovery-OnCall, an escalation policy that holds an alert for 15 minutes before escalating it to the primary. The hold keeps alarms that normally sit in INSUFFICIENT_DATA and flip to ALARM for only one period from paging anyone.
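
To make the effect of the 15 minute hold concrete, here is a minimal Python sketch of the decision it amounts to. It is not PagerDuty or CloudWatch code. The `should_page` function and the `transitions` list are hypothetical, and it assumes the alert is dropped as soon as the alarm leaves the ALARM state.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

# Hold period taken from the escalation policy described above.
HOLD = timedelta(minutes=15)


def should_page(
    transitions: List[Tuple[datetime, str]],
    now: datetime,
    hold: timedelta = HOLD,
) -> bool:
    """Decide whether the primary should be paged at time `now`.

    `transitions` is a chronological list of (timestamp, new_state)
    alarm state changes. We page only if the most recent state is
    ALARM and it has lasted for at least the hold period.
    """
    if not transitions:
        return False
    last_time, last_state = transitions[-1]
    return last_state == "ALARM" and now - last_time >= hold
```

An alarm that goes to ALARM and returns to INSUFFICIENT_DATA five minutes later never pages, while one that stays in ALARM pages once 15 minutes have passed.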

[Cabot](http://cabot.internal.justin.tv/accounts/login/?next=/services/)
-------------------
Cabot is a monitoring system that can make HTTP requests and query Graphite metrics. We use it to monitor business and system metrics for our services.

Cabot runs on a dedicated box.
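
For a rough idea of what a Graphite metric check involves, the sketch below builds a Graphite render API URL and tests the latest datapoint against a threshold. It is not Cabot's implementation. The function names and the threshold rule are hypothetical, any Graphite host or metric name passed in is a placeholder, and treating missing data as unhealthy is an assumption.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen


def render_url(base_url: str, target: str, window: str = "-5min") -> str:
    """Build a Graphite render API URL that returns JSON."""
    query = urlencode({"target": target, "from": window, "format": "json"})
    return f"{base_url}/render?{query}"


def fetch(url: str) -> str:
    """Fetch the raw JSON body from Graphite."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8")


def latest_value(render_json: str) -> Optional[float]:
    """Return the newest non-null value of the first series, if any."""
    series = json.loads(render_json)
    if not series:
        return None
    # Datapoints are [value, timestamp] pairs; value may be null.
    for value, _timestamp in reversed(series[0]["datapoints"]):
        if value is not None:
            return value
    return None


def metric_is_healthy(render_json: str, max_value: float) -> bool:
    """Assumption: missing data counts as unhealthy."""
    value = latest_value(render_json)
    return value is not None and value <= max_value
```

A check would call `fetch(render_url(...))` and pass the body to `metric_is_healthy`. Keeping the parsing separate from the HTTP call makes the threshold logic easy to test without a live Graphite server.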

[Bosun](https://bosun.internal.justin.tv/)
-------------------
Bosun is the replacement for Cabot for Graphite metrics. It has similar features to Cabot, but keeps [its configuration in code](https://git.xarth.tv/systems/bosun-config) rather than in a UI. Currently only a few alerts have been moved over to Bosun.

[Runscope](https://www.runscope.com/radar/ji54ym1dcl3b)
-------------------
Runscope is a great tool for monitoring APIs. Visage currently uses it extensively to test api.twitch.tv. Because the main Runscope service runs externally, we run our own agent to test our internal APIs. The agent currently runs on the Jax canary box.

The agent occasionally stops working. If there's an alert saying "Agent disconnected", restart the runscope-agent on the box.
