# Playbooks

Woken up by the pager? Trying to debug the site? Curious about how things work
under the hood? These playbooks are here to help you navigate the
infrastructure behind our site and keep it up and running.

This page describes, and links to, services and other docs in this repo. If you're currently on call and need a quick guide to diagnosing issues, start with the relevant team's on-call guide:

* [web](../web/on_call.md)
* [TMI](../tmi/on_call.md)

## Monitoring

Here are all of the bits of monitoring infrastructure that we use to keep track
of what's going on with the site.

* [Ganglia](http://ganglia.internal.justin.tv): LDAP
* [Graphite/Statsd](http://graphite.internal.justin.tv): non-LDAP credentials,
  ask for details
* [Nagios (Dirty)](http://nagios.internal.justin.tv): LDAP
* [Nagios (Clean)](http://nagios-clean.internal.justin.tv): LDAP
* [PagerDuty](http://twitchoncall.pagerduty.com): ask your manager for an invite
  to set up an account
* [Rabbit](rabbit.md): non-LDAP credentials,
  ask for details
* [Rollbar](rollbar.md): ask Doug
* [New Relic](https://rpm.newrelic.com/accounts/26263/applications/131219): also
  ask Doug

### Hystrix circuits

We use a library called [Hystrix](https://github.com/afex/hystrix-go) to implement the circuit breaker pattern in our applications. A dashboard shows live metrics from these circuits, which gives you visibility into which backends may be having issues during an outage.

* Web
  * [Production](http://hystrix.dev.us-west2.justin.tv/hystrix_dashboard/monitor/monitor.html?stream=http%3A%2F%2Fhystrix.dev.us-west2.justin.tv%2Fturbine%2Fturbine.stream%3Fcluster%3Dweb&title=web-production)
* Web Workers
  * [Ruby Production](http://hystrix.dev.us-west2.justin.tv/hystrix_dashboard/monitor/monitor.html?stream=http%3A%2F%2Fhystrix.dev.us-west2.justin.tv%2Fturbine%2Fturbine.stream%3Fcluster%3Dweb_ruby_workers&title=web-ruby-workers-production)
  * [Go Production](http://hystrix.dev.us-west2.justin.tv/hystrix_dashboard/monitor/monitor.html?stream=http%3A%2F%2Fhystrix.dev.us-west2.justin.tv%2Fturbine%2Fturbine.stream%3Fcluster%3Dweb_go_workers&title=web-go-workers-production)
* Discovery
  * [Production](http://hystrix.dev.us-west2.justin.tv/hystrix_dashboard/monitor/monitor.html?stream=http%3A%2F%2Fhystrix.dev.us-west2.justin.tv%2Fturbine%2Fturbine.stream%3Fcluster%3Ddiscovery&title=discovery)
* Chat
  * [Production](http://hystrix.dev.us-west2.justin.tv/hystrix_dashboard/monitor/monitor.html?stream=http%3A%2F%2Fhystrix.dev.us-west2.justin.tv%2Fturbine%2Fturbine.stream%3Fcluster%3Dtmi_goclue_production&title=tmi_clue_production)
  * [Darklaunch](http://hystrix.dev.us-west2.justin.tv/hystrix_dashboard/monitor/monitor.html?stream=http%3A%2F%2Fhystrix.dev.us-west2.justin.tv%2Fturbine%2Fturbine.stream%3Fcluster%3Dtmi_goclue_darklaunch&title=tmi_clue_darklaunch)
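
For readers new to the circuit breaker pattern mentioned above, here is a minimal, language-neutral sketch in Python. It is not hystrix-go's implementation or API: Hystrix-style breakers generally trip on an error rate over a rolling window, while this toy version counts consecutive failures to keep the idea visible. The names `CircuitBreaker`, `max_failures`, and `reset_after` are hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Toy breaker: opens after consecutive failures, retries after a cooldown."""

    def __init__(self, max_failures=3, reset_after=5.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, fallback=None):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Open: fail fast instead of hitting the struggling backend.
                if fallback is not None:
                    return fallback()
                raise CircuitOpenError("circuit open")
            # Cooldown elapsed: fall through and allow one trial call.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                # Open the circuit, or re-open it after a failed trial call.
                self.opened_at = self.clock()
            if fallback is not None:
                return fallback()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

On the dashboard, an open circuit means callers are being short-circuited to their fallback, which is why it is a quick signal for which backend is unhealthy.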

#### Not yet ported to clean hystrix dashboard

* Deploy
  * [Production](http://stats11.sfo01.justin.tv:8080/hystrix_dashboard/monitor/monitor.html?stream=http%3A%2F%2Fstats11.sfo01.justin.tv%3A8080%2Fturbine%2Fturbine.stream%3Fcluster%3Dskadi)
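
The dashboard links above all share one shape: a `monitor.html` page whose `stream` query parameter is a URL-encoded Turbine stream URL for a given cluster, plus an optional `title`. If you need a link for a cluster that is not listed, a small helper like the following can build it. This is a sketch inferred from the existing links; the function name `dashboard_url` is hypothetical, and whether a given cluster name exists in Turbine is something you would need to confirm.

```python
from urllib.parse import urlencode

# Host of the clean dashboard, taken from the links above.
CLEAN_BASE = "http://hystrix.dev.us-west2.justin.tv"


def dashboard_url(cluster, title=None, base=CLEAN_BASE):
    """Build a Hystrix dashboard link for one Turbine cluster."""
    # The Turbine stream URL is itself passed as a query parameter,
    # so it gets percent-encoded by the outer urlencode call.
    stream = f"{base}/turbine/turbine.stream?{urlencode({'cluster': cluster})}"
    params = {"stream": stream}
    if title is not None:
        params["title"] = title
    return f"{base}/hystrix_dashboard/monitor/monitor.html?{urlencode(params)}"


if __name__ == "__main__":
    print(dashboard_url("web", title="web-production"))
    print(dashboard_url("skadi", base="http://stats11.sfo01.justin.tv:8080"))
```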

## Infrastructure

* Puppet

## Routing and Caching Services

* nginx
* varnish
* haproxy
* rabbitmq
* cassandra
* memcache
* postgresql
* Level 3 (CDN)

## Twitch Services

* unicorn/app (twitch.tv website and APIs)
    * Architecture
    * Playbook
* web workers (cbrails/mailapp)
    * Architecture
    * Playbook
* jax
    * Architecture
    * Playbook
* discovery
* tmi
* countess
* owl

## Individual Services and Systems

* AWS

## External Scenarios

Here are a few common scenarios that can cause us to have problems - we've had
to deal with these more than once. Links to specific procedures and tools to
diagnose will be added over time.

* BetterTTV DDoS of APIs/endpoints: check night's BetterTTV repo on GitHub for
  recent changes.
* External website with embeds causes high API load (via cache-busting API
  requests): use varnish monitoring and nginx referrers in logs to determine
  whether the problem is caused by a particular site.
* DDoS attacks of various flavors.
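
For the embed scenario above, one quick way to see whether a single site is responsible is to count referrer hosts in the nginx access logs. The sketch below assumes nginx's default `combined` log format; our actual log format and log paths may differ, so treat the regex and the function name `top_referrer_hosts` as assumptions to adapt.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Assumes the "combined" format, which ends with:
#   "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent"
LINE_RE = re.compile(r'"[^"]*" \d{3} \S+ "(?P<referer>[^"]*)" "[^"]*"$')


def top_referrer_hosts(lines, n=10):
    """Return the n most common referrer hosts as (host, count) pairs."""
    hosts = Counter()
    for line in lines:
        match = LINE_RE.search(line.rstrip("\n"))
        if not match:
            continue  # skip lines that do not look like combined-format entries
        try:
            host = urlsplit(match.group("referer")).hostname
        except ValueError:
            host = None  # malformed referrer URL
        # nginx logs "-" when no referrer was sent; group those together.
        hosts[host or "(none)"] += 1
    return hosts.most_common(n)


if __name__ == "__main__":
    import sys

    # Usage: python top_referrers.py < access.log
    for host, count in top_referrer_hosts(sys.stdin):
        print(f"{count:8d}  {host}")
```

A single host dominating the output during a load spike is a strong hint that its embeds are the source.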
