# Web on call

This is the go-to guide for diagnosing and fixing common problems that come up during web on call.

## Ownership
Web on call is shared among a set of business units and teams.  These teams are being brought up to speed with on call starting 4/17/2017.  This table contains the current assignments and availability status:

![Web/Web contacts](images/web_web_contacts.png "Web/Web contacts")

The most current version of this data is in the Google Sheet [Web/Web contacts](https://docs.google.com/a/justin.tv/spreadsheets/d/1oywakct4Ag3olch-eGS_ZmkyglBFarH2_hpqPgWXSSA/edit?usp=sharing).

### Scheduling
The on call schedule is managed in [PagerDuty](https://twitchoncall.pagerduty.com/schedules#P30GJFL) and rotates through the list of callees on a weekly basis.  If someone is going on PTO or will be otherwise unavailable during their active on call period, there is an *Override Layer* that can be used to substitute an alternative person to handle the call.


## Triage
The first thing to do when any issue is raised is to triage it. Generally, we want to get things back to a state of stability as quickly as possible. The triage process should involve answering the following questions:

* What is the impact? (e.g., front page is down for all users)
* What is the root cause? (e.g., was it a bad deploy, hardware failure, etc.?)
* What can we do to mitigate it? (e.g., roll back code, disable feature flag, etc.)

The team member who is on call may file an issue and assign it a severity according to [Twitch Severity Levels and SLAs](https://docs.google.com/document/d/1PlgBp9scb0ov_ER5QKrQMZJ3XXVyDGn7KzZlFvJR7Xk/edit?ts=583dcabe#heading=h.xpianakqqjx). It should usually be clear which BU owns the bug. If ownership is not clear, the on-call member is responsible for resolving it.

* Sev-1, Sev-2, and Sev-2.5 issues will most likely be urgent enough to require immediate resolution.
* Sev-3 issues can be evaluated to confirm that they actually affect the SLA and that the effort required is warranted by the gravity of the issue.

### Communication
After receiving a page, **Slack** is the primary mechanism used for communicating during triage.  The **#web-oncall** room can be used to communicate with the group of users who handle on call.  They can offer pointers for triage and possible next steps. **#site-production** and **#emergency** should also be monitored as discussion will often pick up very quickly in those rooms when a site issue arises.

Once the impact has been determined and verified, we should focus on pinpointing the root cause of the failure and then figuring out what we can do to mitigate it. In the case of a bad software deployment, this usually means rolling back to a previous version or reverting that specific commit, when feasible. Look for recent deploy notices in Slack **#web**.

In some cases, a failure of another system can manifest as an alert in our own, in which case the on call person should escalate to that team. For example, DNS resolvers being down should be escalated to the systems on call team. Look for discussions in Slack **#site-production** and **#emergency** to see if someone is aware of the dependent failure.

### Dashboards

Dashboards are your friend. They can be used to quickly spot things that are out of the ordinary and allow you to focus the investigation on a particular system or area.

The following links should be useful in almost every on-call scenario when trying to determine the root cause:

* [Grafana - Rails - TV](https://grafana.internal.justin.tv/dashboard/db/rails-tv) General health of the stack, response times, throughput, etc.
* [Grafana - Rails Service Timings](https://grafana.internal.justin.tv/dashboard/db/rails-service-timings) Useful for viewing health of underlying services that Rails talks to
* [Grafana - Rails - Workers](https://grafana.internal.justin.tv/dashboard/db/rails-workers) General health of various asynchronous jobs that run under web/web
* [Grafana - Rails Availability & Latency](https://grafana.internal.justin.tv/dashboard/db/rails-availability-and-latency) Uptime, success/failure, latency
* [Grafana - Edge Infra](https://grafana.internal.justin.tv/dashboard/db/edge-infra) General health of the infrastructure (ELB 5xx rates, etc.)
* [Ganglia](https://ganglia-ec2.internal.justin.tv/) Useful for looking at groups of hosts (database, app boxes, varnish, etc.)
* [New Relic](https://rpm.newrelic.com/accounts/26263/applications/131219/transactions?type=app#id=149487134&tab-transaction-149487134=app_server_historical_performance) Note that only a few of our servers are profiled by New Relic. A widespread problem will show up here, but a problem limited to a few servers outside the profiled group will not.
* [Rollbar](https://rollbar.com/Twitch/Website/items/) Errors/warnings captured via app code and CloudWatch alerts
* [Trace](https://trace.internal.justin.tv/) Choose web/web service to generate report
* [Crono (Jobs) Dashboard](https://admin.internal.twitch.tv/admin/jobs) Status of scheduled job executions

### Request Pipeline

Requests to web/web go through various caches and an ELB.
For `www.twitch.tv` a request routes through the following:

cdn (cache) -> elb (no cache) -> nginx (no cache) -> varnish (cache) -> Rails -> memcache (cache).

It may be worth questioning whether the fault lies somewhere in this pipeline before digging too deeply into the Rails service itself.
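
One quick way to probe the pipeline is to look at the response headers for evidence that a cache layer answered the request. The sketch below is a heuristic, not a definitive check: it relies on the standard HTTP `Age` header, and which of our layers (CDN or Varnish) actually sets it is an assumption. The function name `cache_evidence` is hypothetical.

```shell
# Reads HTTP response headers on stdin.
# Prints "cache-hit" if an Age header with a positive value is present,
# otherwise "no-cache-evidence" (which does not prove a cache miss).
cache_evidence() {
    tr -d '\r' | awk -F': *' '
        tolower($1) == "age" && $2 + 0 > 0 { hit = 1 }
        END { print (hit ? "cache-hit" : "no-cache-evidence") }'
}

# Example usage (not run here):
#   curl -sI https://www.twitch.tv/ | cache_evidence
```

If a cached response looks healthy while uncached requests fail, the problem is more likely behind the cache (Rails or its dependencies) than in front of it.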

### WebCDN

Health of the CDN can be monitored using these dashboards:
* [Grafana - CDN Metrics System Health](https://grafana.internal.justin.tv/dashboard/db/cdn-metrics-system-health) Aggregated success/failure rates
* [Grafana - CDN Metrics Error Rates](https://grafana.internal.justin.tv/dashboard/db/cdn-metrics-error-rates) Per region success/failure rates
* [Grafana - CDN Metrics Timings Breakdown](https://grafana.internal.justin.tv/dashboard/db/cdn-metrics-timings-breakdown) Per region delivery timings

If the CDN is experiencing issues, send an email to our contact: [James Host](mailto:jhost@twitch.tv) TPM - CDN Edge. Symptoms of CDN issues include: boxart/thumbnails not loading, emotes not loading. If these issues are seen on the site or being reported by users on Twitter, send an email and collect as much information as possible from users experiencing issues to provide to CDN Edge if necessary.

### Rails Utilization at 100%

This is the most common symptom of an issue that results in on call being paged. Most of the time it is due to Rails waiting on a network call to return that is taking longer than usual, but not always.

Some common cases:

* Database responses taking too long. Can be verified through the New Relic or Grafana dashboards.
* External system taking too long to respond. Similar to the database issue, may be verifiable through New Relic.
* Bad haproxy configuration. May require rollback of puppet config.
* Rails Worker boxes occasionally go to 100% CPU.  There are 66 rails-worker instances running as of 4/12/2017.

The triage process should identify the system that is not healthy and escalate the issue to that team.  The rails-worker boxes should be terminated and restarted when they are at 100% CPU.

Steps to do this:

1. Log in to the AWS console, account **twitch-web-aws**
2. In the Services menu, choose **EC2**
3. Click on **Instances**
4. In the search field type **rails-worker** and press enter
5. Select via checkbox all/some of the instances that are at 100% CPU
6. In the Actions menu, choose **Instance State -> Terminate**

Note: New instances are launched automatically after the existing ones are terminated.
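
The console steps can also be sketched with the AWS CLI. This is a minimal sketch under stated assumptions, not a vetted procedure: it assumes the instances carry a `Name` tag containing `rails-worker` and that a local CLI profile named `twitch-web-aws` exists. Both function names are hypothetical. `build_terminate_cmd` only prints the command so it can be reviewed before anyone runs it.

```shell
# ASSUMPTION: a local AWS CLI profile with this name maps to the account.
AWS_PROFILE_NAME=twitch-web-aws

# Lists running instances whose Name tag contains "rails-worker".
# ASSUMPTION: the tag filter matches how these instances are named.
list_rails_workers() {
    aws ec2 describe-instances --profile "$AWS_PROFILE_NAME" \
        --filters "Name=tag:Name,Values=*rails-worker*" \
                  "Name=instance-state-name,Values=running" \
        --query 'Reservations[].Instances[].InstanceId' --output text
}

# Prints (does not run) the terminate command for the given instance IDs.
build_terminate_cmd() {
    if [ "$#" -eq 0 ]; then
        echo "no instance IDs given" >&2
        return 1
    fi
    echo "aws ec2 terminate-instances --profile $AWS_PROFILE_NAME --instance-ids $*"
}
```

Which instances are actually at 100% CPU still has to be confirmed from the console or the dashboards; the sketch does not check CPU.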


![AWS Console](images/terminate_rails_worker_instance.png "AWS Console EC2 Instances")

The slow network calls described above can be mitigated by setting proper timeout values and potentially wrapping the calls in Hystrix.

### Restarting Unicorn across all boxes

Sometimes the best fix is to just reboot everything. The easiest way to do this is through the [oncall scripts](https://git.xarth.tv/web/oncall).

### Checking if the job queue is backed up

Open the [web view](../playbooks/rabbit.md#to-open-the-web-view). Check the "queues" tab to see a detailed view of queue lengths, job consumption rates, and job production rates.

Keep in mind that jobs often spawn other jobs, and job production rate is sampled over a small period of time, so it will spike and drop often -- for example, a stream_up job can spawn a million or more email notification jobs if the broadcaster whose stream came up has a lot of followers. The more important thing to look at is whether or not the overall queue size is going down over longer periods of time (several minutes).
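
The "is the queue draining over several minutes" check can be sketched as a small helper that compares the first and last of a series of queue-length samples. This is a minimal sketch: the function name `queue_trend` is hypothetical, and collecting samples with `rabbitmqctl list_queues name messages` assumes `rabbitmqctl` is available on the box.

```shell
# Reads queue-length samples on stdin, one number per line, oldest first.
# Prints "draining", "growing", or "flat" by comparing first and last sample,
# so short spikes in the middle are ignored.
queue_trend() {
    awk 'NR == 1 { first = $1 + 0 }
         { last = $1 + 0 }
         END {
             if (NR < 2) { print "insufficient-data"; exit }
             if (last < first) print "draining"
             else if (last > first) print "growing"
             else print "flat"
         }'
}

# Example usage (not run here): take one sample per minute for five minutes
# by hand, e.g. from `sudo rabbitmqctl list_queues name messages`, then:
#   printf '%s\n' 1200000 1350000 900000 600000 400000 | queue_trend
```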

### How to start the rabbitmq server if it goes down
If [rabbit is down](https://git.xarth.tv/twitch/docs/blob/master/playbooks/rabbit.md), ssh into the appropriate box and start it:

    sudo /etc/init.d/rabbitmq-server start

### How to disable HTTP & workers on app servers

    sudo -u jtv touch /home/jtv/justintv/current/tmp/pids/disabled

The command above disables both HTTP and workers. To keep workers up while HTTP stays disabled, run it first, then disable unicorn_rails, then remove the disabled file:

    sudo -u jtv rm /home/jtv/justintv/current/tmp/pids/disabled
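
The full "HTTP down, workers up" sequence can be sketched as an ordered list of commands. This is a sketch under an assumption: the document does not say how unicorn_rails is disabled, so the daemontools command `svc -d` is used here as a guess based on the `svstat` usage elsewhere in this guide. The function name is hypothetical, and it only prints the steps.

```shell
DISABLED_FILE=/home/jtv/justintv/current/tmp/pids/disabled

# Prints the three steps in order; nothing is executed.
http_only_disable_steps() {
    echo "sudo -u jtv touch $DISABLED_FILE"
    # ASSUMPTION: unicorn_rails is supervised by daemontools, so `svc -d`
    # brings it down and keeps it down.
    echo "sudo svc -d /etc/service/unicorn_rails"
    echo "sudo -u jtv rm $DISABLED_FILE"
}
```

Review the printed steps before running them by hand on the app server.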

You can always verify whether you're taking HTTP traffic by running

    tail -F /var/log/jtv.log

You'll also see `cbrails[some_number]` lines, which show you the status of the workers.

### Checking whether services are running

For unicorn (i.e. web requests):

    sudo svstat /etc/service/unicorn_rails
    /etc/service/unicorn_rails/: down 476 seconds, normally up

For workers:

    sudo svstat /etc/service/cbrails*
    /etc/service/cbrails1: up (pid 16530) 73 seconds
    /etc/service/cbrails2: up (pid 16452) 74 seconds
    /etc/service/cbrails3: up (pid 16451) 74 seconds
    /etc/service/cbrails4: up (pid 16456) 74 seconds
    /etc/service/cbrails5: up (pid 16455) 74 seconds
    /etc/service/cbrails6: up (pid 16453) 74 seconds
    /etc/service/cbrails7: up (pid 16557) 73 seconds
    /etc/service/cbrails8: up (pid 16454) 74 seconds
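
When many services are listed, it can help to filter the `svstat` output down to the ones that are not running. A minimal sketch, assuming the output format shown above; the function name `down_services` is hypothetical.

```shell
# Reads svstat output on stdin and prints the path of every service
# whose state is "down".
down_services() {
    awk '$2 == "down" { svc = $1; sub(/:$/, "", svc); print svc }'
}

# Example usage (not run here):
#   sudo svstat /etc/service/* | down_services
```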

### Running commands on multiple servers

    jtv@brigade1:~$ pdsh -R ssh -w app[5,7-10,12-19,22,27] sudo svstat /etc/service/cbrails*
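
`pdsh` prefixes every output line with the host name followed by a colon, so the combined output can be summarized per host. The sketch below counts how many reported services are not `up` on each host. It assumes the remote command is `svstat` as in the example above; the function name `not_up_per_host` is hypothetical.

```shell
# Reads pdsh output ("host: service: state ...") on stdin and prints
# "host count" per host, where count is the number of services not "up".
not_up_per_host() {
    awk '{
             host = $1; sub(/:$/, "", host)
             if (!(host in bad)) bad[host] = 0
             if ($3 != "up") bad[host]++
         }
         END { for (h in bad) print h, bad[h] }' | sort
}

# Example usage (not run here):
#   pdsh -R ssh -w app[5,7-10] sudo svstat /etc/service/cbrails* | not_up_per_host
```

Hosts with a non-zero count are the ones worth logging into first.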
    
## Long Term

If there are action items left after triage, please log an issue in Jira with the "oncall" label. If possible, identify and assign an assignee for this issue.

An example issue is [here](https://twitchtv.atlassian.net/browse/COMMERCE-53). These issues show up on the ["Oncall" Jira board](https://twitchtv.atlassian.net/secure/RapidBoard.jspa?rapidView=228).

As of this writing, there is a desire to have a TPM assigned to manage this board and help us make sure all issues are moving beyond triage and being fully remediated.

Active goals and deliverables include the decentralization of web/web into microservices owned by individual business units.

