# Piper-Service On Call Runbook

## On Call Preparation

 * General Preparation: https://wiki.twitch.com/display/DS/Oncall+Preparation
 * AWS accounts: `twitch-service-piper-dev` and `twitch-service-piper-aws`
 * Slack channel: `#insights-alerts`

## Airflow Dashboard

Piper dashboard pages (Twitch VPN required):

 * staging: https://staging.piper-airflow.twitch.a2z.com/admin/
 * production: https://prod.piper-airflow.twitch.a2z.com/admin/

The dashboard lists the dags with their recent "Dag Runs" and "Recent Tasks". If you have been alerted to failures in reporting, you'll likely spot one or more red circles, each indicating failed dag runs. Click through the red circle, then select an individual DagRun by clicking the link under the "Run Id" column for the failed row you're interested in. Doing so takes you to the graph view of the DagRun.

#### Dag Graph View

On the Dag Graph View you can see what went wrong with the DagRun: tasks that failed have a red border. Click on a failed task and select "View Log" to **examine the logs**. In some cases, a DagRun fails not because of a structural issue but because of a timeout caused by scheduler volume. In that case, the failed task will not have any logs and simply needs to be retried.

#### Retry Dag Tasks

Airflow doesn't have an option to retry, but it allows you to "Clear" a set of tasks. The scheduler will find these tasks and decide whether they should run again and in which order. All dag tasks in piper are idempotent, so it is fine to clear them when they fail so that they run again from the beginning.

Once you've corrected the underlying problem and are prepared to retry the dag, you can click on the failed task and click the "Clear" button. By default the "Downstream" and "Recursive" options next to the Clear button will be selected, and should stay this way. Clearing the failed task will cause the task to be rerun, reset the dag to the Running state, and queue all downstream tasks.
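
To make the effect of the "Downstream" option concrete, here is a minimal Python sketch (plain Python, not Airflow code) that computes which tasks a clear would reset. The task names and the dependency graph are hypothetical; in a real dag the dependencies come from the dag definition.

```python
from collections import deque

# Hypothetical task graph: each task maps to the tasks that depend on it.
DOWNSTREAM = {
    "load_events": ["aggregate"],
    "aggregate": ["validate", "write_rollup"],
    "validate": ["write_report"],
    "write_rollup": ["write_report"],
    "write_report": [],
}


def tasks_to_clear(failed_task, downstream=DOWNSTREAM):
    """Return the failed task plus everything downstream of it, breadth-first."""
    seen = [failed_task]
    queue = deque([failed_task])
    while queue:
        current = queue.popleft()
        for child in downstream.get(current, []):
            if child not in seen:
                seen.append(child)
                queue.append(child)
    return seen


print(tasks_to_clear("aggregate"))
# ['aggregate', 'validate', 'write_rollup', 'write_report']
```

Upstream tasks that already succeeded (`load_events` here) are left alone, which is why clearing only the failed task is usually enough.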

Sometimes a dag will be marked as failed even though all tasks succeeded. In this case, you can either clear the first task or set the dag to running.


## Grafana Dashboards

Piper has a data [visualization dashboard](https://grafana.internal.justin.tv/d/000001708/insights-piper-data-visualizations?orgId=1) that lets us monitor whether the data is reasonable. Outliers and abnormal trends could all be indicators of piper failures.

The [Operation Dashboard](https://grafana.internal.justin.tv/d/9n_KQCLmk/piper-operation-dashboard?orgId=1) shows host status, ELB metrics, and other operational metrics we might need while on call.


## Report SLA: 2 business days after report is due

Most piper issues are not urgent. Reports are generated asynchronously and it is expected that some days will take longer than others.

Reports for a given UTC day start generating 24 hours after the day is over, and may take up to 48 hours to complete. That gives us 2 days to respond to problems, although measured from the report date itself it is closer to 4 days.

For example, the data for `Sep-10 UTC` is going to be available sometime between `Sep-12 00:01 UTC` and `Sep-13 23:59 UTC`.

 * During `Sep-10` from `00:00 UTC` to `23:59 UTC`, events are recorded in data science.
 * Data science events may take minutes to hours to be available. Because of that, piper-airflow waits 24 hours to start generating the reports.
 * On `Sep-12 00:01 UTC` (`Sep-11 17:01 PDT`) piper starts generating reports for `Sep-10`.
 * Reports should be available a few hours later, for example on `Sep-12 02:11 UTC` (`Sep-11 19:11 PDT`).
 * If there's a problem, we have until `Sep-13 23:59 UTC` (`Sep-13 16:59 PDT`) to fix the issue.
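
The timeline above can be reproduced with a small Python sketch. The function name `report_window` is made up for illustration, the year 2019 is arbitrary, and the local-time conversion assumes Pacific daylight time (UTC-7), which applies in September.

```python
from datetime import date, datetime, time, timedelta, timezone

# Assumption: fixed UTC-7 offset (Pacific daylight time); winter dates are UTC-8.
PACIFIC_DAYLIGHT = timezone(timedelta(hours=-7), "PDT")


def report_window(report_day):
    """Return (generation_start, fix_deadline) in UTC for a UTC report day."""
    # Generation starts 24 hours after the report day ends: day + 2 at 00:01 UTC.
    start = datetime.combine(
        report_day + timedelta(days=2), time(0, 1), tzinfo=timezone.utc
    )
    # Reports may take up to 48 hours: the deadline is day + 3 at 23:59 UTC.
    deadline = datetime.combine(
        report_day + timedelta(days=3), time(23, 59), tzinfo=timezone.utc
    )
    return start, deadline


start, deadline = report_window(date(2019, 9, 10))
for label, moment in (("start", start), ("deadline", deadline)):
    local = moment.astimezone(PACIFIC_DAYLIGHT)
    print(label, moment.strftime("%m-%d %H:%M UTC"), "/", local.strftime("%m-%d %H:%M PDT"))
```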


## Alerts

### Rollbar and Slack

Piper dags use `on_failure_callback` to report exceptions to Rollbar (piper bucket), and Rollbar is configured to post new errors to the Slack channel `#insights-alerts`.

Dag failures and python exceptions do not create Pagerduty alerts. Failed dags retry every 2 hours up to 5 times, and since they start around 5pm they would page us around 2am. They also tend to fail on special days like Christmas. A failed dag is a problem, but one that can be solved within 48 hours; there's no need to lose sleep because a task could not connect to s3. The next day, however, the dag should be cleared so that it retries. Since these failures don't page, the engineer on call needs to pay attention to the `#insights-alerts` channel in Slack.
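
As a rough illustration of that setup, the sketch below shows a plain dict in the shape Airflow accepts as a dag's `default_args`, using the retry policy described above. The callback body, the `reported` list (a stand-in for the Rollbar client), and the task name are assumptions for illustration only; the real Piper callback may look different.

```python
from datetime import timedelta
from types import SimpleNamespace

reported = []  # stand-in for the Rollbar client, so the sketch runs without it


def report_failure(context):
    """Failure callback: Airflow passes a context dict describing the failed task."""
    ti = context["task_instance"]
    # "exception" may not be present on every Airflow version, hence .get().
    reported.append((ti.dag_id, ti.task_id, context.get("exception")))


# Retry every 2 hours, up to 5 times; the callback fires once retries are exhausted.
default_args = {
    "retries": 5,
    "retry_delay": timedelta(hours=2),
    "on_failure_callback": report_failure,
}

# Simulate the final failure of a hypothetical task.
fake_context = {
    "task_instance": SimpleNamespace(dag_id="extensions_overview_v2", task_id="load_events"),
    "exception": ConnectionError("could not connect to s3"),
}
default_args["on_failure_callback"](fake_context)
print(reported[0][:2])  # ('extensions_overview_v2', 'load_events')
```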

Rollbar only notifies Slack of new errors. Rollbar errors are grouped based on the fingerprint that we provide using their API (see the `rollbar.py` file for details). We can't use the default grouping algorithm because it would create hundreds or thousands of separate errors if the write_report_xx dags fail. Instead we use a fingerprint that groups errors by dag, report date and exception type.
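
The grouping idea can be sketched as follows. This is not the code in `rollbar.py`: the function name `build_fingerprint`, the key format, and the use of SHA-1 are assumptions, and how the fingerprint is attached to the Rollbar payload is left out.

```python
import hashlib
from collections import Counter


def build_fingerprint(dag_id, report_date, exc):
    """Group errors by dag, report date, and exception type (not message or stack)."""
    key = "|".join([dag_id, report_date, type(exc).__name__])
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


# Hypothetical burst: 1,000 tasks of one dag fail the same way for one report date.
failures = [
    ("write_report_xx", "2019-12-06", ConnectionError("timeout on object %d" % i))
    for i in range(1000)
]
groups = Counter(build_fingerprint(*failure) for failure in failures)
print(len(groups))  # 1
```

Because the exception message is not part of the key, a burst of similar failures becomes a single Rollbar item, and therefore a single Slack notification.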


### Pagerduty Alerts

Some alerts are configured in Pagerduty. Find the alert name under "Incident Details", then "alarm name":

---

**Minimum HealthStatus LessThanThreshold 1.0 For ClusterIdentifier `cluster`**

This alert means that one or more nodes on one of our datamarts are misbehaving.  If this alert occurs on a Wednesday afternoon, it most frequently means that automatic maintenance is occurring.  The event log can be viewed from
 the Redshift page of the AWS console for more information about why the node is down.

---

**Maximum UnHealthyHostCount GreaterThanThreshold 0.0 for LoadBalancerName `balancer`**

This alert means that the production airflow master has gone down.  This can sometimes happen as a result of a memory issue in the scheduler. The "dumb" way to stand it back up is to redeploy
 piper, but if the problem persists, airflow journals will need to be pulled to see what is going on.

---

**Maximum CPUUtilization GreaterThanThreshold 90.0**

This alert means that the airflow worker CPUs are overtaxed. The level of processing power required by our DAGs is not extreme,
 so the simplest and best solution may be to scale the cluster up.  Be sure to check the dashboard to verify that workers are
 not being overtaxed due to an underlying problem.

---

**Maximum CPUUtilization GreaterThanOrEqualToThreshold 75.0 for ClusterIdentifier `cluster`**

This alert means that one of our redshift clusters is overtaxed.  As long as our SLAs are being met, this may
 be a good candidate for a longer discussion with the team, particularly if changes have been deployed recently
 that may be responsible for a large increase in CPU usage.  However, beyond 50% CPU, redshift clusters
 frequently drag.  Please monitor the DAG progress to verify that reports are being delivered.

---
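
As a quick triage aid, the four entries above can be summarized in a small lookup. This is only an illustrative sketch: the patterns are derived from the alarm names listed in this section, and the function name `first_step` is made up.

```python
import re

# First action per alarm, summarized from the entries above.
# Order matters: the Redshift CPU pattern must come before the generic CPU one.
FIRST_STEPS = [
    (r"HealthStatus .*ClusterIdentifier",
     "Redshift node unhealthy: check the cluster event log in the AWS console."),
    (r"UnHealthyHostCount .*LoadBalancerName",
     "Airflow master down: redeploy piper, then pull the journals if it persists."),
    (r"CPUUtilization .*ClusterIdentifier",
     "Redshift CPU high: monitor DAG progress and report delivery."),
    (r"CPUUtilization",
     "Airflow worker CPU high: check the dashboard, consider scaling up."),
]


def first_step(alarm_name):
    """Return the suggested first action for a Pagerduty alarm name."""
    for pattern, advice in FIRST_STEPS:
        if re.search(pattern, alarm_name):
            return advice
    return "Unknown alarm: check the Pagerduty incident details."


print(first_step("Maximum CPUUtilization GreaterThanThreshold 90.0"))
```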


## Recipes

### Moving Average Data Validation

As new reports are produced, the rollup data is stored in our Aurora Postgres database, adding one row per extension/day and game/day. Once a new day is processed, the new data is aggregated and validated against the moving average of the last 90 days. Since this is a statistical analysis, there's a chance of false alarms, and even if the validator finds broken data, we have 48 hours to fix it, so there's no need to panic. Because of that, validation failures don't trigger a Pagerduty alert, but instead are tracked as Rollbar warnings.

Rollbar warnings are posted to Slack and look like this:

```
New Warning: ValidateExtensionsOverviewV2MovingAvg 2019-12-06
Piper in production
```

When you see a validation error like this one, the first thing to do is to check the full error message in Rollbar, which includes exactly the fields that need to be inspected. For example: `Field uninstalls is 231, expected to be above 3,245 (moving-avg: 4,689 - 3*sigma)`.
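
For intuition about how such a bound arises, here is a minimal sketch of a lower-bound check against a moving average. It is not Piper's validator: the function name `check_field`, the use of the population standard deviation, and the sample history are all assumptions.

```python
from statistics import mean, pstdev


def check_field(name, value, history, n_sigma=3):
    """Return a warning if value is below moving-avg - n_sigma*sigma, else None."""
    avg = mean(history)
    sigma = pstdev(history)
    lower = avg - n_sigma * sigma
    if value < lower:
        return (
            "Field {} is {:,.0f}, expected to be above {:,.0f} "
            "(moving-avg: {:,.0f} - {}*sigma)"
        ).format(name, value, lower, avg, n_sigma)
    return None


# Hypothetical 90-day history alternating between 4,200 and 5,200 uninstalls.
history = [4200, 5200] * 45
print(check_field("uninstalls", 231, history))
# Field uninstalls is 231, expected to be above 3,200 (moving-avg: 4,700 - 3*sigma)
print(check_field("uninstalls", 4500, history))
# None
```

A wide band such as 3 sigma keeps false alarms rare, but a value inside the band is not proof that the data is correct.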

When you know what fields are broken, go to the Grafana piper data visualization dashboard to see the shape of the data that is missing. Use this graph to ask in the relevant Slack channel why the data is missing. Sometimes some other team releases a client with a bug that stops sending metric events to Tahoe, or maybe they intentionally broke the previous data without knowing that this would break our reports (example postmortem: https://docs.google.com/document/d/1Ywp0O8pi2cA3YTMwpZ_9AEqbxBemMWs77aZCK2R76RY/edit#heading=h.ldbyd1c0ffqi). Or maybe it is just a low day for Twitch and no one is using extensions. In any case, there should be an explanation for the data drop.

Once the problem is identified, the report query may need to be updated and deployed, and you may then need to do a data backfill. If you have to re-run the report DAG to do a backfill, make sure that the fixed query can handle both old and new events (e.g. https://git-aws.internal.justin.tv/insights/piper/pull/284).

### Data Backfill

It is rare, but sometimes we find out that a report had bad data for a while. For example, maybe we find out that page-visits was not properly calculated. In those cases, we have to re-generate old data with a backfill.

In Piper, a backfill means that some tasks need to be cleared, so that the Airflow scheduler executes them again and re-generates the reports. When reports are re-calculated, their daily aggregation rows in Postgres are overwritten with the new/fixed data.

For example, to backfill the last 10 days for the extensions_overview_v2 report:

 * Go to the Airflow Dashboard.
 * Open the DAG extensions_overview_v2 in the Tree View tab.
 * Mouse over each daily DAG run to see its Started date. Find the day from which you want to backfill.
 * Click on the bottom task to see the available actions (the tree view is backwards for some reason: the bottom task is the first one).
 * Click on Clear with the options "Future", "Downstream" and "Recursive" selected, so that all the tasks up to today are also cleared.
 * In a few seconds, Airflow will start running all those tasks in order. Reload the page to verify that it is running as expected.
 * Every daily report may take a few hours, so it may be a while before the full backfill is completed.


### SSH into the Airflow master host

Use the AWS Session Manager: go to the AWS console > EC2 service, select the instance you want to connect to, and click on the Connect button. Then choose the "Session Manager" option.

### Checking Airflow Journals (logs)

While piper worker logs are in S3 and available from the airflow dashboard, some runtime information (particularly for the scheduler and website) is
 only available in the journals kept on the master.  SSH into the master host and use `sudo journalctl -e -f -u airflow@scheduler` or
 `sudo journalctl -e -f -u airflow@webserver` to bring up the journals.

### Fixing DAG task failures

Most DAGs are configured to retry tasks 3 times before reporting the error to Rollbar. If that happens, Rollbar will page us in Pagerduty. If you have to handle one of these alerts, go to the Airflow Dashboard and identify the DAG that failed.

The task logs are accessible through the Airflow Dashboard (click on a failed task, and then on "View Log"). Checking the logs is usually the first step to debug the issue (maybe some data was missing, or updated with unexpected values).

When the problem is fixed, just "Clear" the task. The Airflow scheduler will start the pending tasks again within a few seconds (you need to reload the page to see it). If you don't want the task to run again (maybe solving the problem involved manually executing the task), you can instead mark it as "Success" to tell Airflow that the task is done.

Example of task failure alert:

  * Piper retries the task 3 times, then reports an error to Rollbar.
  * Rollbar is configured to page us in Pagerduty. The error may say something like `ValidateExtensionsOverviewV2Fields: missing mandatory field`.
  * Kevin is oncall and is paged. He goes to the Airflow Dashboard and checks that the `extensions_overview_v2` DAG has failed the last run.
  * Kevin goes to Grafana and checks the [Piper Data Visualizations Dashboard](https://grafana.internal.justin.tv/d/000001708/insights-piper-data-visualizations?orgId=1) to see if anything looks odd. Indeed, the last value for `renders_uniq_channels_7d` is zero.
  * A conversation starts on Slack about what could have caused this. Eventually we conclude that the drop was caused by a metrics bugfix that left this value missing.
  * The underlying issue is fixed (for example, the field is renamed or the query is corrected).
  * After fixing the issue, Kevin decides to "Clear" all the tasks in the DAG run, so the metrics are re-generated. If the failing code were on the validation task, Kevin could just clear the validation task to avoid re-running everything for that day.


### Bounce Airflow Master Services (restart)

SSH into the master airflow box (the same box that the airflow dashboard is hosted on) and enter
 `sudo systemctl restart airflow@webserver` and `sudo systemctl restart airflow@scheduler` to bounce these two key services.

Another way to restart the service is by doing a re-deploy from [clean-deploy](https://clean-deploy.internal.justin.tv/#/home/insights).

### Connecting to Aurora

Get the URLs from the AWS account using the RDS dashboard. The database is `daily_incremental`. Passwords are in sandstorm. For example, to get a password from sandstorm:

```bash
sandstorm get --profile twitch-service-piper-aws --role-arn "arn:aws:iam::734326455073:role/sandstorm/production/templated/role/sandstorm-agent-piper-service-prod" insights/piper-daily-incremental/prod/incremental_02
```

To connect to `daily_incremental` as the user `incremental_02`:

```bash
psql -h piper-daily-incremental-1.cluster-ro-cxgpuxtkiepo.us-west-2.rds.amazonaws.com -p 5432 -U incremental_02 -d daily_incremental
```

### Data Recovery

Piper doesn't really own any data. The CSV reports are generated from aggregated metrics that are loaded from other sources (Tahoe, Discovery, Gringotts, etc.). However, the ability to quickly show CSV data from the last 90+ days in our CSV reports depends on having the aggregations for every day available in the `piper-daily-incremental` database.

The `piper-daily-incremental` database is an AWS RDS Aurora, with write and read replicas. If you look in the `twitch-service-piper-aws` or `twitch-service-piper-dev` accounts, you will see some instances on the RDS service. The `Snapshots` are daily backups that can be used to restore the data (Click on the snapshot and then "Restore Snapshot"). After restoring the data, make sure to keep the master-replica setup. If only the replica was lost, then you could alternatively create another replica. If only the master was lost, then you could promote the replica to be the new master, and then create another replica. There are different options to restore the rollups data.

Once data is recovered from the snapshot or replica, make sure to re-run the Airflow DAGs for the most recent days (the snapshot may be a few days old).
