[Incident Index](#incident-index) | [PagerDuty Onboarding](pagerduty_onboarding.md)

- - -

It's 3AM -- Smoca is on fire! What do you do? Have no fear - this runbook will try to walk you through it!

Smoca is integrated into [PagerDuty][1]. As soon as a Smoca scenario fails on [Smoca-Lambchops](../infrastructure.md#qa-smoca-lambchops), a PagerDuty notification is triggered.

When Smoca reports no more errors, the PagerDuty incident will automatically resolve.

## Table of Contents
- [When Assigned an Incident](#when-assigned-an-incident)
- [Incident Index](#incident-index)
    - [QA Smoca](#qa-smoca)
    - [QA Selenium Grid](#qa-selenium-grid)
- [Escalating to Engineering On-Call](#escalating-to-engineering-on-call)

## When Assigned an Incident

Acknowledge the incident as soon as you're aware of it and are preparing to look into it. If you don't acknowledge it, the incident could escalate to Level 2 On-Call.

The PagerDuty incident includes the failure message. See [an example here](https://twitchoncall.pagerduty.com/incidents/P1BFT1M).

Click "View In Smoca" within that incident to be directed to the Jenkins URL.

Immediately:
- Check [Smoca-Lambchops][Lambchops Link] for the log output
- Check [Health Metrics Dashboard][Health Metrics]

If you're on-call, it's not necessarily your responsibility to fix every issue you see. Your responsibility is to do the initial investigation and triage when necessary.


## Incident Index

Identify your issue based on Service. Search for "Impacted Service" on the PagerDuty Incident.

- [QA Smoca](#qa-smoca)
- [QA Selenium Grid](#qa-selenium-grid)

### QA Smoca

Common Exceptions (click each for more detail):

[Scenario Failed: Expected/Unable to find \[css, content, url\]](items/smoca/expected_to_find.md)

[API request error](items/smoca/api_request_error.md)

[Timed out waiting for page load](items/smoca/timed_out_page_load.md)

[RSpec::Core::MultipleExceptionError](items/smoca/multiple_exception_error.md)

[undefined method '\[\]' for nil:NilClass](items/smoca/undefined_method_nil.md)


### QA Selenium Grid

Common Exceptions (click each for more detail):

[CPU Utilization GreaterThanOrEqualToThreshold](items/selenium_grid/cpu_utilization.md)

[Smoca Session was terminated due to \[CLIENT_GONE, CLIENT_TIMEOUT\]](items/selenium_grid/session_terminated.md)

## Escalating to Engineering On-Call

If you have validated that the issue is not a Smoca issue (screenshots, replication, etc.), you'll want to escalate to the engineering on-call.

It's not an exact science. Use your judgment when classifying an issue. If you're ever unsure, post in Slack.

#### Severity
Does this impact our users on a wide level?

- Sev1
  - Process of Escalation: Post in #emergency. Open an incident with the appropriate on-call. Wake people up.
  - Examples: Twitch isn't loading at all. Login is broken. Directories are blank.
- Sev2
  - Process of Escalation: Post in #site-production. Use judgment to notify on-call via PD. If the issue is non-urgent but should still be addressed, an @-mention in Slack also works.
  - Examples: Live email notifications aren't sending. Style sheets are occasionally not loading.
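
The severity list above can also be read as a small lookup table. The following is only an illustrative sketch, not an existing tool: the `EscalationStep` type, the `ESCALATION` dictionary, and the `escalation_for` helper are hypothetical names, and only the Slack channels and paging guidance come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    """Where to post and whether to page, for one severity level."""
    slack_channel: str
    page_on_call: str  # "always" or "judgment"


# Values mirror the severity list in this runbook; the structure is hypothetical.
ESCALATION = {
    "sev1": EscalationStep(slack_channel="#emergency", page_on_call="always"),
    "sev2": EscalationStep(slack_channel="#site-production", page_on_call="judgment"),
}


def escalation_for(severity: str) -> EscalationStep:
    """Return the escalation step for a severity label such as 'Sev1'."""
    key = severity.strip().lower()
    if key not in ESCALATION:
        raise ValueError(f"Unknown severity {severity!r}; if unsure, post in Slack")
    return ESCALATION[key]


if __name__ == "__main__":
    step = escalation_for("Sev2")
    print(step.slack_channel, step.page_on_call)
```

A table like this is only a reminder of the defaults; judgment still decides which severity applies.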

#### How to Re-Assign

You can either open the incident's link, click Reassign, and select the proper escalation path (see [Teams](#teams) below), or click "Create a New Incident" on the front page of PagerDuty.

After re-assigning, it's often helpful to also ping the new assignee in Slack #emergency with further details.

##### Teams
In general:
- Identity: Log In / Sign Up Issues
- Search & Discovery: Directories blank. Jax isn't returning data.
- Level3 CDN: Issues with our CDN. High 5xx rates, style sheets broken.
- Product Engineering: Everything Else
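
As a rough illustration of the routing above, here is a minimal sketch that suggests a team from an incident summary. The keyword lists and the `suggest_team` helper are assumptions made for this example (they are not part of Smoca or PagerDuty); only the team names and their areas come from the list above.

```python
# Team names come from the list above; the keywords are hypothetical guesses
# at how those areas might show up in an incident summary.
TEAM_KEYWORDS = [
    ("Identity", ("log in", "login", "sign up", "signup")),
    ("Search & Discovery", ("directory", "directories", "jax")),
    ("Level3 CDN", ("cdn", "5xx", "style sheet", "stylesheet")),
]

# Anything that matches no keyword falls through to the catch-all team.
DEFAULT_TEAM = "Product Engineering"


def suggest_team(summary: str) -> str:
    """Suggest an escalation team based on keywords in the incident summary."""
    text = summary.lower()
    for team, keywords in TEAM_KEYWORDS:
        if any(keyword in text for keyword in keywords):
            return team
    return DEFAULT_TEAM


if __name__ == "__main__":
    for summary in ("Login is broken", "Directories are blank", "Chat is slow"):
        print(f"{summary!r} -> {suggest_team(summary)}")
```

Treat the result as a starting point only; if the suggestion looks wrong, use your judgment or ask in Slack.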

[1]: https://twitchoncall.pagerduty.com/services/PW4RCBO
[Health Metrics]: https://grafana.internal.justin.tv/dashboard/db/smoca
[3]: http://grid.us-west2.justin.tv/grid/console
[Lambchops Link]: https://jenkins.internal.justin.tv/job/qa-smoca-lambchops/
