Gamified Project Signals Design
===============================

## Documentation
* End user-facing wiki: http://link.twitch.tv/rps
* Slide deck: http://link.twitch.tv/dta-rps-preso
* API docs: a draft is started in this document and will also live in the code
  repository
* Playbook: to be written, in the code repository as markdown

## Communication channels
* rps-team@ mailing list
* rps-announce@ mailing list
* #rps Slack channel
* JIRA: http://link.twitch.tv/dta-rps-jira

## Frontend

The web client is written in React and simply hosted via static HTTP serving. It
talks to the backend REST/JSON API endpoint `/api/v1` on the same port that it
was served from.

[Design doc for the web client](../client/web/README.md)

## Backends

The backend is a single HTTP service binary running in a Docker container in an
AWS Elastic Beanstalk managed environment. The Beanstalk environment includes an
Elastic Load Balancer and an auto-scaling group of EC2 instances running the
Elastic Container Service agent which manages the Docker containers on each
instance.

This single binary is stateless and presents a set of services:

* HTTP/1.1 static serving of the web UI client.
* HTTP/1.1 RESTful JSON API provided by the grpc-gateway reverse proxy.
* HTTP/2 gRPC hosting several microservices.

All of these run from one TCP port which is "muxed" between the HTTP/1.1 and
HTTP/2 services.
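
The muxing decision can be pictured as a small routing function. The sketch below is illustrative Python, not the Go code in the backend. It assumes the mux separates traffic by HTTP major version and `Content-Type`; the function name `route` and the returned labels are hypothetical.

```python
def route(proto_major: int, content_type: str, path: str) -> str:
    """Return which service should handle a request arriving on the shared port."""
    # gRPC runs over HTTP/2 and marks its requests with application/grpc.
    if proto_major == 2 and content_type.startswith("application/grpc"):
        return "grpc"
    # The grpc-gateway reverse proxy owns the REST/JSON prefix.
    if path == "/api/v1" or path.startswith("/api/v1/"):
        return "gateway"
    # Everything else falls through to static serving of the web client.
    return "static"
```

With this rule, a browser request for `/index.html` is served statically, a JSON request under `/api/v1` goes to the gateway, and only HTTP/2 requests carrying a gRPC content type reach the microservices directly.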

In the current implementation, TLS is disabled and the services are running on
port 80. There is also a "debug" port running on port 8080 that provides access
to Go expvars and profiling data.

The "classic" style Elastic Load Balancer is configured like so:

Listening Port | Listener Protocol | Instance Port | Instance Protocol | Description
:------------: | :---------------: | :-----------: | :---------------: | -----------
80             | TCP               | 80            | TCP               | Main services port
443            | HTTPS             | 80            | HTTP              | Uses \*.internal.justin.tv certificate
8080           | HTTP              | 8080          | HTTP              | Debug port

The primary service port must be "TCP" because Elastic Load Balancers do not
understand HTTP/2. The "application" style load balancers are documented to
"support" HTTP/2 but can only accept HTTP/2 connections which are converted
to HTTP/1.1 (without TLS) in order to connect to the instance. gRPC requires
full HTTP/2 connections in order to support streaming, TLS, and fast
bi-directional RPC.

Eventually, the main service port will be TLS on port 443. There are three
possible ways to get there: the load balancer stays ignorant of TLS and naively
forwards the TCP connection; TLS is terminated at the load balancer, which then
opens a separate TLS connection to the backends; or AWS implements a better
solution for HTTP/2.

In addition to serving, the backend polls an SQS queue for the following kinds
of requests:
* blueprint "ingest" to update project metadata from blueprints stored in
  GitHub.
* GitHub status collection for commits pushed to the default branches of
  repositories.

### Authentication and Authorization

Authentication hasn't yet been implemented.

It is planned that the backend services will reject requests that aren't
accompanied by valid OAuth tokens in the request headers. Guardian will be
used to validate the OAuth tokens and provide an API to look up a user's groups
for authorization. TLS will also need to be implemented as a prerequisite.

The backend does do its own authentication to GitHub using an OAuth token that
is stored in Sandstorm.

### API

The primary API the backends present is a set of proto3 gRPC microservices
defined in .proto files.

There is a grpc-gateway reverse HTTP/1.1 proxy running on the main port handling
requests to `/api/v1`. The grpc-gateway presents a RESTful JSON API that is more
appropriate for the web UI to talk to. It also generates a Swagger file from the
gRPC/proto3 schema so that Swagger clients can easily talk to the API. The
Swagger file is rendered into human-readable HTML documentation served at
`/api/swagger`.

### gRPC microservices

There are 3 microservices being served by the backend.

The ProjectMetadataService is a simple service to list known projects, get
project metadata about them, and update that metadata. This information is used
in order to allow the MetricService to correlate "project" with resources such
as GitHub repositories and Jenkins jobs. Also included is the association with
developers, teams, and organizations. Other services are also welcome to consume
this data for their own purposes.

The EventService allows queries of time series of events stored in the events
datastore. Queries can be filtered by time range, event type, project, and
developer.

The MetricService allows queries of named metrics by time range, bucketed by
time slices, and filtered by project, developer, team, or organization.
Metrics calculators are easy to add using a plugin/registry system and declare
what levels they can be filtered at.
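
A plugin/registry system of this kind could look roughly like the sketch below. It is illustrative Python rather than the backend's Go code; the decorator `metric`, the `calculate` helper, and the event fields `added` and `deleted` are all hypothetical names chosen for the example.

```python
from typing import Callable, Dict, FrozenSet, Iterable, List, Tuple

# Filter levels named in the text above.
LEVELS: FrozenSet[str] = frozenset({"project", "developer", "team", "organization"})

# Hypothetical registry: metric name -> (declared levels, calculator function).
_REGISTRY: Dict[str, Tuple[FrozenSet[str], Callable[[List[dict]], float]]] = {}


def metric(name: str, levels: Iterable[str]):
    """Register a calculator under a name, declaring the levels it supports."""
    declared = frozenset(levels)
    unknown = declared - LEVELS
    if unknown:
        raise ValueError(f"unknown filter levels: {sorted(unknown)}")

    def decorator(fn):
        _REGISTRY[name] = (declared, fn)
        return fn

    return decorator


@metric("code_churn", levels={"project", "developer"})
def code_churn(events: List[dict]) -> float:
    # Assumed event shape: each push reports lines added and deleted.
    return float(sum(e["added"] + e["deleted"] for e in events))


def calculate(name: str, level: str, events: List[dict]) -> float:
    """Run a registered calculator, rejecting levels it did not declare."""
    levels, fn = _REGISTRY[name]
    if level not in levels:
        raise ValueError(f"metric {name!r} cannot be filtered by {level!r}")
    return fn(events)
```

Adding a new metric then only requires writing one decorated function; the service can reject unsupported filters before doing any queries.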

For example, you may query for a metric named "code_churn" for July-September,
by week, filtered by project "Skadi". The service, in turn, first does a query
against the ProjectMetadataService to find out what GitHub repositories that
project uses. Then it does queries to EventService to get all "GitHub-push"
type events during that same range for those repositories.
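
The query flow above can be sketched end to end with in-memory stand-ins for the two upstream services. This is an illustrative Python sketch under stated assumptions: the dictionaries `PROJECT_REPOS` and `EVENTS`, the repository names, the dates, and the function names are invented for the example, and it counts pushes per week as a simple placeholder for a real churn calculation.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

UTC = timezone.utc

# Hypothetical stand-in for ProjectMetadataService: project -> repositories.
PROJECT_REPOS = {"Skadi": ["dta/skadi"]}

# Hypothetical stand-in for the events returned by EventService (made-up data).
EVENTS = [
    {"type": "GitHub-push", "repository": "dta/skadi",
     "timestamp": datetime(2016, 7, 4, 12, tzinfo=UTC)},
    {"type": "GitHub-push", "repository": "dta/skadi",
     "timestamp": datetime(2016, 7, 6, 9, tzinfo=UTC)},
    {"type": "GitHub-push", "repository": "dta/skadi",
     "timestamp": datetime(2016, 7, 12, 17, tzinfo=UTC)},
    {"type": "GitHub-push", "repository": "dta/other",
     "timestamp": datetime(2016, 7, 5, 8, tzinfo=UTC)},
    {"type": "GitHub-push", "repository": "dta/skadi",
     "timestamp": datetime(2016, 10, 3, 8, tzinfo=UTC)},
]


def week_start(ts: datetime) -> datetime:
    """Truncate a timestamp to 00:00 on the Monday of its week."""
    monday = ts - timedelta(days=ts.weekday())
    return monday.replace(hour=0, minute=0, second=0, microsecond=0)


def pushes_per_week(project: str, start: datetime, end: datetime) -> dict:
    # Step 1: ask the metadata stand-in which repositories the project uses.
    repos = set(PROJECT_REPOS[project])
    # Step 2: keep push events for those repositories inside [start, end).
    buckets = Counter()
    for event in EVENTS:
        if (event["type"] == "GitHub-push"
                and event["repository"] in repos
                and start <= event["timestamp"] < end):
            # Step 3: bucket the matching events by week.
            buckets[week_start(event["timestamp"])] += 1
    return dict(sorted(buckets.items()))
```

Calling `pushes_per_week("Skadi", datetime(2016, 7, 1, tzinfo=UTC), datetime(2016, 10, 1, tzinfo=UTC))` groups the three in-range pushes to the project's repository into weekly buckets and ignores the other repository and the October push.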

Example metrics:
* open/unresolved urgent tickets
* median days between releases
* median days between commit and deploy
* % build cycles red
* % test coverage
* % changes with positive review
* median days PR spends in review
* % failed health checks
* Incidents
* Code churn
* Pager duty

### Logging

The infrastructure provided by the Elastic Beanstalk solution stack allows
logs to be streamed to CloudWatch log groups. Log groups are currently
configured to retain log entries for 7 days.

In addition to the standard Beanstalk, EC2, and AWS agent logs, the following
logs are also streamed:
* Combined stdout and stderr from `rockpaperscissors`.
* An HTTP access log in Apache "combined" format.
* A gRPC request log in JSON format.

### Monitoring and Alerting

Monitoring is implemented with CloudWatch alarms defined in Terraform.
See `./terraform/modules/rockpaperscissors-alerts` for the list of
configured alarms.

The alarms are configured to send messages to an SNS Topic that
PagerDuty is subscribed to so alarms cause incidents to be created
in PagerDuty and the on-call notified.

### API Metrics & Instrumentation

The backends already use some common metrics collection and instrumentation
infrastructure provided by the `twitchserver` and `chitin` libraries including:
* Uncaught panics are reported to Rollbar.
* Periodic Go metrics are sent to our graphite/statsd instance.
* Trace headers are injected into gRPC requests and responses.

In addition, the HTTP access log and gRPC request logs both have metrics about
request latencies.

### Test Plan

There are currently unit tests with excellent coverage, and the build generates
browsable coverage reports.

A setup for integration or end-to-end tests still needs to be written, as do
the criteria for release acceptance.

## Event Publishers and Ingest Pipeline

The key data that this system works with are "events" from development-related
activities such as when developers commit code, when deployments are made, and
when tests are run. From these events, metrics about overall project health are
calculated.

### Publishers

Publishers create events and put them onto the "event bus". The event bus is
currently implemented as an AWS Kinesis stream. The "event" is a protobuffer
that has a unique ID, a timestamp, a type, and an opaque binary blob that is in
some format specific to the event type.

There is an "eventpublisher" Go library and CLI that makes it easy to create and
publish the events. Currently, it simply uses AWS credentials and IAM to have
write access to the Kinesis stream.

#### GitHub

GitHub webhooks can be configured for organizations/repositories to send
HTTP POST requests to an API Gateway endpoint backed by a Lambda function. That
Lambda creates events from the requests and publishes them to the event bus.

If the event is too large to be put on the event bus, then the event is stored
in S3 as a "large event".

#### Jenkins

A Jenkins plugin named `rockpaperscissors-jenkins`, written in Java, publishes
information about builds and tests onto the event bus.

#### Phabricator

Some code reviews are conducted in Phabricator. It would be useful to also
publish events about code reviews so we could calculate metrics such as whether
positive reviews were completed, sizes of reviewed changes, and latency of
reviews.

#### Skadi/Courier

Deployment and rollback metrics could be calculated from events published from
Skadi or Courier about deployments.

#### Jira

A Lambda function queries JIRA every hour for production incidents and open
bugs and publishes events based on the results to the event bus.

### Event Stream Consumer

The event bus is consumed by an AWS Lambda function written in Python. This
function is invoked by AWS's internal polling of the Kinesis stream. The
function reads the Kinesis stream in chunks. For each event,
it validates that the event is a valid protobuf, has required fields, and then
simply stores that event into the event datastore (a DynamoDB table).
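
The per-event validation step might be sketched as follows. This is an illustrative Python sketch, not the actual lambda: the `Event` dataclass is a hypothetical in-memory mirror of the protobuf fields, and which fields count as required (and the 16-byte UUID check) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Event:
    """Hypothetical mirror of the Event protobuf fields described earlier."""
    uuid: bytes
    timestamp: float
    type: str
    body: bytes


def validation_errors(event: Event) -> List[str]:
    """Return a list of problems; an empty list means the event is acceptable."""
    errors = []
    if len(event.uuid) != 16:  # assumed: a 128-bit UUID in binary form
        errors.append("uuid must be 16 bytes")
    if event.timestamp <= 0:
        errors.append("timestamp must be positive")
    if not event.type:
        errors.append("type is required")
    return errors


def partition(batch: List[Event]) -> Tuple[List[Event], List[Tuple[Event, List[str]]]]:
    """Split a batch into events to store and rejected events with reasons."""
    valid, rejected = [], []
    for event in batch:
        errors = validation_errors(event)
        if errors:
            rejected.append((event, errors))
        else:
            valid.append(event)
    return valid, rejected
```

Keeping the rejected events together with their reasons makes it straightforward to log them, or to write them somewhere durable instead of dropping them.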

If the event isn't valid, the function currently just logs the occurrence and
drops the invalid event. It would be better to change this to implement a
"dead letter queue" in S3.

If the event is too large to be stored in DynamoDB, it's stored in S3 as a
"large event". Handling of large events hasn't yet been implemented. The
plan is to introduce an extra field in events that indicates it's a large
event and where in S3 to fetch the body of the event.
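
One possible shape for the planned large-event handling is sketched below in Python. Everything here is an assumption: the field names `large_event` and `s3_key`, the key prefix, and the inline limit are hypothetical. The only external fact relied on is that DynamoDB caps a single item at 400 KB, so the limit leaves some room for the other attributes.

```python
# Hypothetical threshold with a margin below DynamoDB's 400 KB item size limit.
INLINE_LIMIT_BYTES = 350 * 1024


def storage_plan(uuid_hex: str, body: bytes, limit: int = INLINE_LIMIT_BYTES) -> dict:
    """Decide whether an event body is stored inline or replaced by an S3 pointer."""
    if len(body) <= limit:
        # Small enough: keep the body in the DynamoDB item itself.
        return {"large_event": False, "body": body}
    # Too large: store only a flag and the S3 location of the body.
    return {"large_event": True, "s3_key": "large-events/" + uuid_hex}
```

A reader of the events table would check the flag first and fetch the body from S3 only when it is set.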

One nice advantage of this separate lambda function is that persisting the
events into the datastore isn't dependent on the backend services being
available. The storage code is simple and should be quite stable over time but I
expect that the code for the backend services may have a lot of churn. Ingesting
the event stream is very important so I want to make sure that it has the
highest availability.

Currently, this "event_store_lambda" processes each event in about 9.1ms. If
events are being sent to the event bus at a sustained rate that is faster than
that, then we will need to develop a better solution. Hopefully, the API
provided by the eventpublisher library will remain stable and event publishers
will be isolated from this detail.
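
To make the limit concrete, the arithmetic can be written out as a short Python sketch. It assumes events are processed strictly one after another by a single consumer at the quoted 9.1 ms each; the function names are hypothetical.

```python
PER_EVENT_SECONDS = 0.0091  # the ~9.1 ms per event quoted above


def max_events_per_second(per_event_seconds: float = PER_EVENT_SECONDS) -> float:
    """Sustained ceiling for a single sequential consumer."""
    return 1.0 / per_event_seconds


def backlog_after(arrival_rate: float, seconds: float) -> float:
    """Events left unprocessed after `seconds` at a steady arrival rate."""
    surplus = arrival_rate - max_events_per_second()
    return max(0.0, surplus * seconds)
```

Under these assumptions the ceiling is roughly 110 events per second. A sustained 150 events per second would leave a backlog of about 2,400 events after one minute, while any rate below the ceiling leaves none.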

## Project Metadata Ingestion

Project metadata is ingested from what I term "blueprint" files that are stored
in GitHub. Changes to ".blueprint" files in repositories are automatically
picked up by listening on the event bus for GitHub push events that change a file
ending in ".blueprint" on the root of the tree on the default branch of the
repository. When such an event is seen, a request is added to an AWS SQS queue to
"ingest" that blueprint. The backend has a poller that consumes that queue and
then talks to GitHub (using an OAuth token stored in Sandstorm) in order to
"ingest" the new version of the blueprint and update the project metadata in the
datastore.
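
The detection rule can be sketched as a small filter over a push event. This Python sketch assumes the event carries fields shaped like GitHub's push webhook payload (`ref`, and `commits` with `added` and `modified` path lists); the function name and the choice to ignore removed files are assumptions for illustration.

```python
from typing import List


def changed_blueprints(push_event: dict, default_branch: str) -> List[str]:
    """Return root-level .blueprint paths touched by a push to the default branch."""
    # Only pushes to the default branch are of interest.
    if push_event.get("ref") != "refs/heads/" + default_branch:
        return []
    paths = set()
    for commit in push_event.get("commits", []):
        for key in ("added", "modified"):
            paths.update(commit.get(key, []))
    # Keep files at the root of the tree whose names end in ".blueprint".
    return sorted(p for p in paths if "/" not in p and p.endswith(".blueprint"))
```

Each returned path would then become one "ingest" request on the SQS queue.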

This aspect of this system is needed to scratch my own itch of needing to
correlate activities from several services with a logical "project" but I could
see that many other systems and services may also want to consume this
information.

Also, we may want to move more configuration out of tool- and service- specific
configuration files into a more generic ".blueprint" file. For example, Jenkins,
Manta, and Courier configs could be moved to this one blueprint file.

### Blueprint Formats

The "blueprint" file is currently a text-formatted ProjectMetadata protobuf
file, but that format may change in the future. I could
not find a canonical source of documentation about the protobuf text format but
it's relatively simple to understand from an example.

Blank lines and comments starting with `#` are stripped out of the file before
passing it to the protobuf text unmarshaler.
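
The stripping step might look like the following Python sketch. It assumes that only whole-line comments are removed, so a `#` inside a quoted string value is left alone; the function name is hypothetical and this is not the backend's actual code.

```python
def strip_blueprint(text: str) -> str:
    """Drop blank lines and whole-line '#' comments from blueprint text."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        kept.append(line)
    return "\n".join(kept)
```

The result is what would be handed to the protobuf text unmarshaler.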

The [schema definition of ProjectMetadata](../proto/project_metadata.proto)
has inline comments documenting each field and its type.

Example:
```
project_id: "rockpaperscissors"
project_name: "Rock Paper Scissors"
tech_lead: "fullht"
developer_email_list: "rps-team@twitch.tv"
source_repositories: <
  github_repository: <
    name: "dta/rockpaperscissors"
  >
>
issue_tracker: <
  assignee: "fullht"
  jira_project: <
    project: "SYS"
    component: "RPS"
  >
>
code_review: <
  reviewer: "fullht"
  github_pull_request: <
    upstream_branch: "master"
    repository: <
      name: "dta/rockpaperscissors"
    >
  >
>
```

At some point, the blueprint file format may be changed to be in a programming
language. I'm exploring using a functional, sandboxed, embeddable language to
create more of a DSL. ML-derived languages are currently my favorites for this.

Advantages include:
* A more compact and friendly syntax.
* A documented language that includes conditionals and string munging.
* Possibly having "libraries" of functions to set smart reasonable defaults.
* "Client"-side validation including required fields and validating the
  expectations of the contents of string fields.

For example, a future blueprint that is equivalent to the above may instead look
more like:
```
blueprint(
  id = "rockpaperscissors",
  name = "Rock Paper Scissors",
  tech_lead = ["fullht"],
  developer_email_list = "rps-team@twitch.tv",
  github = GitHub("dta/rockpaperscissors"),
  jira = Jira(project="SYS", component="RPS")
)
```

## Datastores and Data Formats

### events table

The events DynamoDB table stores all of the events given to the event bus.

Attribute  | Type                  | Description
---------- | --------------------- | -----------
uuid       | binary, partition key | 128-bit UUID
timestamp  | decimal, range key    | Unix epoch seconds with fractional seconds
type       | string                | type of the event
body       | binary                | Event protobuffer
attributes | JSON map              | key/value pairs to filter events with

Note that the "body" here is actually the Event protobuffer, which also includes
these same fields. These attributes here could be seen as just a "projection" of
the fields within the protobuf so DynamoDB can query/filter on them. The "body"
field within the protobuf is the actual event which is an opaque blob with a
format selected by the event type.
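
The "projection" idea can be illustrated with a small Python sketch that builds a table item from an already-serialized event. The function name and parameters are hypothetical. The timestamp is converted to `Decimal` because the table stores it as a decimal with fractional seconds, and building it from a string avoids binary floating-point noise.

```python
from decimal import Decimal


def to_item(uuid: bytes, timestamp: float, event_type: str,
            serialized_event: bytes, attributes: dict) -> dict:
    """Build an events-table item whose top-level fields mirror the protobuf."""
    return {
        "uuid": uuid,                            # partition key
        "timestamp": Decimal(str(timestamp)),    # range key, epoch seconds
        "type": event_type,
        "body": serialized_event,                # the full Event protobuf bytes
        "attributes": dict(attributes),          # key/value pairs for filtering
    }
```

The protobuf in `body` remains the source of truth; the other fields exist only so DynamoDB can query and filter without unmarshaling it.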

The names of types, formats of the bodies, and available attributes for each
type of event are [documented here](events.md).

### projects table

The projects DynamoDB table stores metadata about projects.

Attribute        | Type                  | Description
---------------- | --------------------- | -----------
project_id       | string, partition key | short name of the project
project_name     | string                | nice display name for the project
project_metadata | binary                | ProjectMetadata protobuffer

Like the events table, the project_id and project_name attributes should be
considered to be "projections" of the same fields from the project_metadata.
They allow DynamoDB to query/filter by those attributes and allow quick listing
of projects without having to unmarshal every project_metadata protobuf.
