# GraphDB Run Book

## Service or system overview 

### Business overview

GraphDB is a storage database that holds edge and node information, along with metadata about the edges or
nodes.  It was designed as a replacement for Cohesion and is currently used for Twitch's following graph.

### Technical overview

The primary storage backend for GraphDB is DynamoDB.  DynamoDB data is cached in ElastiCache (memcached) to keep hot
keys from overloading DynamoDB.

### Service Level Agreements (SLAs)

We aim for 99% of single-item read requests to respond in less than 50ms and 99% of single-item write requests to
respond in less than 100ms.

### Service owner

GraphDB is owned by discovery infra and is supported in Slack on channel #graphdb.

### Endpoints

DNS is managed by the [internal DNS tool](https://dashboard.internal.justin.tv/dns).  We support both HTTPS and HTTP
requests.

| Environment | HTTP | HTTPS | 
|-------------|-----| ----- |
| Integration | http://graphdb-integration.internal.justin.tv | https://graphdb-integration.internal.justin.tv |
| Staging | http://graphdb-staging.internal.justin.tv | https://graphdb-staging.internal.justin.tv |
| Production | http://graphdb-production.internal.justin.tv | https://graphdb-production.internal.justin.tv |

## Dependencies

### DynamoDB

DynamoDB is the source of truth for graph information.

#### Limits

DynamoDB uses auto scaling for read/write capacity.  When capacity jumps, it takes a few minutes for auto scaling to 
catch up.  We cap at 12,500 write units and 9,000 read units.

#### What could go wrong?

DynamoDB is generally reliable, but its p99 request latency can spike.  There is nothing we can do about that.

DynamoDB could temporarily throttle reads or writes while capacity auto scales.  If you don't want to wait
for auto scaling, you can manually increase the read/write capacities.
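
A manual capacity increase can be sketched with the AWS CLI as below.  This is only an illustration: the helper name
`raise_capacity` and the table name you pass in are hypothetical, and the guard assumes the caps listed under Limits
apply per table.  Auto scaling may later move the values again.

```bash
# Sketch only: raise_capacity is a hypothetical helper, not part of GraphDB tooling.
# Caps taken from the Limits section (assumed here to apply per table).
MAX_READ_UNITS=9000
MAX_WRITE_UNITS=12500

# raise_capacity TABLE READ_UNITS WRITE_UNITS
raise_capacity() {
  local table="$1" read_units="$2" write_units="$3"

  # Refuse values above the documented caps.
  if [ "$read_units" -gt "$MAX_READ_UNITS" ] || [ "$write_units" -gt "$MAX_WRITE_UNITS" ]; then
    echo "requested capacity exceeds cap (${MAX_READ_UNITS} read / ${MAX_WRITE_UNITS} write)" >&2
    return 1
  fi

  aws dynamodb update-table \
    --region us-west-2 \
    --table-name "$table" \
    --provisioned-throughput "ReadCapacityUnits=${read_units},WriteCapacityUnits=${write_units}"
}
```

For example, `raise_capacity <table-name> 5000 8000` would request 5,000 read units and 8,000 write units on the
named table.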

If list requests take too long, our circuit breakers can start rejecting requests with 'concurrency reached' errors.
If this happens consistently, the solution is to increase our concurrent request limit in the code.

### Elasticache

All requests to DynamoDB first hit elasticache.

#### Limits

We are limited to 210GB stored in the cache.  Memcache is generally network bound and we are limited to 7,000 Mbps
per host.

#### What could go wrong?

If we run out of space, we could start evicting items.  This shouldn't be a concern if it happens at a low rate.

Our traffic pattern could change in a way that bypasses the cache.  This would cause p99 times to increase.  The
solution is to either prepopulate more data into the cache or resolve what caused traffic patterns to change upstream.

If our instances do too much network traffic, we could begin serving requests slower than needed.  The solution there
is to deploy more elasticache instances into the cluster.

### SQS

SQS allows GraphDB to do asynchronous requests (write this later, etc) and repair count changes.

#### Limits

There are no practical limits on how much traffic we can send to SQS.  Our rate of draining traffic from SQS is
limited by the number of hosts draining from it (each ECS task processes one message at a time).  A rising number of
messages in the queue indicates that we are hitting this limit.
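
One way to check whether the queue is growing is to sample its depth twice with the AWS CLI.  The sketch below is an
illustration only: the helper names `queue_depth` and `queue_is_rising` are hypothetical, and you need to supply the
real queue URL yourself.

```bash
# Sketch only: helper names are hypothetical; pass the real queue URL as the first argument.

# queue_depth QUEUE_URL
# Prints the approximate number of messages waiting in the queue.
queue_depth() {
  local queue_url="$1"
  aws sqs get-queue-attributes \
    --region us-west-2 \
    --queue-url "$queue_url" \
    --attribute-names ApproximateNumberOfMessages \
    --query 'Attributes.ApproximateNumberOfMessages' \
    --output text
}

# queue_is_rising QUEUE_URL [INTERVAL_SECONDS]
# Samples the depth twice; exits 0 if the second sample is larger than the first.
queue_is_rising() {
  local queue_url="$1" interval="${2:-60}"
  local first second
  first="$(queue_depth "$queue_url")" || return 2
  sleep "$interval"
  second="$(queue_depth "$queue_url")" || return 2
  echo "depth: ${first} -> ${second}"
  [ "$second" -gt "$first" ]
}
```

A single pair of samples is noisy, so treat a positive result as a prompt to look at the queue metrics rather than as
proof of a backlog.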

#### What could go wrong?

We might not process messages fast enough, causing the queue size to rise.  This usually means each message is taking
longer than usual to process.  Check other systems for a bottleneck.  For example, maybe DynamoDB capacity needs
to be increased.

### ECS

ECS is where GraphDB is deployed.  GraphDB will auto scale tasks to keep total CPU usage < 60%.

#### Limits

We are limited by the number of hosts in the cluster.  If we reach this limit, increase the size of the ECS cluster.

#### What could go wrong?

A deployment could fail if the cluster does not have enough spare CPU or memory to deploy more tasks.  We are usually
CPU limited.  To increase the CPU of the cluster, deploy more hosts into it.

### ALB

All traffic to GraphDB goes through an ALB before hitting GraphDB.

#### Limits

There are no practical limits to GraphDB's use of an ALB.

#### What could go wrong?

If the DNS name of the ALB changes, traffic may fail to reach GraphDB.  You can update the DNS record using
https://dashboard.internal.justin.tv/dns/

### Slack bot

A Slack bot allows moderation and safety teams to perform common read/write operations on GraphDB autonomously without
having to file a ticket for us.

#### Limits

The Slack bot is deployed behind a Lambda and API Gateway and is subject to Lambda's limit of 1,000 concurrent
instances.  We are unlikely to hit this limit.

#### What could go wrong?

The Slack bot is a low-priority service, so support for it can be delayed without issue.  If the bot has trouble
connecting to GraphDB, it may need a new URL set in the SSM parameter store at https://us-west-2.console.aws.amazon.com/ec2/v2/home?region=us-west-2#Parameters:sort=Name

The Slack bot logs traffic to #graphdb-access-logs in Slack.

## API

### Twirp

Twirp is the main way to communicate with GraphDB, and the API is documented in the proto files of the service.
The Twirp service `GraphDB` is the recommended interface.  The Twirp service `datastorerpc.Reader` is a transitional
interface that generally mirrors Cohesion's API.

GraphDB also supports Cohesion's old /v1 HTTP API with the same URLs as Cohesion's.

## System characteristics

### Traffic hours

Traffic to the service follows Twitch's normal web traffic patterns.

### Data and processing flows

All data to GraphDB comes from the ALB.

### Resilience, Fault Tolerance (FT) and High Availability (HA)

The service is currently only deployed in us-west-2.  It requires coordination at the DynamoDB layer and uses Hystrix
circuit breakers to open and close circuits to its dependencies.


### Throttling and partial shutdown

The service cannot be directly throttled.  However, it internally uses Hystrix to communicate with all downstream
dependencies, and the Hystrix circuits will return errors to users whenever a downstream circuit is open.

To explicitly throttle the service, you can increase or decrease the allowed concurrent connections to each hystrix
circuit.


### Environments

The service promotes through 4 stages: integration, staging, canary, and production.

#### Integration

Integration is used only for integration tests and any data here could be wiped without notice.  Integration talks to
the staging environment of other services that don't have a dedicated integration environment.

#### Staging

Staging data is generally persistent.  Staging talks to the staging environments of other services at Twitch.

#### Canary

Canary is a single instance behind the production ALB.  It has dedicated log groups and all metrics go to the 'canary'
environment.

#### Production

Production shares the ALB with canary and gets live traffic.

## Required resources

Serving 15,000 req/sec requires 108 CPU units.  Memory isn't an issue and the 1GB of memory per task is more than
enough.  On-disk storage isn't used.  Network bandwidth availability could become an issue and is monitored at the ECS
level.

The service expects to scale up to 40,000 req/sec.  To increase that number, allow more task instances in ECS.
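
As a rough planning aid, the baseline above can be extrapolated to other request rates.  The sketch below assumes CPU
usage scales linearly with request rate, which is an assumption and not something measured in this run book; the
helper name `cpu_units_for` is hypothetical.

```bash
# Baseline from this section: 15,000 req/sec needs 108 CPU units.
BASELINE_RPS=15000
BASELINE_CPU_UNITS=108

# cpu_units_for TARGET_RPS
# Prints the CPU units needed for TARGET_RPS, rounded up, assuming linear scaling.
cpu_units_for() {
  local target_rps="$1"
  echo $(( (target_rps * BASELINE_CPU_UNITS + BASELINE_RPS - 1) / BASELINE_RPS ))
}
```

Under that assumption, `cpu_units_for 40000` prints `288`, since 40,000 × 108 / 15,000 = 288.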

## Security and access control

There is no access control on GraphDB other than being on the VPN.

## System configuration

### Secrets

Secrets are managed via Sandstorm at https://dashboard.internal.justin.tv/sandstorm/manage-secrets

SSH access to the cluster is managed by LDAP groups.

### Configuration management

Configuration is managed by https://git-aws.internal.justin.tv/hygienic/distconf and stored in consul at
http://consul.internal.justin.tv/ui/dist/#/us-west2/kv/settings/feeds/production/

## System backup and restore

### Backup requirements

Only DynamoDB needs to be backed up, since it is the only source of truth for data.

Backups require point-in-time recovery to be enabled on all the DynamoDB tables.  You can learn more about it at
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/PointInTimeRecovery.html

### Backup procedures

Backups are done automatically by DynamoDB and can be restored to any time in the past 35 days.

### Restore procedures

Use Point-in-time recovery's UI to restore tables.  The tables will get a new name.  Then, update the node registry to
read from the new table names.

Make sure you restore both the counts and edges tables at the same time, from the same backup time.
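
The same restore can also be sketched with the AWS CLI instead of the UI.  This is an illustration only: the helper
name `restore_graph_tables`, the target-name suffix, and the table names you pass in are hypothetical.  Passing one
timestamp for all tables keeps the counts and edges tables on the same restore point.

```bash
# Sketch only: restore_graph_tables is a hypothetical helper; supply the real table names.

# restore_graph_tables RESTORE_TIME SUFFIX TABLE [TABLE...]
# Restores every listed table to the same point in time, as "<table>-<suffix>".
restore_graph_tables() {
  local restore_time="$1" suffix="$2"
  shift 2
  local table
  for table in "$@"; do
    aws dynamodb restore-table-to-point-in-time \
      --region us-west-2 \
      --source-table-name "$table" \
      --target-table-name "${table}-${suffix}" \
      --restore-date-time "$restore_time" || return 1
  done
}
```

For example, `restore_graph_tables <timestamp> restored <edges-table> <counts-table>` would create two new tables
with a `-restored` suffix.  The node registry still has to be updated to read from the new table names afterwards.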

## Monitoring and alerting

### Container stdout/stderr logs

All containers log their stdout and stderr to CloudWatch logs.  The direct link is https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2#logStream:group=feeds-production-container-logs

### Access logs

All REST/Twirp requests to GraphDB log an event to the CloudWatch group graphdb-production-access-logs.  The direct
link to search those is https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2#logEventViewer:group=graphdb-production-access-logs;start=PT30S

Each event is logged as a JSON-encoded protocol buffer.  See `message AccessLog` in the file private.proto for the
logged fields.  Because the logs are JSON, you can search them with the CloudWatch UI.  For example,
`{$.status_code != "200" }`.

If you have the `awslogs` command installed, you can run the following to print the logs to stdout:
```bash
AWS_PROFILE=twitch-feed-aws awslogs get graphdb-production-access-logs -f '{$.status_code!="200"}'
```
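
If `awslogs` is not installed, a similar query can be sketched with the plain AWS CLI.  The helper name
`recent_errors` and its default five-minute window are assumptions made for illustration; the log group and filter
pattern are the ones described above.

```bash
# Sketch only: recent_errors is a hypothetical helper.

# recent_errors [MINUTES]
# Prints access-log events from the last MINUTES minutes whose status code is not "200".
recent_errors() {
  local minutes="${1:-5}"
  # CloudWatch Logs expects the start time in milliseconds since the epoch.
  local start_ms=$(( ($(date +%s) - minutes * 60) * 1000 ))
  aws logs filter-log-events \
    --region us-west-2 \
    --log-group-name graphdb-production-access-logs \
    --filter-pattern '{ $.status_code != "200" }' \
    --start-time "$start_ms"
}
```

With the same profile as above, `AWS_PROFILE=twitch-feed-aws recent_errors 10` would look back ten minutes.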

### Error logs

Errors are logged to rollbar at https://rollbar.com/Twitch/GraphDB/

### Metrics

Metrics are charted in grafana at https://grafana.internal.justin.tv/dashboard/db/graphdb-circuits

### Alerts

We use Grafana alerts.  All alerts are viewable on the Grafana dashboard and forward to PagerDuty.

## Operational tasks

### Code repository

Code is stored in https://git-aws.internal.justin.tv/feeds/graphdb

### Code Build

Jenkins builds GraphDB into a Docker image at https://jenkins.internal.justin.tv/view/feeds/job/feeds-graphdb/

### Deployment

Jenkins deploys GraphDB in stages integration->staging->canary->production using pipelines at
https://jenkins.internal.justin.tv/view/feeds/job/feeds-graphdb-pipeline/

All code deploys to staging automatically but requires manual confirmation to promote to canary and production.
