<!-- START doctoc generated TOC please keep comment here to allow auto update -->
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
**Table of Contents**  *generated with [DocToc](https://github.com/thlorenz/doctoc)*

- [db-s3-glue](#db-s3-glue)
  - [How it Works](#how-it-works)
  - [Getting Started](#getting-started)
    - [AWS Configuration](#aws-configuration)
      - [Network](#network)
      - [Secrets Management](#secrets-management)
      - [Alerting](#alerting)
    - [Initial Terraform Configuration](#initial-terraform-configuration)
      - [Schema Management](#schema-management)
      - [Output S3 Bucket Configuration](#output-s3-bucket-configuration)
    - [Registering a new Tahoe producer](#registering-a-new-tahoe-producer)
    - [Validating Setup](#validating-setup)
      - [Debugging Support](#debugging-support)
      - [Parameter tuning for performance](#parameter-tuning-for-performance)
  - [Debugging Issues](#debugging-issues)
  - [Migrating from Earlier Versions](#migrating-from-earlier-versions)

<!-- END doctoc generated TOC please keep comment here to allow auto update -->

# db-s3-glue

This repository houses a Terraform module for creating AWS Glue jobs to dump database contents to s3,
followed by an import into a configured Tahoe producer schema.

**Databases supported at the moment**: Aurora and PostgreSQL RDS databases, Redshift,
and DynamoDB tables. Support for other database types is evaluated on request.

## How it Works

One "watcher" Glue job will spin up a snapshot of the RDS database (not necessary for DynamoDB)
and then launch runs of the other "table" Glue job for each table being exported.
The "table" Glue jobs use Glue and a given per-table configuration to write Parquet files to
S3 in your account, encrypted with a newly created KMS key (or your own key).
After a "table" Glue job completes execution, the "watcher" job will interact with the Tahoe API and
import the generated Parquet data so it can be queried later through Redshift Spectrum as a table using a
producer schema you'll configure at the time of setup. It then creates a view in the "dbsnapshots"
schema pointing at the generated table.

CloudWatch logs are also encrypted with a separate
new KMS key. Failures in the "table" Glue jobs are propagated to the "watcher", and failures in the
"watcher" are reported to an SNS topic of your choice via CloudWatch Events.

The "watcher" Glue job is triggered every 24 hours by default, but it's configurable with
`trigger_schedule`. We recommend running it shortly after midnight PT (e.g. 8 or 9 AM UTC), assuming
the latest snapshot is past the midnight boundary at this time. This allows Sheik to have complete
data for the previous day when it runs.

The initial design doc is available [here](https://docs.google.com/document/d/1fu50kXT7_s3umN5-a5OtByz5lhXd6Hr1GvYU05wyNLA/edit?ts=5b884c51#).

## Getting Started

Setting up a new Glue ingestion flow entails the following high-level steps:

1. [Setting up some AWS pre-requirements](#aws-configuration)
1. [Deploying AWS Glue jobs through terraform](#initial-terraform-configuration)
1. [Registering a new Tahoe producer](#registering-a-new-tahoe-producer)
1. [Validating the setup with an end to end test](#validating-setup)

The first three steps can usually be completed within a couple of hours, but the last step can take
anywhere from a couple of hours to a couple of weeks, as some setups are more demanding than
others. If you have a staging database, we recommend setting up a staging configuration
first and using that to get things running and allow for faster iteration in the future.

As you follow the steps, don't hesitate to raise blockers early in
[#data-infrastructure](https://twitch.slack.com/archives/C03C62S07) and we'll prioritize the
support appropriately.

### Service Ownership Model & Escalation Path

At present, service ownership between Data Platform and teams using DB-S3-Glue is as follows:

1. Data Platform owns the DB-S3-Glue framework and provides ongoing on-call support, which includes helping with 1) necessary framework-level fixes; 2) debugging ETL job failures; 3) onboarding new teams; and 4) escalation with AWS.
2. Teams are responsible for maintaining their DB-S3-Glue infrastructure and customized ETL code, and for ensuring the jobs have monitoring and alerting enabled to detect failures.
3. If you run into problems while debugging DB-S3-Glue:
    - Follow the [runbook](https://git-aws.internal.justin.tv/twitch/docs/blob/master/datainfra/playbooks/db-s3-glue.md) provided by Data Platform.
    - If the issue is still unresolved, contact Data Platform [on-call](https://support.di.xarth.tv/) with 1) a detailed description of the problem; 2) the debugging you have done based on the runbook; and 3) the severity of the business impact.
    - Once the ticket is submitted, Data Platform on-call will engage the customer, provide assistance, and assess the next steps. If required, Data Platform will [escalate](https://wiki.twitch.com/display/AD/AWS+Support+Guide+and+Communication) the issue with AWS. (Note: at present we do not provide any SLA on the ETL jobs.)


### AWS Configuration

#### Network

Glue must be run in a private subnet with a NAT and ideally an S3 VPC endpoint. If you are using RDS,
the database must be accessible from that subnet. The subnet must also support reverse IP lookups.
This requires the VPC to have `enableDnsHostnames` and `enableDnsSupport` set to `true`, as outlined
[here](https://docs.aws.amazon.com/glue/latest/dg/set-up-vpc-dns.html).
If it's using AWS DNS (most VPCs aren't), that's all that's necessary, but if it's using Twitch DNS,
entries need to be added for its IP range around
[here](https://git-aws.internal.justin.tv/systems/puppet/blob/master/hiera/default.yaml#L576).
For example, a subnet that's `10.192.12.0/22` needs entries for:
```
    - 12.192.10.in-addr.arpa
    - 13.192.10.in-addr.arpa
    - 14.192.10.in-addr.arpa
    - 15.192.10.in-addr.arpa
```

You can ask #systems for these reverse DNS delegations to be added.
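
To work out which entries to request for your own subnet, the mapping in the example above can be sketched in Python. This is only an illustration: the function name `reverse_zones` is hypothetical, and it assumes one `in-addr.arpa` entry per /24 block, which is what the `10.192.12.0/22` example suggests.

```python
import ipaddress


def reverse_zones(cidr):
    """Return one in-addr.arpa zone name per /24 block in an IPv4 subnet.

    Hypothetical helper: it assumes delegations are made per /24, as in
    the 10.192.12.0/22 example.
    """
    network = ipaddress.ip_network(cidr)
    if network.version != 4 or network.prefixlen > 24:
        raise ValueError("expected an IPv4 network of size /24 or larger")
    zones = []
    for block in network.subnets(new_prefix=24):
        first, second, third, _ = str(block.network_address).split(".")
        # Reverse zones list the octets in reverse order.
        zones.append(f"{third}.{second}.{first}.in-addr.arpa")
    return zones


for zone in reverse_zones("10.192.12.0/22"):
    print(f"    - {zone}")
```

Running it for `10.192.12.0/22` prints the same four entries listed above.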


#### Secrets Management

If you are using RDS, you must create an encrypted SecureString Parameter in
[AWS Systems Manager's Parameter Store](https://us-west-2.console.aws.amazon.com/systems-manager/parameters?region=us-west-2)
to hold the database password. You will need a KMS key to encrypt the password, and you'll
provide that same key later during the Terraform configuration.

You'll also need to create a separate SecureString Parameter to store the API key assigned to your
Tahoe producer during registration, which you'll obtain later as part of the Terraform configuration.
For now, make sure you have a KMS key to encrypt that parameter as well.

#### Alerting

Errors on Glue are reported to an SNS topic that you provide. We recommend using whatever SNS topic
you use for low-priority alarms since failures to export need to be fixed but aren't immediate fires.

### Initial Terraform Configuration

You'll need to define several [variables](variables.tf) as inputs to the module.
You will need to be on AWS provider [~>](https://www.terraform.io/docs/configuration/providers.html#version-provider-versions) 2.44
and Terraform 0.12.x. If you aren't currently on this version of Terraform, you can use [tfenv](https://github.com/tfutils/tfenv)
to switch versions.

Deploying resources for the first time is a two step process:

1. The first deploy should omit `tahoe_producer_name`, `tahoe_producer_role_arn`,
`api_key_parameter_name`, and `api_key_kms_key_id`. After you deploy, take note of the
`s3_output_bucket`, `s3_kms_key` and `glue_role` provided as output for the next step.
1. Follow the instructions to [register a Tahoe producer](#registering-a-new-tahoe-producer), which
describe both how to define the producer and how to update the Terraform configuration
to use it.

We provide example Terraform configurations for [RDS](examples/rds.tf) and
[Dynamo](examples/dynamo.tf). Depending on your setup, you might also need to consider the following:

- For Aurora and Redshift:
  - Make sure your cluster's security group accepts requests from all IPs on the subnet you pass to Terraform.
  - For Aurora, you will want a configuration much like the RDS one, but you need to set `skip_snapshot = "True"`
  and `aurora = "True"`, and you won't need `rds_subnet_group`.
  - For Redshift, use the RDS config, but set `database_type = "redshift"`, `skip_snapshot = "True"`,
  and don't set `rds_subnet_group`. See the caveat below about slow Redshift unloads.
  - Redshift table UNLOADs through Glue have a one-hour timeout due to IAM restrictions.
  This is circumventable by setting an IAM role (see https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-redshift.html), but one hour is plenty of time unless you have a very large and skewed table.
  Please contact [#data-infrastructure](https://twitch.slack.com/archives/C03C62S07) if your UNLOAD is taking over an hour.
  - Also, for Redshift, Glue will use the `_scratch/` key prefix in your output S3 bucket to
  write temporary data.

#### Schema Management

Defining the table schemas is done in the `table_config` [variable](variables.tf#L68). Multiple tables can be defined in the
same configuration.  
Schema updates **always** require incrementing the `version` number so changes can be reflected all the way to the Tahoe tables/views propagated across clusters. Any change made to the `schema` or `output_fields` fields is considered an update.

For large tables (except on Redshift or Dynamo), make sure to set `hashexpression` or `hashfield` to parallelize reads. For
more details, check the [performance tips section](#parameter-tuning-for-performance).

If you are using Dynamo, you will need the example [cleaning code](examples/dynamo.tf#L33) if you have documents missing some
fields or you want to extract fields from arrays/structs.

#### Output S3 Bucket Configuration

The `s3_output_bucket` variable defines the S3 bucket to which the Glue jobs write their encrypted
output. By default the bucket, its policy and KMS key are created for you, which is the
recommended approach. Because the goal of most Glue exports should be to use this bucket as an
intermediate place to push data into Tahoe, the bucket is also created with a 90 day retention
policy.

If for some reason you have to provide your own bucket or KMS key (maybe you have extra
requirements), you can prevent their creation by setting `create_s3_output_bucket = "0"` or
`s3_output_kms_key_arn = "0"`, respectively. In that case, the policy of the resource needs to be
updated manually to match what is defined by [glue.tf](glue.tf).

### Registering a new Tahoe producer

To allow other teams at Twitch to consume the latest exported Parquet data through a [Tahoe Read Replica](https://data.xarth.tv/getting_started/datasources.html#tahoe-read-replicas),
you'll need to register as a Tahoe producer. Doing so will allow the "watcher" Glue job that you deploy to interact with the
Tahoe API so data becomes available through Redshift Spectrum as a table in a schema that's created for you at the time of
registration.

Before you can register your producer, you should have an `s3_output_bucket` and `s3_kms_key`
from the previous [Terraform configuration](#initial-terraform-configuration) step.
These will be the `s3_source_bucket` and `source_kms_arn` for your Tahoe producer,
respectively. You can then follow [this](https://data.xarth.tv/tahoe_producers/registering_your_application.html)
guide to register a new Tahoe producer. Keep in mind the following as you go through the steps:

- We recommend using the same name for Resolver Group, CTI, and Bindle. A good naming convention to follow is
`Glue<Type>Export-<DBName>`, where `<Type>` would be either `RDS` or `Dynamo`. Example: `GlueRDSExport-Blueprint`.
- The "Principal Role" you specify for the Tahoe producer must be your account's root, i.e. `arn:aws:iam::<account number>:root`
- The name of your producer is more restrictive so we recommend the `<dbname>dbexport` convention, where `<dbname>` can
only be alphanumeric and needs to start with a letter.
- Take note of the API key you get back after registering your application (in the `keys` field of the response),
  create a SecureString [Parameter](https://us-west-2.console.aws.amazon.com/systems-manager/parameters?region=us-west-2)
  with that value, and update the `api_key_parameter_name` with the name of the Parameter and `api_key_kms_key_id`
  with the ID of the KMS key you used.

Once you complete the steps, specify both `tahoe_producer_name = "<your_producer_name>"` and
`tahoe_producer_role_arn = "arn:aws:iam::331582574546:role/producer-<your_producer_name>"` in the Terraform configuration and
deploy again. After this your configuration should be good to go! If you would like a Tahoe Read Replica
to consume from this new schema, reach out to [#data-infrastructure](https://twitch.slack.com/archives/C03C62S07) for support.
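
As a rough illustration of how the naming convention and the role ARN fit together, the Python sketch below derives both Terraform values from a database name. The helper `producer_settings` is hypothetical and not part of this module. It assumes the `<dbname>dbexport` convention and the ARN pattern quoted above, and it accepts both upper- and lower-case letters because the exact Tahoe naming rules are not spelled out here.

```python
import re

# Account ID taken from the role ARN pattern quoted in this README.
TAHOE_ACCOUNT_ID = "331582574546"

# Assumption: "alphanumeric and needs to start with a letter".
DBNAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def producer_settings(dbname):
    """Build the two Terraform values for a hypothetical database name."""
    if not DBNAME_PATTERN.fullmatch(dbname):
        raise ValueError(
            f"{dbname!r} must be alphanumeric and start with a letter"
        )
    producer_name = f"{dbname}dbexport"
    role_arn = (
        f"arn:aws:iam::{TAHOE_ACCOUNT_ID}:role/producer-{producer_name}"
    )
    return {
        "tahoe_producer_name": producer_name,
        "tahoe_producer_role_arn": role_arn,
    }


for key, value in producer_settings("blueprint").items():
    print(f'{key} = "{value}"')
```

Checking the name before registering is cheap, since a name that breaks the convention would otherwise only surface as a failure during registration.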

### Validating Setup

You can manually run the "watcher" or "table" jobs through the
[AWS Glue console](https://us-west-2.console.aws.amazon.com/glue/home?region=us-west-2#etl:tab=jobs).
For the "watcher" job, you can specify `--preserve_cluster` as `1` to keep the RDS cluster
around after the job finishes. The next run will automatically recreate it unless you run it
with `--use_existing_cluster` as `1`.
By default, the job will take the current time and truncate to the nearest noon or midnight UTC
to determine the time to snapshot (for RDS) and the time to write in its output (RDS and DynamoDB).
For RDS, you can override this time by passing `--ts` as `YYYYMMDDTHHMMSS`. For DynamoDB, it can't
be overridden since it's loading from the live table instead of a snapshot.
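
If you want to pass `--ts` explicitly, the default described above can be approximated in Python. This sketch assumes that "truncate" means rounding down to the most recent noon or midnight UTC; the actual job code may differ, and `default_snapshot_ts` is a hypothetical name.

```python
from datetime import datetime, timezone


def default_snapshot_ts(now):
    """Round an aware datetime down to noon or midnight UTC.

    Assumption: "truncate" is read as rounding down, not rounding to the
    closest boundary. The result uses the YYYYMMDDTHHMMSS layout of --ts.
    """
    if now.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    utc = now.astimezone(timezone.utc)
    boundary_hour = 12 if utc.hour >= 12 else 0
    truncated = utc.replace(
        hour=boundary_hour, minute=0, second=0, microsecond=0
    )
    return truncated.strftime("%Y%m%dT%H%M%S")


print(default_snapshot_ts(datetime.now(timezone.utc)))
```

For example, a run at 14:30 UTC on 2020-03-05 would yield `20200305T120000` under this assumption.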

Jobs usually take about 8 minutes to cold start while Glue spins up resources in the background,
but they'll run faster if started again a few minutes after a previous run.

You can try using
[Dev endpoints](https://docs.aws.amazon.com/glue/latest/dg/console-development-endpoint.html)
for a better debugging cycle, but we were unable to test them due to the lack of an S3 VPC endpoint in our VPC.

#### Debugging Support

If your previously-working Glue job is failing, ensure the underlying schema of the table hasn't
changed. If it has, please update your `table_config` with the new schema and increment the version.

Our playbook for db-s3-glue is
[here](https://git-aws.internal.justin.tv/twitch/docs/tree/master/datainfra/playbooks/db-s3-glue.md)
for debugging your job. Please open a [DI support desk ticket](https://support.di.xarth.tv/)
when you start debugging if you would like our help.
When doing so, please add POSIX group `scieng` to an admin-like role on your account
in Isengard (unfortunately the Glue roles don't give enough permissions to meaningfully debug issues).
You can revoke access at the end.

#### Testing Changes

When testing changes to an existing job, it can be helpful to create a second job to avoid breaking
the existing production job. The [qa_for_job_name variable](variables.tf) can be used to
set up a second job that runs independently and writes into the `qa_dbsnapshots` schema instead
of `dbsnapshots`. See that variable's definition for details on how to use it.

#### Parameter tuning for performance

Everyone's data is a little different, so this section serves as some general information for tuning
the parameters for processing your data.

[A single Data Processing Unit (DPU) provides 4 vCPU and 16 GB of memory](https://aws.amazon.com/glue/pricing/).
This is controlled by tuning `worker_count` on a per table basis. For the default `Standard`
`worker_type`, one worker uses 1 DPU and has 2 executors. Previously, some jobs had memory problems
and needed `G.1X` or `G.2X` worker types which have fewer executors per DPU, giving each executor
more memory.  The `spark_optimization` removes an embarrassing amount of Glue overhead by
doing direct Spark operations, so the `Standard` worker should work with any size tables, and the
job will run much faster (we've seen large table jobs go from 8+ hours to 15 minutes).
The main trade-off for the `spark_optimization` is that you must do Spark operations in the
`spark_cleaning_code`, which can be a little trickier to write than the equivalent Glue-based
`Map.apply` in `cleaning_code`, but the performance gain is worth it for all but the smallest tables.
See [variables.tf](variables.tf) for more information.

If you weren't using `Standard` workers, you should strongly consider using the `spark_optimization`
instead and switching back to `Standard`. Because larger worker types aren't recommended, the
rest of this section only applies to `Standard` workers.

After setting a `worker_count`, the `Max capacity` in the dashboard will reflect the `worker_count`
plus one, because Glue reserves one DPU of capacity (more on larger worker types), presumably
for master operations. Glue's docs say to use capacity instead of worker counts for Standard workers,
but the API accepts worker counts and adds one to get the capacity.

So, after you set a `worker_count`, each worker has two executors, but one executor is held
out by Glue, so you end up with `2 * worker_count - 1` executors, which you can see in the
CloudWatch metrics for the job as `glue.driver.ExecutorAllocationManager.executors.numberAllExecutors`.

For reading `hashpartitions` or `dynamodb_splits_count` (see below), each executor has 4 vCPUs,
and there's one extra reader, so up to `4 * executors + 1` can read at a time. This equation
combines with the equation from the previous paragraph to give you `8 * worker_count - 3` readers.
You can see the number of readers in the `/aws-glue/jobs/logs-v2` CloudWatch log group by going to
the `<job-run-id>-progress-bar` log stream.
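
The two equations above are easy to tabulate. The Python sketch below only restates them for `Standard` workers; the function names are illustrative.

```python
def executor_count(worker_count):
    """Executors left after Glue holds one out (Standard workers)."""
    return 2 * worker_count - 1


def reader_count(worker_count):
    """Concurrent readers: 4 vCPUs per executor plus one extra reader."""
    return 4 * executor_count(worker_count) + 1


print("worker_count  executors  readers")
for workers in range(1, 6):
    # reader_count(w) simplifies to 8 * w - 3.
    print(f"{workers:>12}  {executor_count(workers):>9}  {reader_count(workers):>7}")
```

For example, a `worker_count` of 4 gives 7 executors and 29 readers.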

A line like `[Stage 0 (parquet at NativeMethodAccessorImpl.java:0):=====> (12 + 5) / 24]`
means that 12 tasks have finished, 5 are running, and there are 24 in total. Thus, only 5 workers
were working at a time to do 24 tasks. Going up to either 24 workers or a factor of 24 (e.g. 6 or 12)
will help distribute the work more evenly and cause the job to be faster.
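
If you want to pull these numbers out of the log stream programmatically, a small parser along these lines could work. It is a sketch based only on the sample line above, so other Spark progress-bar variants may not match, and `parse_progress` is a hypothetical helper.

```python
import re

# Matches the "(finished + running) / total]" tail of a progress-bar line.
PROGRESS_PATTERN = re.compile(r"\((\d+)\s*\+\s*(\d+)\)\s*/\s*(\d+)\]")


def parse_progress(line):
    """Return task counts from a progress-bar line, or None if absent."""
    match = PROGRESS_PATTERN.search(line)
    if match is None:
        return None
    finished, running, total = (int(group) for group in match.groups())
    return {
        "finished": finished,
        "running": running,
        "total": total,
        "pending": total - finished - running,
    }


sample = (
    "[Stage 0 (parquet at NativeMethodAccessorImpl.java:0):"
    "=====> (12 + 5) / 24]"
)
print(parse_progress(sample))
```

The `running` value is the number to compare against the reader count you expect from your `worker_count`.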

Another interesting CloudWatch metric is `glue.driver.ExecutorAllocationManager.executors.numberMaxNeededExecutors`,
which tells you how many executors Glue thinks you need. It doesn't seem to account for the extra
reader beyond the 4 per executor, so I've seen it be one higher than necessary when comparing it
to the number of tasks actually running simultaneously in the logs.

The number of tasks to run generally equals `hashpartitions` or `dynamodb_splits_count`, so those
two parameters tie tightly to the `worker_count`. See below for more about what they mean.
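
One way to reason about how the task count and `worker_count` interact is to estimate how many "waves" of tasks a job needs. The sketch below uses the `8 * worker_count - 3` reader count from this section and assumes, as a simplification, that all tasks take about the same time. The function name is illustrative.

```python
import math


def estimate_waves(task_count, worker_count):
    """Estimate rounds of reads for a given task count (Standard workers).

    Simplifying assumption: every task takes roughly the same time, so
    tasks complete in waves the size of the reader pool.
    """
    readers = 8 * worker_count - 3
    waves = math.ceil(task_count / readers)
    idle_in_last_wave = waves * readers - task_count
    return {
        "readers": readers,
        "waves": waves,
        "idle_in_last_wave": idle_in_last_wave,
    }


# 24 tasks (e.g. 24 hashpartitions) on one worker versus three workers.
print(estimate_waves(24, 1))
print(estimate_waves(24, 3))
```

A large `idle_in_last_wave` relative to `readers` suggests the task count and `worker_count` are poorly matched.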


##### RDS/Redshift

`hashpartitions` is the number of concurrent connections that will be made to your RDS database.  A
`hashfield` (column name) or a `hashexpression` (SQL query) tells the JDBC connector you are using how to evenly
distribute data between parallel readers of your database.  [It is best if the resultant value is
evenly distributed](https://docs.aws.amazon.com/glue/latest/dg/run-jdbc-parallel-read-job.html).
By default only 7 `hashpartitions` are used for your JDBC connection.  This is a weird number for
them to have chosen, and you probably want to change it using the recommendations of the previous
section.

The main concern with `hashpartitions` is that if you have a high `hashpartitions` and a high
`worker_count`, you can overload your database. If you're using a snapshot, this is less of a problem,
but I recommend starting with a `worker_count` of 4 and 21 `hashpartitions` for larger tables and
scaling up from there if load is acceptable.

The smallest tables can have a `worker_count` of 1 and 5 or fewer `hashpartitions`.

Another RDS-specific tuning parameter is `rds_instance_class_override`, which can be used
to create a snapshot with a larger DB instance type. It is most useful when you use a
small DB instance type but have a lot of historical data.

##### DynamoDB

To control the parallelism of your DynamoDB export you should be setting `dynamodb_splits_count`
and the `read_ratio` in the table config.  The former controls how many concurrent readers there are
against your table (like `hashpartitions` above) and is controlled by the same logic, though it more
frequently runs into throttling issues if there are too many splits and not enough allocated read
capacity, so you'll want to scale it down if you are being throttled (Glue seems to fail pretty
catastrophically when throttled, reading the table multiple times).

`dynamodb_splits_count` works essentially the same as `hashpartitions`, so I recommend reading
the previous two sections to understand how it works (though there is no equivalent of `hashfield`
or `hashexpression` because the table's primary key is always used for that). The same caveats
around table load apply to the combination of `worker_count` and `dynamodb_splits_count`, and the
ratios of them recommended above still apply.

The `read_ratio` controls what percentage of the provisioned read capacity all of the readers
should consume. For autoscaling
tables, it uses the provisioned read capacity when the job starts, ignoring later autoscaling,
so it will run faster or slower depending on the capacity when it starts. For on-demand tables,
the `read_ratio` seems to be a fraction of the total available underlying capacity, so it can be
fairly low.

`dynamodb_splits_count` and `read_ratio` correspond to `dynamodb.splits` and `dynamodb.throughput.read.percent`
[here](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html#aws-glue-programming-etl-connect-dynamodb)
respectively if you want more information.

## Migrating from Earlier Versions

Since the initial release, there have been several changes that will require action on your part:

1. `job_name` has been added. Previously, `cluster_name` determined what cluster to read from
   and what all Terraform resources were named. Making `job_name` separate allows multiple jobs
   reading from the same table. You can set it to your `cluster_name` if you want to avoid
   recreating the Terraform resources.
1. The security configuration is now in Terraform. You can either delete your existing one via the
   Glue AWS console or import it into the Terraform state:
   `terraform import module.{module_name}.aws_glue_security_configuration.export {job_name}-export`.
   You will get an `AlreadyExistsException` on the security configuration if you don't do this.
1. `table_config` is no longer a list of lists. It is now a list of dictionaries. Each list
   was previously `["field_name", "type_name"]`. It should now be
   `{"name": "field_name", "type": "type_name"}`. You can also now specify "sensitivity", which
   should be set for any Customer-level data, e.g. user IDs or IPs.
1. `table_config` can use `"` instead of `'` if you use heredoc `<<EOF`, which makes it more
   obviously JSON. We will be removing support for `'` eventually.
1. You will need to set up the Tahoe import (see above).
1. We moved from `tahoe_producer_lambda_id` and `tahoe_stack_account_id` to `tahoe_producer_role_arn`.
1. The variable `create_s3_output_bucket_policy` was removed. Instead now the policy is always
   created when `create_s3_output_bucket = "1"`, otherwise it needs to be defined manually.
1. If you have any large tables, setting `hashexpression` or `hashfield` will drastically speed
   up your load.
1. We removed the ability to set a yarn memory overhead limit because it caused jobs to wait for
   a very long time to get started.  Instead you should specify a `worker_type` in your config to
   use different memory characteristics.
1. We upgraded to Terraform 0.12.x.  You need to `terraform init` and update your provider.
1. Multiple DynamoDB tables can now be unloaded in one job. `dynamodb_splits_count` moved from a
   top-level variable to being part of the `table_config`. You can now ignore `cluster_name`
   for DynamoDB jobs.
1. `dpu_count` has been renamed to `worker_count` to better reflect how it is used.
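
For the `table_config` change from lists to dictionaries described in the list above, a conversion along the following lines may save some hand editing. It is a sketch: the field names and types in the sample are made up, `migrate_fields` is a hypothetical helper, and you still need to add `sensitivity` yourself where it applies.

```python
import json


def migrate_fields(old_fields):
    """Convert ["field_name", "type_name"] pairs to name/type dicts."""
    return [
        {"name": field_name, "type": type_name}
        for field_name, type_name in old_fields
    ]


# Hypothetical old-style field list, for illustration only.
old_style = [["id", "bigint"], ["login", "string"]]

# json.dumps emits double quotes, matching the heredoc-friendly format.
print(json.dumps(migrate_fields(old_style), indent=2))
```

The output can be pasted into a heredoc-based `table_config` and then reviewed field by field.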
