# Dynamo to S3 Glue (db-to-s3-glue) Module

#### [Glue Jobs in AWS Console](https://us-west-2.console.aws.amazon.com/glue/home?region=us-west-2#etl:tab=jobs) (be on `twitch-feed-aws`)

## current jobs
#### [graphdb_production_follows](./follows/README.md)

## terraform version

Required version: `0.12.29`

## what does this infra do?
`db-to-s3-glue` is a process from data infrastructure that ETLs data from DynamoDB (in graphdb) into Tahoe (Twitch's data lake).
The core module creates an AWS Glue job that uses Apache Spark to run this ETL on a specified schedule. This job is a slim layer
on top of the data infrastructure job. The [db-s3-glue "how it works" section](https://git.xarth.tv/dp/db-s3-glue#how-it-works) is a good starting point.
 
For example, the job in `./follows` creates an AWS Glue job that runs daily at 9am UTC, reads **all**
of the rows in `graphdb_production_follows` (the user follows table), and applies some [transformation](./follows/follows_spark_cleaning_code.py) before
publishing to the `dbsnapshots.follows` Tahoe table.

## why is this separate? 

This module is separate because it needs to run on terraform 0.12, which `graphdb` is not on.

## how to run `terraform <command>` for this module

Each subdirectory has its own Tahoe job that uses the core module in `./modules/db-to-s3-glue`.

e.g. to run terraform in `./follows` for the `graphdb_production_follows` job:
```
chtf 0.12.29 # get to the right version
cd ./follows
terraform plan
```
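
Because the required version is pinned, it can help to fail fast when the wrong terraform is active. The following is only a sketch: `check_tf_version` is a hypothetical helper that does not exist in this repo, and it assumes the first line of `terraform version` output looks like `Terraform v0.12.29`.

```shell
# Hypothetical guard, not part of this repo.
required="0.12.29"

check_tf_version() {
  # $1 is the output of `terraform version`.
  # Assumption: its first line has the form "Terraform v<version>".
  actual=$(printf '%s\n' "$1" | head -n 1 | sed 's/^Terraform v//')
  [ "$actual" = "$required" ]
}
```

One way to use it: `check_tf_version "$(terraform version)" || echo "run chtf 0.12.29 first"` before `terraform plan`.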

## how to add a new dynamo-to-tahoe job

1. create a new directory and follow an existing example (e.g. `./follows/`)
```
mkdir <table_name>
```

You need the following files (see `./follows` for a good example):
1. backend for terraform remote state
1. Spark Python code that transforms the dynamo data into the right Tahoe format
1. schema JSON file describing the output data structure
1. a `main.tf` file that creates the resource using the core `./modules/db-to-s3-glue` module
    ```
    module "<your_table_name>_tahoe_export" {
      source = "../modules/db-to-s3-glue"
      team   = "<your_team_name>"
      name   = "<your_table_name>"
      ...
    ```
1. (_optional_) start off with a QA job to verify results. Refer to the Links section below to learn more about how the QA job works.
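
The file list above can be scaffolded in one go. This is a rough sketch, not a prescribed layout: `my_table` is a placeholder table name, `backend.tf` and `schema.json` are assumed file names (check `./follows` for the real ones), and the Spark script name only mirrors the pattern of `follows_spark_cleaning_code.py`.

```shell
# Sketch: create empty starter files for a new job directory.
table_name="my_table"   # hypothetical table name, replace with yours

mkdir -p "$table_name"

# terraform remote state backend (assumed file name)
touch "$table_name/backend.tf"
# Spark transformation code, named after the follows_spark_cleaning_code.py pattern
touch "$table_name/${table_name}_spark_cleaning_code.py"
# output data schema (assumed file name)
touch "$table_name/schema.json"
# module block that uses ../modules/db-to-s3-glue
touch "$table_name/main.tf"
```

The files are created empty; copy the contents from `./follows` and adapt them to the new table.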

## Links:
[db-s3-glue repo](https://git.xarth.tv/dp/db-s3-glue)

[Data Infra Optimization Spec](https://docs.google.com/document/d/1JukKqRcEyKCr_vJDsV_6Hbql5mxIrpe2pUXVzZZEK4I/edit)
