ETL Pipeline Implementation
---------------

## Requirements
* [science "requirements"](https://docs.google.com/document/d/18i3V9hQryLr8ZarvghxMqk7U_BjcNTJgxgfF7OLeEgE/edit#heading=h.km5qd03ozf9)
  * tl;dr: not many requirements; I've distilled them to the following
    * each table from each database d8a manages
      * rails-postgres
      * tmi-postgres
      * discovery-postgres
      * cohesion
      * usherdb
      * vods
    * output to S3
    * data is no older than 5 hours
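
One way to reason about the 5-hour freshness requirement: if each run snapshots the tables when it starts, the newest data available in S3 is at most one schedule interval plus one run duration old. Below is a minimal sketch of that arithmetic. The function names and the example interval and duration are hypothetical, not measurements from the pipeline.

```python
from datetime import timedelta

# Freshness bound taken from the requirements list.
MAX_DATA_AGE = timedelta(hours=5)


def worst_case_age(schedule_interval, run_duration):
    """Oldest the newest export can get, just before the next run finishes.

    Assumes each run snapshots the tables at its start and takes a fixed
    run_duration to land the files in S3.
    """
    return schedule_interval + run_duration


def meets_freshness(schedule_interval, run_duration, max_age=MAX_DATA_AGE):
    """True if the schedule keeps exported data within max_age."""
    return worst_case_age(schedule_interval, run_duration) <= max_age


if __name__ == "__main__":
    # Hypothetical numbers: run every 4 hours, each run takes 30 minutes.
    print(meets_freshness(timedelta(hours=4), timedelta(minutes=30)))
```

Under these assumptions the schedule interval has to be shorter than 5 hours by at least the time a full export takes.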

## Assumptions
* output format is CSV
* add an ETL box with a large hard drive (and not much else) to each cluster, for dumping CSV files of the db
* dynamically get table names from databases using `sqlalchemy` bundled with `airflow`
* `COPY (SELECT * FROM {table-name}) TO '/tmp/{db-name}_{table-name}.csv' DELIMITER ',' CSV HEADER`
* `s3cmd put /tmp/{db-name}_{table-name}.csv s3://d8a/etl/{db-name}/{table-name}/{timestamp}.csv`
  * assumes s3cmd is present on ETL box, with credentials
* set a TTL of 60 days on each uploaded file
  * requires the `boto3` Python package and a proper `~/.aws/credentials` file
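
The naming conventions in the list above can be sketched as a few pure string helpers, which keeps the dump path, the `COPY` statement, and the S3 destination consistent with each other. This is only an illustration: the function names and the `users` table are hypothetical, and the table name is interpolated as-is on the assumption that it comes from the database's own catalog.

```python
def csv_path(db_name, table_name):
    """Local dump path, following /tmp/{db-name}_{table-name}.csv."""
    return "/tmp/{}_{}.csv".format(db_name, table_name)


def copy_statement(table_name, path):
    """COPY statement that dumps one table to a CSV file with a header row.

    The target path is single-quoted, as a SQL string literal.
    """
    return "COPY (SELECT * FROM {}) TO '{}' DELIMITER ',' CSV HEADER".format(
        table_name, path)


def s3_url(bucket, db_name, table_name, timestamp):
    """Destination, following s3://{bucket}/etl/{db-name}/{table-name}/{timestamp}.csv."""
    return "s3://{}/etl/{}/{}/{}.csv".format(bucket, db_name, table_name, timestamp)


def s3cmd_put(path, url):
    """Shell command that uploads the dump with s3cmd."""
    return "s3cmd put {} {}".format(path, url)


if __name__ == "__main__":
    # 'users' is a made-up table name used only for illustration.
    path = csv_path("tmi-postgres", "users")
    print(copy_statement("users", path))
    print(s3cmd_put(path, s3_url("d8a", "tmi-postgres", "users", "2016-06-01-05")))
```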


## Airflow Implementation
* tmi-postgres DAG
![airflow dag](./tmi-etl.png)

* airflow code

```python
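from datetime import date, datetime, timedelta

import boto3
import botocore.exceptions
from airflow import DAG
from airflow.contrib.hooks.ssh_hook import SSHHook
from airflow.contrib.operators.ssh_execute_operator import SSHExecuteOperator
from airflow.operators.postgres_operator import PostgresOperator
from airflow.operators.python_operator import PythonOperator
from sqlalchemy import create_engine

# Import paths assume the Airflow 1.x module layout.
# S3_BUCKET, DATABASE, and engine_string are assumed to be defined elsewhere (not shown).


def ensure_s3_bucket_exists(bucket):
    """Return True if the bucket is reachable with the current credentials."""
    s3 = boto3.resource('s3')
    try:
        s3.meta.client.head_bucket(Bucket=bucket)
        return True
    except botocore.exceptions.ClientError:
        # Note: PythonOperator only fails a task on an exception, so returning
        # False here does not stop the downstream tasks.
        return False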


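def set_expiration_on_s3_object(bucket, key):
    """Set the Expires header on an existing object to 60 days from now."""
    s3 = boto3.resource('s3')
    obj = s3.Object(bucket, key)
    expiration = datetime.utcnow() + timedelta(days=60)
    # Copy the object onto itself to replace its metadata; a put() without a
    # Body would overwrite the uploaded CSV with an empty object.
    # Expires is HTTP caching metadata: S3 does not delete the object when it passes.
    obj.copy_from(CopySource={'Bucket': bucket, 'Key': key},
                  Expires=expiration,
                  MetadataDirective='REPLACE')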

db_engine = create_engine(engine_string)

tmi_tables = db_engine.table_names()

default_args = {
    'owner': 'd8a',
    'depends_on_past': False,
    'start_date': datetime(2016, 6, 1),
    'email': ['d8a@twitch.tv'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 5,
    'retry_delay': timedelta(minutes=5),
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
}

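# '@once' schedules a single run; a recurring schedule_interval is needed to keep data under 5 hours old.
dag = DAG('tmi-etl', default_args=default_args, schedule_interval='@once')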

s3_bucket_check = PythonOperator(task_id='s3-bucket-check', python_callable=ensure_s3_bucket_exists,
                                 dag=dag, op_kwargs={'bucket': S3_BUCKET})



for table in tmi_tables:
    table_csv = "/tmp/chat_depot-{table}.csv".format(**locals())
    table_schema = "public"
    # Hour-resolution timestamp, evaluated when the DAG file is parsed (not at task run time).
    timestamp = datetime.now().strftime("%Y-%m-%d-%H")
    s3_key = "etl/{DATABASE}/{table}/{timestamp}.csv".format(**locals())
    t1 = PostgresOperator(
        sql="Copy (SELECT * from {table}) TO '{table_csv}' DELIMITER ',' CSV HEADER".format(**locals()),
        task_id="{table}-to-csv".format(**locals()),
        postgres_conn_id="tmi-postgres",
        dag=dag
    )
    t1.set_upstream(s3_bucket_check)


    t2 = SSHExecuteOperator(
        ssh_hook=SSHHook(conn_id='tmi-ssh'),
        bash_command="s3cmd put {table_csv} s3://{S3_BUCKET}/{s3_key}".format(**locals()),
        task_id="s3-put-{table}".format(**locals()),
        dag=dag
    )

    t2.set_upstream(t1)

    t3 = PythonOperator(task_id='set-s3-expiration-on-{table}'.format(**locals()),
                        python_callable=set_expiration_on_s3_object,
                        dag=dag,
                        op_kwargs={'bucket': S3_BUCKET, 'key': s3_key})

    t3.set_upstream(t2)
```
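
The `Expires` value handled by `set_expiration_on_s3_object` is object metadata; S3 only removes objects automatically through a bucket lifecycle rule. If the 60-day TTL is meant as automatic deletion, it could be sketched as a single rule on the `etl/` prefix, as below. The function names and the rule ID are hypothetical, and the sketch assumes the bucket has no other lifecycle rules, since `put_bucket_lifecycle_configuration` replaces the bucket's existing lifecycle configuration.

```python
def etl_lifecycle_configuration(prefix="etl/", days=60):
    """Build a lifecycle configuration that expires objects under prefix.

    The rule ID is a made-up label; any string unique within the bucket works.
    """
    return {
        "Rules": [
            {
                "ID": "expire-etl-exports",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


def apply_lifecycle(bucket, configuration):
    """Apply the configuration to the bucket (replaces any existing rules)."""
    # Imported here so that building the configuration needs no AWS dependencies.
    import boto3

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration=configuration,
    )


if __name__ == "__main__":
    # Prints the rule only; call apply_lifecycle(bucket, config) to set it.
    print(etl_lifecycle_configuration())
```

With a rule like this in place the expiration applies to every object under the prefix, so a per-table expiration task would no longer be needed for cleanup.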
