# zenyatta

* [zenyatta](https://zenyatta.justin.tv) is a web app to run and monitor d8a workflows
* uses [airflow](https://github.com/apache/incubator-airflow)
* hardware managed by [zenyatta/terraform](https://git-aws.internal.justin.tv/d8a/zenyatta/tree/master/terraform)
    * will be moved into this repo soon, since it has no dependencies on other terraform modules
* configuration managed by `systems/puppet`: 
    - [hiera](https://git-aws.internal.justin.tv/systems/puppet/blob/master/hiera/cluster/d8a-airflow.yaml)
    - [airflow module](https://git-aws.internal.justin.tv/systems/puppet/tree/master/modules/twitch_airflow/manifests)
    - [zenyatta module](https://git-aws.internal.justin.tv/systems/puppet/tree/master/modules/zenyatta)
* airflow's state is stored completely in a database, currently RDS.
  * this allows the master node to fail due to load, or a terraform run to fail, without losing state
  * bringing up a new master just means pointing the `airflow.cfg` value `sql_alchemy_conn` to the connection string of the RDS instance
    * an example is: `postgresql://user:password@airflow-postgres.rds.amazonaws.com:port/database`
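
A minimal sketch of assembling that connection string, assuming the credentials may contain characters such as `@` or `/` that need percent-encoding. The helper name `build_sql_alchemy_conn` and the sample values are hypothetical; only the URL shape comes from the example above. Whether `%` needs further escaping inside `airflow.cfg` depends on the Airflow version and is not covered here.

```python
from urllib.parse import quote_plus


def build_sql_alchemy_conn(user, password, host, port, database):
    """Return a postgresql:// URL in the shape shown above (hypothetical helper)."""
    # Percent-encode the credentials so that characters like '@' or '/'
    # inside a password are not mistaken for URL delimiters.
    return "postgresql://{}:{}@{}:{}/{}".format(
        quote_plus(user), quote_plus(password), host, port, database
    )


if __name__ == "__main__":
    # Sample values only; substitute the real RDS endpoint and credentials.
    print(build_sql_alchemy_conn(
        "airflow", "p@ss/word", "airflow-postgres.rds.amazonaws.com", 5432, "airflow"
    ))
```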
    
# topology
* [architecture diagram](doc/zenyatta-infrastructure.png)
   * master ec2 instance that acts as a scheduler
   * worker ec2 instances that perform tasks
   * a redis instance (via elasticache) as a task queue. 
     * this is how all communication is handled between master (the scheduler) and the workers
     * uses [celery](http://www.celeryproject.org/) to power the task queue
     * workers can use queues in redis to direct tasks to specific nodes
        * `airflow worker -q <queue-name>`
        * the number of worker processes is controlled by the `-c` parameter; the default is 16
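
As an illustration of the worker flags above, here is a small sketch that assembles the `airflow worker` command line for a given queue and concurrency. The function name `worker_command` and the queue name `etl` are hypothetical; only the `-q` and `-c` flags come from the notes above. On the task side, an operator picks its queue through its `queue` argument.

```python
import shlex


def worker_command(queue=None, concurrency=None):
    """Build the `airflow worker` command line as a shell-safe string."""
    argv = ["airflow", "worker"]
    if queue is not None:
        # -q restricts this worker to tasks sent to the named queue.
        argv += ["-q", queue]
    if concurrency is not None:
        # -c sets the number of worker processes.
        argv += ["-c", str(concurrency)]
    return " ".join(shlex.quote(arg) for arg in argv)


if __name__ == "__main__":
    print(worker_command(queue="etl", concurrency=8))
```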

# deployment
  * use [clean deploy](https://clean-deploy.internal.justin.tv/#/home/d8a/zenyatta) to build a production release
  * run puppet on all airflow nodes (master and workers) to push the release
  * if there are any new connections in `/etc/zenyatta/connections.yaml`, or credentials have changed for a connection:
    
    ```shell
    ssh <airflow-master>
    sudo su - airflow
    source /etc/zenyatta/zenyatta.env
    cd /opt/twitch/zenyatta/current
    python init_db.py
    ```
    
  * you *may* have to restart the scheduler on the master node for changes to take effect.
    ```
    sudo svc -d /etc/service/airflow-scheduler
    sudo svc -u /etc/service/airflow-scheduler
    ```
  * you *may* have to restart the workers on the worker nodes for changes to take effect.
    ```
    sudo svc -d /etc/service/airflow-worker
    sudo svc -u /etc/service/airflow-worker
    ```
    * note that running tasks will run to completion when restarting the worker.
      * you can overcome this with: `sudo killall -u airflow`
      * you must restart the workers if you kill them
    
  * releases will only affect new tasks
    * ideally you would release when tasks aren't running
    * if tasks fail due to a release, fear not! every pipeline can be backfilled with high quality data:
      ```shell
      ssh <airflow-master>
      sudo su - airflow
      source /etc/zenyatta/zenyatta.env
      cd /opt/twitch/zenyatta/current
      airflow backfill <pipeline id> -s YYYY-MM-DD -e YYYY-MM-DD
      ```
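
A small sketch for building the backfill command with validated dates, assuming you want to catch a malformed or reversed date range before running it on the master. The helper `backfill_command` and the pipeline id `my_pipeline` are hypothetical; the command shape is the one shown above.

```python
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d"


def backfill_command(pipeline_id, start, end):
    """Return the argv list for `airflow backfill` over [start, end]."""
    # strptime raises ValueError if a date is not a valid YYYY-MM-DD value.
    start_dt = datetime.strptime(start, DATE_FORMAT)
    end_dt = datetime.strptime(end, DATE_FORMAT)
    if end_dt < start_dt:
        raise ValueError("end date precedes start date")
    return [
        "airflow", "backfill", pipeline_id,
        "-s", start_dt.strftime(DATE_FORMAT),
        "-e", end_dt.strftime(DATE_FORMAT),
    ]


if __name__ == "__main__":
    print(" ".join(backfill_command("my_pipeline", "2017-01-01", "2017-01-07")))
```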

  
# provisioning
  * in [the airflow directory](https://git-aws.internal.justin.tv/d8a/provisioning/tree/master/airflow/twitch-web-aws)
    * master node
    * rds instance
    * asg
    * elb
    * security group
  * bringing up a new master is safe: it leaves the existing RDS instance untouched
  * in the event that the RDS instance must be nuked, first stop the scheduler on airflow-master so no new tasks are created or started

# TODO    
  
* testing:
    * flake8
    * code coverage
   
* resumability improvements:
    * after wal-e backup fetch is complete, md5 the directory and output to a file
      * upon resuming the pipeline in the future check to see if said file exists,
        * if file exists: md5 directory and compare
            * if md5's are equal, skip backup fetch
            * if not, redo backup fetch
        * else: do backup fetch
        
    * be more graceful if a table doesn't exist.. but this seems bad
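
The md5 check in the resumability item above could be sketched roughly as follows. This is not existing zenyatta code: the function names are hypothetical, and it assumes the checksum file is stored outside the backup directory so that writing it does not change the digest.

```python
import hashlib
import os


def md5_of_directory(root):
    """Return one md5 hex digest covering every file under root."""
    digest = hashlib.md5()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # fix the traversal order so the digest is deterministic
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # Hash the relative path too, so renamed files change the digest.
            digest.update(os.path.relpath(path, root).encode("utf-8"))
            with open(path, "rb") as fh:
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    digest.update(chunk)
    return digest.hexdigest()


def record_checksum(backup_dir, checksum_file):
    """Call after a completed backup fetch. checksum_file must be outside backup_dir."""
    with open(checksum_file, "w") as fh:
        fh.write(md5_of_directory(backup_dir))


def needs_backup_fetch(backup_dir, checksum_file):
    """Return True when the fetch must be (re)done, False when it can be skipped."""
    if not os.path.exists(checksum_file):
        return True  # no record of a completed fetch
    with open(checksum_file) as fh:
        recorded = fh.read().strip()
    return recorded != md5_of_directory(backup_dir)
```

On resume, the pipeline would call `needs_backup_fetch` first and only run the wal-e backup fetch (followed by `record_checksum`) when it returns `True`.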


# Notes on Airflow 
* airflow is a workflow manager
    * schedules workflows based on some period of time
    * workflows can also be triggered manually
    * can backfill a workflow over a period of days/months/years?
    * task state is stored in some type of relational database (postgres, sqlite, mysql, rds, etc)
        * and task metadata, like connection host, login, password, port, technology
    * can use celery to fan tasks out to celery workers
        * celery coordination happens through an elasticache instance (redis)
        * no additional config needed other than having a worker running on an instance pointing at the same redis server
          * workers can use queues to limit which tasks run on which workers
* airflow is not a streaming solution
    * airflow is not spark, or hadoop -- but can be used to orchestrate those technologies
        * the use case is "monitored units of work whose failure I care about"
        * built-in retries over a controlled period of time, with a controlled number of retries
* workflows are represented as a DAG
    * tasks that have a parent task can be dependent on those tasks, or not
    * if tasks are dependent on parents, and parents fail, that path through the DAG will also fail
    * tasks can pass messages
    * tasks get context when run (think: the time they're run, etc)
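
To make the DAG failure behaviour concrete, here is a toy model in plain Python (it does not use Airflow and is not how Airflow implements it). It assumes every task requires all of its parents to succeed, which matches Airflow's default `all_success` trigger rule. The task names are made up for illustration.

```python
from collections import deque


def blocked_by_failures(children, failed):
    """Return the tasks that cannot run because an upstream task failed.

    children maps each task to the list of tasks that depend on it.
    failed is the set of tasks that failed outright.
    """
    blocked = set()
    queue = deque(failed)
    while queue:
        task = queue.popleft()
        for child in children.get(task, []):
            # Walk downstream once per task; already-failed tasks stay "failed".
            if child not in blocked and child not in failed:
                blocked.add(child)
                queue.append(child)
    return blocked


if __name__ == "__main__":
    # extract -> transform -> load, and extract -> audit
    dag = {"extract": ["transform", "audit"], "transform": ["load"]}
    print(sorted(blocked_by_failures(dag, {"transform"})))
```

In this example only the path through `transform` is affected; `audit` hangs off `extract` directly and still runs.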
    


# development workflows
### upgrade version of postgres for non-RDS pipelines
  * change postgres version in `zenyatta/docker/wal-pitr/Dockerfile`
  * build new container
```
docker build -t d8a/postgres-pitr .
```
  * then prepare and push it to AWS
```
# tag container for the AWS Elastic Container Registry repo
docker tag d8a/postgres-pitr:latest 465369119046.dkr.ecr.us-west-2.amazonaws.com/d8a/postgres-pitr:latest
# push container to the AWS Elastic Container Registry repo
docker push 465369119046.dkr.ecr.us-west-2.amazonaws.com/d8a/postgres-pitr:latest
```
 * ensure `postgresql-client-<version>` is installed on worker nodes via puppet in `systems/puppet/modules/zenyatta`
 * ensure worker ami is set in `/etc/zenyatta/airflow.cfg` via `systems/puppet/modules/zenyatta/templates/airflow.cfg.erb`
 * more info: http://docs.aws.amazon.com/AmazonECR/latest/userguide/docker-push-ecr-image.html



# Connection metadata json output
Zenyatta extracts tables from a database into csv and parquet files and saves them in an s3 bucket. Along with the database snapshot, a 'metadata' json file that describes the database schema is generated and saved in the same directory.

  * metadata.json is the original metadata file that describes the schema of a database. It keeps only the table names and column attributes, so there is no way to detect the field that uniquely identifies rows in the snapshot tables.
  * metadata.v1.json is an updated version of the original. The version number is appended to the file name to keep track of the change. Besides column information (`columns`), v1 also includes `primary_key`, `foreign_keys` and `indexes`. Below is the schema of metadata.v1.json.
```
{
  connection_id :                                  # connection id is the key of metadata.v1 json file
    "tables": [                                    # "tables" consists of a list of table objects
      {
        table1 : {                                 # table1 is the name of a table object
          "primary_key": {                         # "primary_key" is the key of a primary_key object
            "columns": [ string, string, ...]      # "columns" lists the names of the primary key columns
          },
          "columns" : [                            # "columns" is the key of a list of column objects
            {                                      # column object contains default, nullable, type, autoincrement and name
              "default": string,
              "nullable": boolean,
              "type": string,
              "autoincrement": boolean,
              "name": string
            },
            ...
          ],
          "indexes": [                             # "indexes" is the key of a list of index objects
            {
              idx1: {                              # idx1 is the name of an index
                "columns": [ string, string ... ], # "columns" is the list of columns that make up the index
                "unique": boolean                  # "unique" indicates whether the index is a unique index
              }
            }
          ]
        }
      }
    ]
}
```
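
A minimal sketch of reading primary keys out of a metadata.v1.json document. It assumes the connection id maps to an object holding the `"tables"` list, which the schema above implies but does not spell out. The function name and the sample connection and table names are hypothetical.

```python
import json


def primary_keys(metadata):
    """Map (connection_id, table_name) to the list of primary key columns."""
    result = {}
    for connection_id, body in metadata.items():
        # Each entry in "tables" is an object keyed by the table name.
        for entry in body["tables"]:
            for table_name, table in entry.items():
                columns = table.get("primary_key", {}).get("columns", [])
                result[(connection_id, table_name)] = columns
    return result


if __name__ == "__main__":
    # Made-up sample that follows the schema shape above.
    sample = json.loads("""
    {
      "example_connection": {
        "tables": [
          {"users": {"primary_key": {"columns": ["id"]}, "columns": [], "indexes": []}}
        ]
      }
    }
    """)
    print(primary_keys(sample))
```

This is the lookup that metadata.json could not support: given a snapshot table, find the columns that uniquely identify its rows.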

