Mekansm
========


Mekansm is Twitch's middleware between the deployment stack and [AWS CodeDeploy](https://aws.amazon.com/codedeploy/).


Scope and goals
---------------

Mekansm, as a microservice, has a limited scope:

1. Work as an effective access layer to AWS CodeDeploy
2. Expose a self-documenting REST API to consumers
3. Maintain a local cache of the CodeDeploy state


Use Cases
---------

Mekansm's use cases were originally designed to solve problems in the Deployment Validation project, but the design aims for serendipitous reuse of its resources and is not coupled to Deployment Validation in any way.

1. Start a new deployment
2. Search through existing and past deployments, by:
    * Deployment
    * Project owner
    * Project repository
    * Environment (production, staging, etc.)
    * Datacenter
    * Deployment Status
    * Host (node)
3. Get detailed information for a deployment
4. Get information for deployments on a node

Please read the [API Documentation](http://mekansm/) for details on how these use cases are solved as API endpoints.
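
As a rough illustration of use case 2, the sketch below builds a search URL from a set of filters. The `/deployments` path and the filter names (`environment`, `status`) are assumptions made for this example, not the documented API; the `filter[...]` form follows the query parameter family that JSON API reserves for filtering. Check the API documentation for the real endpoint and parameter names.

```python
from urllib.parse import urlencode

BASE_URL = "http://mekansm"  # host taken from the API documentation link above


def build_search_url(base_url, **filters):
    """Return a search URL; the path and filter names are hypothetical."""
    # Skip filters that were not supplied and keep a stable parameter order.
    params = [
        (f"filter[{name}]", value)
        for name, value in sorted(filters.items())
        if value is not None
    ]
    query = urlencode(params)
    url = f"{base_url}/deployments"
    return f"{url}?{query}" if query else url


url = build_search_url(BASE_URL, environment="staging", status="Succeeded")
print(url)
```

The square brackets are percent-encoded by `urlencode`, which keeps the URL safe to pass to any HTTP client.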


Subsystems
----------

Mekansm provides resources via an HTTP API. The core application follows the [JSON API 1.0 specification](http://jsonapi.org/format/1.0/), and is self-documenting thanks to the [Swagger API framework](http://swagger.io).
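
For readers unfamiliar with JSON API 1.0, the sketch below shows the general shape of a single-resource document: a top-level `data` member holding `type`, `id` and `attributes`. The `deployments` type and the attribute names are illustrative assumptions, not Mekansm's actual schema.

```python
import json


def to_jsonapi_document(resource_type, resource_id, attributes):
    """Wrap a flat record in a JSON API 1.0 single-resource document."""
    return {
        "data": {
            "type": resource_type,
            "id": str(resource_id),  # JSON API requires ids to be strings
            "attributes": dict(attributes),
        }
    }


# Hypothetical deployment record; field names are made up for illustration.
document = to_jsonapi_document(
    "deployments",
    "d-EXAMPLE",
    {"environment": "staging", "status": "Succeeded"},
)
print(json.dumps(document, indent=2))
```

JSON API documents are exchanged with the `application/vnd.api+json` media type.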

Mekansm also exposes internal status and statistics via the [CPStats](http://docs.cherrypy.org/en/3.3.0/refman/lib/cpstats.html) package.


Interfaces
----------

1. [AWS CodeDeploy](https://aws.amazon.com/codedeploy/), the service that performs the actual deployments and acts as the source of truth for any information provided by Mekansm.
2. [PostgreSQL](http://www.postgresql.org), which Mekansm uses as a cache for CodeDeploy facts, and to store a convenient history of deployments.
3. [PagerDuty](https://www.pagerduty.com), for alerting and warnings.


Implementation
--------------

Mekansm is written in [Python](https://www.python.org/), using the [CherryPy web framework](http://www.cherrypy.org/).

Access to AWS CodeDeploy is done using Amazon's official [Boto 3](http://aws.amazon.com/sdk-for-python/) library.
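
A minimal sketch of what a CodeDeploy lookup through Boto 3 might look like. `get_deployment` and its `deploymentInfo` response key belong to the Boto 3 CodeDeploy client; the helper name `deployment_status` and the stub client are assumptions for this example and are not taken from the Mekansm code base. The client is passed in so that the helper can be exercised without AWS credentials.

```python
def deployment_status(client, deployment_id):
    """Return the CodeDeploy status string for one deployment.

    `client` is expected to behave like boto3's CodeDeploy client, e.g.
    boto3.Session(profile_name="dta").client("codedeploy").
    """
    response = client.get_deployment(deploymentId=deployment_id)
    return response["deploymentInfo"]["status"]


class StubCodeDeployClient:
    """Stand-in for the real client so the sketch runs offline."""

    def get_deployment(self, deploymentId):
        return {
            "deploymentInfo": {
                "deploymentId": deploymentId,
                "status": "Succeeded",
            }
        }


print(deployment_status(StubCodeDeployClient(), "d-EXAMPLE"))
```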

Access to PostgreSQL is done via [SQLAlchemy](http://www.sqlalchemy.org) and [Psycopg2](http://initd.org/psycopg/).

For more on the internal implementation of Mekansm, see the HACKING file.


Configuration
-------------

Mekansm is configured through a YAML file that is passed as an argument at initialization. An example config file:

    aws:
      -
        profile: dta
        region: us-west-2
      -
        profile: web
        region: us-west-2
      -
        profile: systems
        region: us-west-2
    database:
        host: 127.0.0.1
        port: 5432
        user: mekansm
        password: <YOUR DB PASSWORD>
        database: mekansm
        engine:
            echo: true
            echo_pool: true
            strategy: threadlocal
            max_overflow: 40
    pagerduty:
        key: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
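
To show how the `database` section might be consumed, the sketch below turns it into a SQLAlchemy connection URL for the Psycopg2 driver. The helper name `database_url` is hypothetical, the password is a placeholder, and the config is written as an already-parsed dictionary (the result of loading the YAML file). Mekansm's own engine setup may differ.

```python
from urllib.parse import quote_plus


def database_url(db_config):
    """Build a postgresql+psycopg2 URL from the `database` config section."""
    user = quote_plus(str(db_config["user"]))
    # Escape the password so characters such as "@" or "/" do not break the URL.
    password = quote_plus(str(db_config["password"]))
    host = db_config["host"]
    port = db_config["port"]
    name = db_config["database"]
    return f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{name}"


# Values mirror the example config; the password is a placeholder.
config = {
    "database": {
        "host": "127.0.0.1",
        "port": 5432,
        "user": "mekansm",
        "password": "s3cr3t@example",
        "database": "mekansm",
    }
}
print(database_url(config["database"]))
```

The keys under `engine` (`echo`, `echo_pool`, `max_overflow`, and so on) look like keyword arguments for SQLAlchemy's `create_engine`, though that mapping is inferred from their names rather than stated in this document.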



Security
--------

Access to AWS operations is done via separate account credentials configured on the servers running Mekansm, in the `~/.aws/credentials` file::

    [dta-group]
    aws_access_key_id=<YOUR ACCESS KEY ID>
    aws_secret_access_key=<YOUR SECRET KEY>
    
and in the `~/.aws/config` file::

    [dta-group]
    region = us-west-2
    [profile dta-group]
    region = us-west-2

Note the two entries. Both the `[dta-group]` and the `[profile dta-group]` sections are *required* in `~/.aws/config`; without them botocore won't find the region of the AWS profile, and Mekansm will conclude that it is misconfigured.
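
The requirement above can be checked before starting the service. This sketch parses the contents of `~/.aws/config` with Python's standard `configparser` and reports which of the two sections lack a `region`. The function name `missing_region_sections` is an assumption for this example; it is not part of Mekansm or botocore.

```python
import configparser


def missing_region_sections(config_text, profile):
    """Return the section names that lack a `region` for the given profile."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # Both the bare section and the "profile <name>" section are expected.
    expected = [profile, f"profile {profile}"]
    return [
        section
        for section in expected
        if not parser.has_option(section, "region")
    ]


SAMPLE = """
[dta-group]
region = us-west-2
[profile dta-group]
region = us-west-2
"""

print(missing_region_sections(SAMPLE, "dta-group"))
```

To check a real file, pass in `pathlib.Path("~/.aws/config").expanduser().read_text()` instead of `SAMPLE`.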


Mekansm resources are configured behind AWS Security Groups.


Capacity Planning and availability
----------------------------------

Mekansm is a critical part of the deployment pipeline. It needs to handle spikes of requests from its consumers. A Mekansm outage, caused either by an inability to handle traffic or by AWS or PostgreSQL being inaccessible, would mean that the organization can't make new deployments.

Mekansm traffic is expected to be very low, and thus its capacity needs should be very modest. That said, at the time of writing no work has been done to calculate the expected load from the historical data in Skadi DB. Mekansm should be able to handle at least 1.5x the highest expected spikes. See [DTA-836](https://twitchtv.atlassian.net/browse/DTA-836) for updates.
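
The 1.5x target is simple arithmetic once peak load is known. The sketch below uses made-up request rates purely for illustration; no real measurements from Skadi DB are implied, and the function name `required_capacity` is hypothetical.

```python
import math

HEADROOM = 1.5  # target from the text: at least 1.5x the highest expected spike


def required_capacity(peak_requests_per_second, headroom=HEADROOM):
    """Return the request rate the service should sustain, rounded up."""
    highest = max(peak_requests_per_second)
    return math.ceil(highest * headroom)


# Made-up peak rates (requests per second), not real data.
sample_peaks = [4, 12, 7]
print(required_capacity(sample_peaks))
```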


Installing, running and testing
-------------------------------

Please read the HACKING file.

Updating encrypted configs (PostgreSQL/AWS bits)
------------------------------------------------

1. Pull down the current configs:

        $ scripts/kms-decrypt.sh -a dta-mekansm-staging -d configs

2. Update the configs in the `configs` directory.
3. Push the updated configs back using the named key:

        $ scripts/kms-encrypt.sh -a dta-mekansm-staging -d configs

Updating the AMI that includes the Mekansm app
-----------------------------------------------

1. Build an updated AMI (if needed):

        $ packer build --only=amazon-ebs-prod ./packer.json

2. Run Terraform to update the currently used image:

        $ terraform plan -var 'ec2_ami=ami-ed02e38d' -var 'username=mekansm' -var 'password=mydbpassword'
        $ terraform apply -var 'ec2_ami=ami-ed02e38d' -var 'username=mekansm' -var 'password=mydbpassword'

