# Splash Mountain

This is a collection of terraform, ansible, and packer scripts that set up an ELK stack in EC2.  The stack can also push logs to S3 for later analysis with tools like Spark.

## Environment

The following environment variables must be set for commands below to work:

```
export AWS_ACCESS_KEY="<your AWS access key>"
export AWS_SECRET_KEY="<your AWS secret key>"

export AWS_ACCESS_KEY_ID="$AWS_ACCESS_KEY"
export AWS_SECRET_ACCESS_KEY="$AWS_SECRET_KEY"
```
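
To confirm the variables are set before running anything, a small check along these lines may help.  This is only a sketch: the function name `check_aws_env` is made up for illustration and is not part of this repo.

```shell
# Hypothetical helper (not part of this repo): report which of the
# required AWS variables are unset or empty in the current shell.
check_aws_env() {
    missing=0
    for name in AWS_ACCESS_KEY AWS_SECRET_KEY AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY; do
        # Look up the value of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Running `check_aws_env` prints the name of each missing variable to stderr and returns non-zero if any are missing.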

You'll also need the ids_management private key (ask Jorge).  Add it to your SSH agent with `ssh-add ~/.ssh/ids_management`.

Finally, you'll need to have terraform, terraform-inventory, and ansible installed.

### Installing Ansible
```
brew install ansible
```

### Installing Terraform
The development version of terraform is now required, as it fixes quite a few annoying bugs.  You'll need at minimum go 1.4 installed, GOPATH set, git, and mercurial.

```
brew install git
brew install mercurial

go get -d github.com/hashicorp/terraform
cd $GOPATH/src/github.com/hashicorp/terraform
make updatedeps
make dev
```

You will now have terraform installed to `$GOPATH/bin`.

### Installing terraform-inventory
terraform-inventory is a simple inventory tool for ansible that builds an ansible inventory from terraform's tfstate files.  Install it with `go get`:

```
go get github.com/adammck/terraform-inventory
```

You'll now have terraform-inventory in `$GOPATH/bin`; make sure that directory is on your `PATH`.  The `hosts` scripts found in this repo use it together with the tfstate files.
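
With all three tools installed, a quick check like the one below can confirm they are reachable from your shell.  This is a sketch only: `missing_tools` is a hypothetical helper, not something shipped in this repo.

```shell
# Hypothetical helper (not part of this repo): print each named command
# that cannot be found on PATH, and return non-zero if any are missing.
missing_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            status=1
        fi
    done
    return "$status"
}
```

For this repo the call would be `missing_tools terraform terraform-inventory ansible-playbook`; no output means everything was found.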

## Usage
Currently only a production environment is defined; a `dev` environment is in development.

### Terraform

Terraform is currently used purely for infrastructure management and not any configuration management, although there is a provisioning script to automate ansible provisioning post instance creation.  These are the currently defined environments:

#### Production
This is the production definition for our ELK stack (Elasticsearch, Logstash, Kibana).  Please do not run terraform directly on these files without first consulting Jorge (jorge@justin.tv).  Production contains the following resources:

- ids-logstash-ingest
  - elb + instances + route53 for logstash ingest
- ids-elasticsearch
  - instances + route53 to spin up elasticsearch with 4 TB of storage each
- ids-kibana
  - small kibana server + nginx (WIP)

### Ansible

Ansible is currently used for configuration management on instances after terraform creates them.  The following sections detail how to use ansible in the environments that are currently available.  Run any commands shown from that environment's directory.

#### Production

Before running any ansible commands:
```
git pull origin master # pull most recent master to ensure terraform is up to date
terraform refresh # ensure the local state matches the AWS State
```

To run a full dry run:

```
ansible-playbook -i hosts site.yml -C
```

To apply configuration changes to all hosts:

```
ansible-playbook -i hosts site.yml
```

To partially apply configuration changes (note the asterisk):

```
ansible-playbook -i hosts -l 'ids-kibana*' site.yml # quoted so the shell passes the pattern through to ansible
```

# TODO

- ~~lock down kibana with ssl~~
- add elb + route53 for kibana

# Bugs

## Elasticsearch throughput

Our elasticsearch cluster has a recurring issue where an initializing or relocating index stalls and takes multiple days to complete.  During this time, throughput drops by multiple orders of magnitude and we're unable to index all of our data correctly.  There is currently a manual workaround for this.

### Identifying the issue

When this issue is occurring, the following things can be observed:

- ids-kibana.internal.justin.tv shows a dramatically reduced document count
- ids-elasticsearch.internal.justin.tv/_plugin/kopf shows one node at or near 100% cpu
- the node shown to be at 100% will have one shard in the most recent index either relocating or initializing
- the recovery status will show that shard as being in the "translog" stage, and the translog progress will be very slow

This issue is being tracked in the following GitHub issue: https://github.com/elastic/elasticsearch/issues/9226
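
If you'd rather check the recovery stage from a terminal than from kopf, something like the following sketch may work.  It assumes the elasticsearch REST API is reachable at the same address that serves kopf (override `ES_URL` if it is not), and the function names are made up for illustration.

```shell
# Assumption: the REST API answers at the same base URL that serves kopf.
ES_URL="${ES_URL:-http://ids-elasticsearch.internal.justin.tv}"

# Read "index shard stage target_host" rows on stdin and print the
# index, shard, and target host of every row whose stage is translog.
translog_shards() {
    awk 'tolower($3) == "translog" { print $1, $2, $4 }'
}

# Ask the cat recovery API for only the columns the filter relies on.
list_stalled_recoveries() {
    curl -sS "$ES_URL/_cat/recovery?h=index,shard,stage,target_host" | translog_shards
}
```

`list_stalled_recoveries` only shows which shards are in the translog stage; whether one is actually stalled still has to be judged from its progress over time.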

### Workaround

Follow the steps below to restore the cluster to a working state.  To issue the API commands, either use the REST panel in kopf or curl directly against elasticsearch.  The examples assume:

- The broken index is logstash-2015.03.25
- The broken shard is 4
- The broken node is ip-10-192-74-220

Replace all three of these values with your own if running through these instructions.

1. Disable shard allocation - (kopf lock button on clusters tab does this too)

	```
	PUT /_cluster/settings
	{
	    "transient" : {
	        "cluster.routing.allocation.enable" : "none"
	    }
	}
	```

2. Cancel the broken shard allocation

	```
	POST /_cluster/reroute
	{
	    "commands": [
	        {
	            "cancel": {
	                "index": "logstash-2015.03.25",
	                "shard": 4,
	                "node": "ip-10-192-74-220"
	            }
	        }
	    ]
	}
	```

3. Trigger an index transaction log flush

	```
	POST /logstash-2015.03.25/_flush
	```

4. Enable shard allocation

	```
	PUT /_cluster/settings
	{
	    "transient" : {
	        "cluster.routing.allocation.enable" : "all"
	    }
	}
	```
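
The four steps above can also be issued with curl.  The script below is a rough sketch, not a tested runbook: it assumes the REST API is reachable at the same address that serves kopf, and the function names and the `DRY_RUN` switch are made up for illustration.  Set `DRY_RUN=1` to print the requests instead of sending them.

```shell
# Assumption: the REST API answers at the same base URL that serves kopf.
ES_URL="${ES_URL:-http://ids-elasticsearch.internal.justin.tv}"
# Example values from above; replace all three with your own.
INDEX="${INDEX:-logstash-2015.03.25}"
SHARD="${SHARD:-4}"
NODE="${NODE:-ip-10-192-74-220}"

# Send one request, or only print it when DRY_RUN=1.
es_request() {
    method=$1
    path=$2
    body=${3:-}
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$method $path $body"
        return 0
    fi
    if [ -n "$body" ]; then
        # -f makes curl return non-zero on an HTTP error status.
        curl -sSf -X "$method" -H 'Content-Type: application/json' "$ES_URL$path" -d "$body"
    else
        curl -sSf -X "$method" "$ES_URL$path"
    fi
}

# Steps 1 and 4: set cluster.routing.allocation.enable to "none" or "all".
set_allocation() {
    es_request PUT /_cluster/settings \
        "{\"transient\":{\"cluster.routing.allocation.enable\":\"$1\"}}"
}

# Runs steps 1 to 4 in order and stops at the first request that fails.
fix_stalled_shard() {
    set_allocation none &&
    es_request POST /_cluster/reroute \
        "{\"commands\":[{\"cancel\":{\"index\":\"$INDEX\",\"shard\":$SHARD,\"node\":\"$NODE\"}}]}" &&
    es_request POST "/$INDEX/_flush" &&
    set_allocation all
}
```

Because the chain stops at the first failure, a failed cancel or flush leaves shard allocation disabled.  Re-enable it by hand with `set_allocation all` once you've sorted out the failure.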
