Benchmarking
------------

`consul_monitor_analyzer.py` is a utility that provides stats on `consul monitor -log-level debug` data.

Live Statistics
---------------

Only live http calls are supported at this time.

To gather live http client statistics: - This does not include remote client calls

```bash
consul monitor -log-level debug | ./consul_monitor_analyzer.py --pipe --interval 10
```

To gather all http server statistics:

```bash
while : ; do pkill -SIGUSR1 -x consul; sleep 10; done &
tail -f /opt/consul/log/consul-error.log | ./consul_monitor_analyzer.py --pipe --telemetry
```

Example output:
```bash
14:38:25 >> GET:   16 2.00qps (100.00%) non-GET:    0 0.00qps (  0.00%) waits:   15 1.88qps ( 93.75%) stale:   15 1.88qps ( 93.75%) Total: 16
14:38:35 >> GET:    9 2.25qps (100.00%) non-GET:    0 0.00qps (  0.00%) waits:    9 2.25qps (100.00%) stale:    9 2.25qps (100.00%) Total: 9
14:38:46 >> GET:    8 1.60qps (100.00%) non-GET:    0 0.00qps (  0.00%) waits:    7 1.40qps ( 87.50%) stale:    7 1.40qps ( 87.50%) Total: 8
14:38:57 >> GET:    7 0.88qps (100.00%) non-GET:    0 0.00qps (  0.00%) waits:    6 0.75qps ( 85.71%) stale:    6 0.75qps ( 85.71%) Total: 7
14:39:07 >> GET:   15 2.50qps (100.00%) non-GET:    0 0.00qps (  0.00%) waits:   15 2.50qps (100.00%) stale:   15 2.50qps (100.00%) Total: 15
14:39:17 >> GET:    7 0.78qps (100.00%) non-GET:    0 0.00qps (  0.00%) waits:    6 0.67qps ( 85.71%) stale:    6 0.67qps ( 85.71%) Total: 7
14:39:27 >> GET:   13 1.86qps (100.00%) non-GET:    0 0.00qps (  0.00%) waits:   12 1.71qps ( 92.31%) stale:   12 1.71qps ( 92.31%) Total: 13
```

Bulk Statistics
---------------

Gather data on all intended hosts. File must be named with FQDN at minimum, E.g.: `<fqdn>.log`. This is required to identify the host and pop name for analysis.

### Gathering Statistics

* Create a list of servers to be used by `pssh`. This data can be gathered via `consul members -wan`
* To gather `consul monitor` logs on servers. E.g. Gathering for 10 mins:
  ```
  parallel-ssh -h consul-servers.txt -e err -o out 'nohup timeout 10m consul monitor -log-level=debug > $(hostname -f).consul.monitor.log &'
  ```
* Rsync data to a local directory `results`
* Run analytics via `./consul_monitor_analyzer.py -d results --summary`

Usage
-----

At minimum, either `-d` or `-p` has to be provided dependent on usage.

```
usage: consul_monitor_analyzer.py [-h] (-d DIR | -p) [--time-trend-by-dc]
                                  [--summary] [-i INTERVAL] [-t] [--detailed]
                                  [--timezone-offset TIMEZONE_OFFSET]

optional arguments:
  -h, --help            show this help message and exit
  -d DIR, --dir DIR     dir of files
  -p, --pipe            Analyze consul http calls from stdin
  --time-trend-by-dc    time series trend by DC. Pass --detailed to break down
                        by individual host
  --summary             Summary of all call types. Pass --detailed to break it
                        down by dc
  -i INTERVAL, --interval INTERVAL
                        Interval in seconds. Defaults to 60
  -t, --telemetry       Can only be used with --pipe. Expects telemetry data
  --detailed
  --timezone-offset TIMEZONE_OFFSET
                        Timezone offset in hours. This is to account for some
                        host that are configured in UTC instead of PST.
                        Defaults to 8

```

Useful commands
---------------

* Ping another consul server follower
```
ping $(consul operator raft list-peers | grep follower | awk '{print $(NF-3)}' | tail -n 1 | awk -F : '{print $1}')
```

* Check number of problemetic consul members
```
while : ; do consul members | egrep -v 'alive|left|Protocol' | wc -l; sleep 2; done
```

* Monitor QPS every 5 seconds
```
consul monitor -log-level=debug | python3 /opt/loadgen/consul_monitor_analyzer.py --pipe --interval 5
```

* SSH to each server from each cluster - VPC name must be the same as your username
```
cssh -l ubuntu $(printf "%s\n" {1..3} | xargs -n 1 -P 10 -I xxx aws --profile twitch-vidcs-dev ec2 describe-instances --filters "Name=tag:consul-servers,Values=$(whoami)-consul-test-*" "Name=instance-state-name,Values=running" "Name=tag:Name,Values=*xxx-server*" | jq -r '[.Reservations[].Instances[].PrivateIpAddress][0]')
```

* Tail consul logs
```
tail -f /opt/consul/log/*
```

* Tail load generation logs
```
tail -f /var/log/loadgen.log
```

Tests
-----

```bash
pytest
```

Additional Reading
------------------

* [AWS Consul Test Google Docs](https://docs.google.com/document/d/17jHZZVdcX6sY9dI2DsCfgibJnyufc6HFx0M9YOaBYeg/edit#)

TODO
----
* Fix unnecesary use of `--timezone-offset` because most hosts report time in PST/PDT and some hosts are in UTC.
