# ENVOY
[Godoc](https://godoc.internal.justin.tv/code.justin.tv/common/envoy)

Envoy is a lightweight Go library for exposing the health of your services. Initialize Envoy and
tell it about your states, and Envoy will serve text and JSON responses about your service over HTTP on
a port you choose, at the path `/health`. The important difference between Envoy and an external health checker
is that Envoy is embedded in your service, so your service code can give it direct state updates.

If you had, for example, a service called the Query Aggregator, you might want to know the state of the
backends you talk to, your average query duration, and the ratio of successful queries to total number
of queries. You would first initialize Envoy like so.

```
// Start Envoy's own HTTP server; /health is served on the host and port below.
envoy.Init(&envoy.Config{
	Service: "Query Aggregator",
	ServerConf: &envoy.HTTPServerConf{
		Host: "0.0.0.0",
		Port: "5555",
	},
})

// Each state takes a key (used in the JSON response), a human-readable
// description, and a starting condition.
backendState := envoy.NewState("backends", "Backends",
	envoy.ConditionUnknown)
queryTimeState := envoy.NewState("query_time", "Average Query Duration",
	envoy.ConditionNormal)
ratioSuccessState := envoy.NewState("ratio_success", "Ratio of successful queries",
	envoy.ConditionNormal)
```

Your states come up in the condition that you specify -- it is up to you as the application designer to decide the
correct condition based on what your service knows so far. Here we have chosen to start the `backends` state in the
`Unknown` condition to reflect that our service has not yet queried its backend list.

After initializing Envoy, you can query `/health` to see how your application is doing.
```
INFO: Service: Query Aggregator
INFO: Build: Pkg: code.justin.tv/common/envoy/example
INFO: Uptime: 7s
UNKNOWN: Global state; 1 state
UNKNOWN: Backends; Starting
NORMAL: Average Query Duration; Starting
NORMAL: Ratio of successful queries; Starting

```
All of the states are present and render our descriptions.

Now, let's imagine that we've queried the backends and found them successfully working. We've also started
to receive some production traffic, so our other states have measurements as well.

```
backendState.Normal("12 out of 12 connected")
queryTimeState.Normal("22 ms")
ratioSuccessState.Normal(".95")
```

You can then query the health of your service again.


```
$ curl localhost:5555/health
INFO: Service: Query Aggregator
INFO: Build: Pkg: code.justin.tv/common/envoy/example
INFO: Uptime: 62s
NORMAL: Global state; 3 states
NORMAL: Backends; 12 out of 12 connected
NORMAL: Average Query Duration; 22 ms
NORMAL: Ratio of successful queries; .95

```
Envoy passes our message for each condition through to its output. The text you give Envoy is surfaced so that
members of your team can quickly understand why your states are in the conditions that they are in.


If you are checking programmatically, request the result as JSON. For each state, the JSON response also reports the
number of seconds elapsed since each condition was last seen.

```
$ curl -H "content-type: application/json" localhost:5555/health | json
{
  "service": "Query Aggregator",
  "uptime": 2510,
  "current": "NORMAL",
  "states": {
    "backends": {
      "description": "Backends",
      "message": "12 out of 12 connected",
      "current": "NORMAL",
      "history": {
        "NORMAL": 2510
      }
    },
    "query_time": {
      "description": "Average Query Duration",
      "message": "22 ms",
      "current": "NORMAL",
      "history": {
        "NORMAL": 2510
      }
    },
    "ratio_success": {
      "description": "Ratio of successful queries",
      "message": ".95",
      "current": "NORMAL",
      "history": {
        "NORMAL": 2510
      }
    }
  },
  "build": {
    "pkg": "code.justin.tv/common/envoy/example"
  }
}
```
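
A programmatic client could decode this response with `encoding/json`. The following is a minimal sketch, not part of Envoy: the type and function names (`healthReport`, `stateReport`, `parseHealth`) are hypothetical, and the field types are inferred only from the sample above (for example, it assumes `history` values are whole seconds).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// stateReport mirrors one entry under "states" in the sample response.
// ASSUMPTION: history values are whole numbers of seconds, as in the sample.
type stateReport struct {
	Description string           `json:"description"`
	Message     string           `json:"message"`
	Current     string           `json:"current"`
	History     map[string]int64 `json:"history"`
}

// healthReport mirrors the top-level document in the sample response.
type healthReport struct {
	Service string                 `json:"service"`
	Uptime  int64                  `json:"uptime"`
	Current string                 `json:"current"`
	States  map[string]stateReport `json:"states"`
	Build   map[string]string      `json:"build"`
}

// parseHealth decodes a /health JSON body into a healthReport.
func parseHealth(body []byte) (healthReport, error) {
	var r healthReport
	err := json.Unmarshal(body, &r)
	return r, err
}

// sample is a trimmed copy of the response shown above.
const sample = `{
  "service": "Query Aggregator",
  "uptime": 2510,
  "current": "NORMAL",
  "states": {
    "backends": {
      "description": "Backends",
      "message": "12 out of 12 connected",
      "current": "NORMAL",
      "history": {"NORMAL": 2510}
    }
  },
  "build": {"pkg": "code.justin.tv/common/envoy/example"}
}`

func main() {
	r, err := parseHealth([]byte(sample))
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s is %s\n", r.Service, r.Current)
	for key, s := range r.States {
		fmt.Printf("%s: %s (%s)\n", key, s.Current, s.Message)
	}
}
```

A monitoring script would typically branch on the top-level `current` field and fall back to the per-state entries only when it needs detail.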

Now let's suppose that our service loses half of its backends as a result of a network partition. It can still
complete its queries successfully, but the loss of redundancy puts it in an endangered state.

```
backendState.Warning("6 out of 12 connected")
```

This yields a WARNING response.

```
INFO: Service: Query Aggregator
INFO: Build: Pkg: code.justin.tv/common/envoy/example
INFO: Uptime: 121s
WARNING: Global state; 1 state
WARNING: Backends; 6 out of 12 connected
NORMAL: Average Query Duration; 22 ms
NORMAL: Ratio of successful queries; .95

```

Envoy detects that one state is in the WARNING condition and automatically elevates the global condition to
WARNING as well. At this point we may choose to send an email to alert developers about this condition.
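
The elevation rule can be pictured as "the most severe state wins". The sketch below is an illustration only and is not Envoy's code: `globalCondition` and the `severity` table are hypothetical names. The examples in this document show that UNKNOWN outranks NORMAL; where UNKNOWN sits relative to WARNING and CRITICAL is an assumption here.

```go
package main

import "fmt"

// severity ranks conditions from least to most severe.
// ASSUMPTION: UNKNOWN ranks between NORMAL and WARNING; the document only
// shows that UNKNOWN outranks NORMAL.
var severity = map[string]int{
	"NORMAL":   0,
	"UNKNOWN":  1,
	"WARNING":  2,
	"CRITICAL": 3,
}

// globalCondition returns the most severe condition among the states and
// the number of states currently in that condition.
func globalCondition(states map[string]string) (string, int) {
	worst, count := "NORMAL", 0
	for _, cond := range states {
		switch {
		case severity[cond] > severity[worst]:
			worst, count = cond, 1
		case severity[cond] == severity[worst]:
			count++
		}
	}
	return worst, count
}

func main() {
	cond, n := globalCondition(map[string]string{
		"backends":      "WARNING",
		"query_time":    "NORMAL",
		"ratio_success": "NORMAL",
	})
	fmt.Printf("%s: Global state; %d state(s)\n", cond, n)
}
```

With these inputs the sketch reports one state in the WARNING condition, matching the shape of the global line in the output above.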

Finally, more time elapses, and the rest of the backends disconnect. The service is now largely unable to
complete any queries. Our service examines its internal state and then issues the updates to Envoy
accordingly.

```
backendState.Critical("0 out of 12 connected")
ratioSuccessState.Critical("0.1")
```

Envoy now reports two states in the CRITICAL condition.

```
INFO: Service: Query Aggregator
INFO: Build: Pkg: code.justin.tv/common/envoy/example
INFO: Uptime: 187s
CRITICAL: Global state; 2 states
CRITICAL: Backends; 0 out of 12 connected
NORMAL: Average Query Duration; 22 ms
CRITICAL: Ratio of successful queries; 0.1

```

At this point, we would likely alert oncall that the service has suffered a complete loss of functionality.

Messages
---------

Envoy is meant primarily as a tool for diagnosing an unhealthy service. As such, it is important that messages concisely
explain why a state is in a particular condition. You can see this in the example -- at each stage,
we provided a short string that explains our criterion for that condition. Because Envoy knows nothing
about your service, you can put whatever you like here. The best messages tell you
or anyone on your team how to remedy the problem, even without understanding anything about the service
itself.
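
One way to keep messages consistent is to derive the condition and the message from the same measurement. The helper below is a hypothetical sketch (`backendStatus` is not part of Envoy), and its thresholds are assumptions chosen to match the Query Aggregator example: all backends connected is NORMAL, some is WARNING, none is CRITICAL.

```go
package main

import "fmt"

// backendStatus derives a condition name and a short message from the
// number of connected backends.
// ASSUMPTION: the thresholds are illustrative; pick ones that fit your service.
func backendStatus(connected, total int) (condition, message string) {
	message = fmt.Sprintf("%d out of %d connected", connected, total)
	switch {
	case connected == 0:
		return "CRITICAL", message
	case connected < total:
		return "WARNING", message
	default:
		return "NORMAL", message
	}
}

func main() {
	for _, connected := range []int{12, 6, 0} {
		cond, msg := backendStatus(connected, 12)
		fmt.Printf("%s: Backends; %s\n", cond, msg)
	}
}
```

In a real service, the returned message would be passed to the matching call from the example, such as `backendState.Warning(message)`.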

Starting Conditions
---------

It is up to you as the service creator to choose appropriate starting conditions. You may wish to start Envoy before
you have taken measurements of your dependencies, especially if it might take some time to measure those dependencies.

If your service already knows what condition a state should be in when the state is created, then you should just
start the state in that condition. Consider passing the condition in programmatically after you've sampled whatever
it is that the state reflects.

If your service has not yet taken any samples, then you have a choice about how to pick the starting condition. In
general, if the state reflects some outside service that your service depends on, then `Unknown` is probably appropriate.
This will reflect the fact that the outside world exists in some state that you have not yet examined, and so it
would not be appropriate for your service to declare itself as healthy before it has done so.

On the other hand, if the state reflects something that you will measure as you begin to take requests, such as
a query duration or success rate, then `Normal` is likely the correct choice. Choosing `Normal` helps resolve a
chicken-and-egg situation where your service cannot join production until it is healthy but cannot be healthy
until it has seen queries. It might help to think of these sorts of states as healthy in the absence of evidence of
any errors.


Nagios
=========

At Twitch, Nagios is an important tool for alerting oncall to problems in our infrastructure. Envoy is
"batteries-included" in this regard -- you can query any Envoy service from Nagios with just a port number.
For example, here is a Nagios check that verifies the health of Birdcage, a service that supports
Video's Find service.

Source: [services.cfg](https://git-aws.internal.justin.tv/ops/puppet/blob/c921346db0cb80069dd17b19100cf911fa049f1a/modules/twitch_nagios/files/dirty/objects/services.cfg#L2658-L2665)

```
define service {
    host_name                       find1.sfo01
    service_description             Birdcage Health Check
    use                             generic-service-2010-08-17
    check_command                   check_envoy!9172
    contacts                        video_email
    register                        1
}
```

The `check_envoy` command does all that is necessary to create an alert. In this example, it connects to port 9172,
where it expects to find Envoy's health check at the HTTP path `/health`. The check examines the Global state
condition and creates any corresponding alerts. It builds a message from Envoy's standard text output
and mails it, in its entirety, to the listed contacts/contact_groups. Each message therefore
contains the complete state of the service, which removes any ambiguity about why the service is alerting.
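
To make the mapping concrete, here is a rough sketch of how a check could turn Envoy's text output into a Nagios result. It is not the actual `check_envoy` implementation, which is not shown in this document; `exitCode` is a hypothetical name, and the only external facts relied on are the standard Nagios plugin return codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// Standard Nagios plugin return codes.
const (
	nagiosOK       = 0
	nagiosWarning  = 1
	nagiosCritical = 2
	nagiosUnknown  = 3
)

// exitCode scans Envoy-style text output for the global state line and maps
// its condition to a Nagios return code. It returns UNKNOWN when no global
// state line is present or the condition is not recognized.
func exitCode(output string) int {
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		cond, rest, ok := strings.Cut(sc.Text(), ": ")
		if !ok || !strings.HasPrefix(strings.ToLower(rest), "global state") {
			continue
		}
		switch cond {
		case "NORMAL":
			return nagiosOK
		case "WARNING":
			return nagiosWarning
		case "CRITICAL":
			return nagiosCritical
		}
		return nagiosUnknown
	}
	return nagiosUnknown
}

func main() {
	output := "INFO: Service: Query Aggregator\n" +
		"WARNING: Global state; 1 state\n" +
		"WARNING: Backends; 6 out of 12 connected\n"
	fmt.Println("nagios return code:", exitCode(output))
}
```

Only the global line decides the result; the per-state lines travel along as the body of the alert.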

Passive Checks
---------
Passive checks allow your Envoy instance to reach out to Nagios to inform it of your service's health status (the
"passive" part refers to Nagios' frame of reference -- Nagios does not have to do anything to receive these). If you are
using the above template for your check, you don't have to do anything to accept passive checks -- they're already
enabled. Passive checks engage the same hard/soft state machine used by active checks, and Envoy only
emits notifications when the global condition changes. The advantage of passive checks is that they move your
service check in Nagios into the first soft state right away, rather than waiting for Nagios to poll your service
some minutes later.


If you want Envoy to perform passive checks, you need to tell it where the Nagios server is:
`nagios.internal.justin.tv` for dirty hosts and `nagios-clean.internal.justin.tv` for clean hosts. In the `Service`
field of the conf struct, be sure to use exactly the same string that names your Nagios service.
```
envoy.Init(&envoy.Config{
	...
	NscaConf: &envoy.NagiosConf{
		Host: "nagios-clean.internal.justin.tv",
		Port: "5667",
		Service: "Query Aggregator Health Check",
		Node: "my-fqdn-1234ab.sfo01.justin.tv",
		NSCATimeout: envoy.DefaultNSCATimeout,
		NSCAPath: envoy.DefaultNSCAPath,
	},
})

```
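
The "only on change" behaviour can be sketched as follows. This is an illustration under stated assumptions, not Envoy's implementation: `passiveReporter` and `nagiosCode` are hypothetical names, and the line format is the tab-separated service check input accepted by the standard `send_nsca` client (host, service description, return code, plugin output). Whether Envoy formats its submissions exactly this way is not stated in this document.

```go
package main

import "fmt"

// nagiosCode maps a condition name to the standard Nagios return code.
func nagiosCode(condition string) int {
	switch condition {
	case "NORMAL":
		return 0
	case "WARNING":
		return 1
	case "CRITICAL":
		return 2
	default:
		return 3 // UNKNOWN
	}
}

// passiveReporter remembers the last global condition it reported.
type passiveReporter struct {
	node, service, last string
}

// report returns a send_nsca-style line when the global condition differs
// from the previously reported one; otherwise it returns "" and false.
func (p *passiveReporter) report(condition, output string) (string, bool) {
	if condition == p.last {
		return "", false
	}
	p.last = condition
	line := fmt.Sprintf("%s\t%s\t%d\t%s\n",
		p.node, p.service, nagiosCode(condition), output)
	return line, true
}

func main() {
	p := &passiveReporter{
		node:    "my-fqdn-1234ab.sfo01.justin.tv",
		service: "Query Aggregator Health Check",
	}
	for _, cond := range []string{"NORMAL", "NORMAL", "WARNING"} {
		if line, changed := p.report(cond, cond+": Global state"); changed {
			fmt.Printf("%q\n", line)
		}
	}
}
```

The repeated NORMAL update produces no line, so Nagios hears from the service only when the global condition actually moves.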

Build SHA
=========
If you would also like to include the SHA of the commit used to build your service, use

```
"godep go build -ldflags \"-X code.justin.tv/common/envoy.buildSha=$GIT_COMMIT\""
```

in the appropriate place in the build phase of your `.manta.json`. The standard manta invocation already
sets the environment variable for the commit. You will then find the SHA on
the same line as the package name in the text response

```
INFO: Build: Pkg: code.justin.tv/common/envoy/example SHA: 1f73366fc0441d13dbf09c70cef3b9cfb5282c90
```

and in the `build` hash of the JSON response
```
  "build": {
    "sha": "1f73366fc0441d13dbf09c70cef3b9cfb5282c90",
    "pkg": "code.justin.tv/common/envoy/example"
  }
```
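
For readers unfamiliar with the linker's `-X` flag, the sketch below shows the general mechanism: a package-level string variable stays empty unless the linker overrides it at build time. The names here (`buildSha` in package `main`, `buildLine`) are stand-ins for illustration and are not Envoy's own code; the line format is copied from the text response above.

```go
package main

import "fmt"

// buildSha stays empty unless the linker's -X flag overrides it at build time.
// ASSUMPTION: this is a stand-in for the variable inside the Envoy package.
var buildSha string

// buildLine renders the build line of the text response, appending the SHA
// only when one was provided.
func buildLine(pkg, sha string) string {
	line := "INFO: Build: Pkg: " + pkg
	if sha != "" {
		line += " SHA: " + sha
	}
	return line
}

func main() {
	fmt.Println(buildLine("code.justin.tv/common/envoy/example", buildSha))
}
```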

Consul Service
=========
The Puppet repo contains a module you can instantiate to advertise your Envoy listener. The service name is `envoy`,
and the module includes the tags `fqdn=$FQDN` and `service=$YOUR_SERVICE_NAME`. This makes it possible to
gather all the Envoy ports for a particular host or service. A dashboard that consumes this list may be built
later, although its design has yet to be decided.

To instantiate the module, add this to your service's Puppet module:
```
envoy::service { $YOUR_SERVICE_NAME:
  port      => $YOUR_ENVOY_PORT,
}

```

Embedding in an Existing HTTP Server
=========
If your application already listens on a particular port for HTTP requests, you may choose to have
Envoy handle requests on `/health` rather than having Envoy start a new HTTP server for itself. If
you're just using the default Go server, the following line will correctly set up the handler.
```
http.HandleFunc("/health", envoy.HealthHandler)
```

Do this after calling `envoy.Init`, and omit `ServerConf` from that call so that Envoy does not start its own server.

Envoy is expected to listen on `/health` on whichever port the server listens on, so
if this does not fit your path schema, it is strongly recommended that you use a new server.
Additionally, make sure that Envoy is never publicly accessible, as it exposes
critical information about our infrastructure.
