# Monitoring services

Setting up monitoring for new services is essential to ensure their reliability and notify on-call when something goes wrong.

We monitor our services primarily with nagios. You can browse the nagios web interface [here](http://nagios-clean.internal.justin.tv).

## How to add nagios checks

Nagios checks are in [ops/puppet/modules/twitch_nagios/files](https://git.xarth.tv/ops/puppet/tree/master/modules/twitch_nagios/files).

In `clean/objects/hostgroup.cfg`, add your hostgroup.
```
define hostgroup {
    hostgroup_name    foobar
    alias             foobar
    register          1
}
```

In `clean/objects/hosts/`, describe the hosts that your service runs on. The `hostgroups` field must match the `hostgroup_name` you defined above.
```
define host {
    host_name                       foobar-1.sfo01.justin.tv
    address                         foobar-1.sfo01.justin.tv
    _graphite_address               foobar-1_sfo01_justin_tv
    _ipmi_address                   foobar-1-ipmi.sfo01.justin.tv
    hostgroups                      foobar
    use                             linux-server
}
define host {
    host_name                       foobar-2.sfo01.justin.tv
    address                         foobar-2.sfo01.justin.tv
    _graphite_address               foobar-2_sfo01_justin_tv
    _ipmi_address                   foobar-2-ipmi.sfo01.justin.tv
    hostgroups                      foobar
    use                             linux-server
}
```
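
Both host definitions above follow the same pattern: `_graphite_address` is the hostname with dots replaced by underscores, and `_ipmi_address` adds `-ipmi` to the short hostname. If you have many hosts, a small script can generate the stanzas. The following is a minimal sketch that assumes this pattern holds for your hosts; `HOST_TEMPLATE` and `render_host` are hypothetical names, not something that exists in the puppet repo.

```python
# Sketch only: generates host stanzas in the same shape as the examples above.
# HOST_TEMPLATE and render_host are illustrative names, not part of the repo.
HOST_TEMPLATE = """define host {{
    host_name                       {fqdn}
    address                         {fqdn}
    _graphite_address               {graphite}
    _ipmi_address                   {ipmi}
    hostgroups                      {hostgroup}
    use                             linux-server
}}"""


def render_host(fqdn, hostgroup):
    """Return a nagios host definition for one fully qualified hostname."""
    short, _, domain = fqdn.partition(".")
    return HOST_TEMPLATE.format(
        fqdn=fqdn,
        graphite=fqdn.replace(".", "_"),           # dots become underscores
        ipmi="{}-ipmi.{}".format(short, domain),   # "-ipmi" after the short name
        hostgroup=hostgroup,
    )


if __name__ == "__main__":
    for i in (1, 2):
        print(render_host("foobar-{}.sfo01.justin.tv".format(i), "foobar"))
```

Review the generated output before committing it; hosts that don't follow the naming pattern still need to be written by hand.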

In `clean/objects/servicegroups/`, define groups for the nagios checks you want to add.
```
define servicegroup {
    servicegroup_name               foobar load
    alias                           foobar load
    register                        1
}
define servicegroup {
    servicegroup_name               foobar disk_space
    alias                           foobar disk_space
    register                        1
}
define servicegroup {
    servicegroup_name               foobar memory
    alias                           foobar memory
    register                        1
}
define servicegroup {
    servicegroup_name               foobar health
    alias                           foobar health
    register                        1
}
define servicegroup {
    servicegroup_name               foobar cron
    alias                           foobar cron
    register                        1
}
```

In `clean/objects/services/`, define the nagios checks. The `servicegroups` field is optional; only set it if the servicegroup it names is defined in `clean/objects/servicegroups/`.
```
define service {
    hostgroup_name                  foobar
    service_description             load
    servicegroups                   foobar load
    use                             generic-service
    check_command                   check_nrpe!check_load
    register                        1
}
define service {
    hostgroup_name                  foobar
    service_description             disk space
    servicegroups                   foobar disk_space
    use                             generic-service
    check_command                   check_nrpe!check_hda1
    register                        1
}
define service {
    hostgroup_name                  foobar
    service_description             bond
    servicegroups                   bond
    use                             generic-service
    check_command                   check_nrpe!check_bond
    register                        1
}
define service {
    hostgroup_name                  foobar
    service_description             raid
    servicegroups                   raid
    use                             email_only
    notification_period             systems_hours
    check_command                   check_nrpe!check_raid
    register                        1
}
define service {
    hostgroup_name                  foobar
    service_description             memory
    servicegroups                   foobar memory
    use                             generic-service
    check_command                   check_nrpe!check_mem
    register                        1
}
define service {
    hostgroup_name                  foobar
    service_description             foobar health
    servicegroups                   foobar health
    use                             generic-service
    check_command                   check_http!-u /debug/health -p 8080
    contact_groups                  site_infrastructure
    register                        1
}
define service {
    hostgroup_name                  foobar
    service_description             consul
    servicegroups                   consul
    use                             generic-service
    check_command                   consul-alert-host
    contact_groups                  releng_email_only
    register                        1
}
```
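
A `servicegroups` value that doesn't match any defined servicegroup is an easy mistake to make. One way to catch it before pushing is to cross-check the service and servicegroup definitions. The sketch below is a hypothetical helper, not an existing tool in the repo. It assumes simple `define ... { ... }` blocks with no braces inside values, and it doesn't replace the validation that nagios itself performs.

```python
import re

# Matches "define <type> { <body> }"; assumes no braces inside the body.
BLOCK_RE = re.compile(r"define\s+(\w+)\s*\{(.*?)\}", re.DOTALL)


def parse_objects(text):
    """Yield (object_type, {directive: value}) for each define block."""
    for kind, body in BLOCK_RE.findall(text):
        fields = {}
        for line in body.splitlines():
            line = line.split(";", 1)[0].strip()  # drop ';' comments
            if not line or line.startswith("#"):
                continue
            parts = line.split(None, 1)
            fields[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
        yield kind, fields


def undefined_servicegroups(text):
    """Return the sorted servicegroup names that services use but nothing defines."""
    defined, used = set(), set()
    for kind, fields in parse_objects(text):
        if kind == "servicegroup":
            defined.add(fields.get("servicegroup_name", ""))
        elif kind == "service":
            # The servicegroups directive is a comma-separated list.
            for name in fields.get("servicegroups", "").split(","):
                if name.strip():
                    used.add(name.strip())
    return sorted(used - defined)
```

To use it, pass in the concatenated contents of your servicegroup and service files. Run against only the example definitions in this document, it would report `bond`, `consul`, and `raid`, which are presumably defined elsewhere in `clean/objects/servicegroups/`.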

## Custom checks

You can write custom checks in Python.

A nagios check is a standalone script. Its exit code determines the status of the check (0 is OK, 1 is WARNING, 2 is CRITICAL, 3 is UNKNOWN), and its output is displayed to the user in nagios.

Here is a simple nagios check:
```
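#!/usr/bin/env python
from __future__ import print_function

import sys

import requests

# Nagios plugin exit codes
OK = 0
WARN = 1
CRIT = 2

URL = "http://localhost:8080/debug/health"


def main():
    try:
        # A timeout keeps the check from hanging if the service is unresponsive.
        r = requests.get(URL, timeout=10)
    except requests.exceptions.RequestException as e:
        print("couldn't make request to /debug/health:", e)
        sys.exit(CRIT)

    if r.status_code != 200:
        print("/debug/health returned", r.status_code)
        sys.exit(CRIT)

    # Always print something so the nagios UI has a status line to show.
    print("/debug/health returned 200")
    sys.exit(OK)


if __name__ == "__main__":
    main()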
```
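
The check above defines `WARN` but only ever exits with `OK` or `CRIT`. If you want a warning state, for example when the health endpoint responds slowly, the decision logic could look like the sketch below. The thresholds and the `classify` function are illustrative assumptions, not values used by any existing check; the request itself would be made as in the check above.

```python
# Nagios plugin exit codes
OK = 0
WARN = 1
CRIT = 2


def classify(status_code, elapsed, warn_after=1.0, crit_after=5.0):
    """Map an HTTP status and response time in seconds to (exit_code, message).

    warn_after and crit_after are example thresholds; pick values that
    make sense for your service.
    """
    if status_code != 200:
        return CRIT, "CRITICAL: /debug/health returned {}".format(status_code)
    if elapsed >= crit_after:
        return CRIT, "CRITICAL: /debug/health took {:.2f}s".format(elapsed)
    if elapsed >= warn_after:
        return WARN, "WARNING: /debug/health took {:.2f}s".format(elapsed)
    return OK, "OK: /debug/health answered in {:.2f}s".format(elapsed)
```

In a real check you would time the request, call `classify(r.status_code, elapsed)`, print the message, and exit with the returned code. Keeping the decision in a pure function makes it easy to test without a running service.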

Once you've written your check, add it to puppet at `monitor_import/foo_check.py`.
Then define a command that runs it in `clean/objects/commands.cfg`:

```
define command{
        command_name    foo_check
        command_line    $USER2$/foo_check.py
}
```

## Testing nagios checks

NOTE: A development nagios cluster is in the works, so this process is subject to change.

Push your changes to your puppet branch, then run puppet against that branch on the nagios host:

```
ssh nagios-clean.internal.justin.tv

sudo puppet agent --test --environment=<your branch>
```

The puppet run will fail if your nagios configuration is invalid.

Once puppet is finished, you should be able to see the checks in the nagios web interface.

## Finishing up

Open a PR for your puppet branch and mention Brandon Williams, the nagios master.
