# Trace Architecture #

## Components ##
Trace is built as a pipeline of six subservices:

 - **[barrel](https://git.xarth.tv/release/villagers/tree/master/cmd/barrel)** is a daemon which listens for Trace events from services, validates them, and ships them to Trace Collectors.
 - **[collectors](https://git.xarth.tv/release/trace/tree/master/cmd/pbcollect)** receive events from Barrel, aggregate them into large blocks, and flush the blocks to Kinesis on the trace-events stream.
 - **[statsdsink](https://git.xarth.tv/release/trace/tree/master/cmd/statsdsink)** reads the Kinesis trace-events stream and emits statsd metrics about the events it receives.
 - **[aggregators](https://git.xarth.tv/release/trace/tree/master/cmd/aggregate)** read from the Kinesis trace-events stream too, but they combine events that share Transaction IDs into large, coherent Transactions. They aggregate Transactions into blocks, and flush the blocks to Kinesis on the trace-transactions stream.
 - **[api](https://git.xarth.tv/release/trace/tree/master/cmd/api)** provides gRPC-based access to the Kinesis trace-transactions stream.
 - **[txreport](https://git.xarth.tv/release/trace/tree/master/cmd/txreport)** reads the trace-transactions stream, and every 30 seconds it writes an HTML report to disk describing the 30-second slice it observed. This is hosted at [trace.internal.justin.tv](https://trace.internal.justin.tv/).
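The aggregators' core step, combining events that share a Transaction ID, can be sketched roughly as below. This is a minimal illustration, not the actual aggregator code: the `Event` and `Transaction` types and the `GroupByTransaction` function are hypothetical names, and real Trace events carry far more than the two fields shown here.

```go
package main

import (
	"fmt"
	"sort"
)

// Event is a hypothetical, simplified stand-in for a Trace event.
type Event struct {
	TransactionID string
	Service       string
}

// Transaction holds every event seen for one Transaction ID.
type Transaction struct {
	ID     string
	Events []Event
}

// GroupByTransaction combines events that share a Transaction ID.
// Transactions are returned sorted by ID so the output is deterministic.
func GroupByTransaction(events []Event) []Transaction {
	byID := make(map[string]*Transaction)
	for _, ev := range events {
		tx, ok := byID[ev.TransactionID]
		if !ok {
			tx = &Transaction{ID: ev.TransactionID}
			byID[ev.TransactionID] = tx
		}
		tx.Events = append(tx.Events, ev)
	}

	out := make([]Transaction, 0, len(byID))
	for _, tx := range byID {
		out = append(out, *tx)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].ID < out[j].ID })
	return out
}

func main() {
	// Made-up events; the IDs and service names are illustrative only.
	events := []Event{
		{TransactionID: "tx-1", Service: "web"},
		{TransactionID: "tx-2", Service: "web"},
		{TransactionID: "tx-1", Service: "db"},
	}
	for _, tx := range GroupByTransaction(events) {
		fmt.Printf("%s: %d events\n", tx.ID, len(tx.Events))
	}
}
```

The sketch groups a fixed slice in memory. A real aggregator reads a continuous stream, so it would also need some policy for deciding when a Transaction is complete enough to flush.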

In terms of dependencies, you can think of it like this:
```
        barrel
          |
          |
      collectors
         / \
        /   \ (via kinesis trace-events)
       /     \
statsdsink   aggregators
                /\
               /  \ (via kinesis trace-transactions)
              /    \
            api   report
```

This should help visualize the way problems flow downhill: If the collectors are broken, we'd expect statsdsink, aggregators, api, and report to all be unhealthy too. If the aggregators are the root cause, we'd expect statsdsink to be fine.
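The same reasoning can be written down as a small graph walk. The sketch below is illustrative only: the `downstream` map is transcribed by hand from the diagram above, and the `Affected` function is a hypothetical helper, not part of Trace.

```go
package main

import (
	"fmt"
	"sort"
)

// downstream maps each service to the services that consume its output,
// transcribed from the dependency diagram.
var downstream = map[string][]string{
	"barrel":      {"collectors"},
	"collectors":  {"statsdsink", "aggregators"},
	"aggregators": {"api", "report"},
}

// Affected returns, in sorted order, every service that sits downstream
// of root and would be expected to look unhealthy if root is broken.
func Affected(root string) []string {
	seen := make(map[string]bool)
	queue := []string{root}
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		for _, next := range downstream[cur] {
			if !seen[next] {
				seen[next] = true
				queue = append(queue, next)
			}
		}
	}

	out := make([]string, 0, len(seen))
	for s := range seen {
		out = append(out, s)
	}
	sort.Strings(out)
	return out
}

func main() {
	fmt.Println(Affected("collectors"))  // [aggregators api report statsdsink]
	fmt.Println(Affected("aggregators")) // [api report]
}
```

When debugging, the useful direction is usually the reverse: given a set of unhealthy services, look for the most upstream one whose downstream set covers all of them.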

## Hosts ##
**Barrel** processes are running on thousands of hosts at Twitch. In Puppet terms, all clusters that have `twitch_barrel` in their hiera YAML (like [Cohesion](https://git.xarth.tv/systems/puppet/blob/568ed0e3daf22fa0cb99b1ffcb0cf65f307a2868/hiera/cluster/cohesion.yaml#L4), for example) have a barrel process running. The barrel processes run directly on the services' machines.

**Collectors** run in the [trace-collect](https://us-west-2.console.aws.amazon.com/ec2/autoscaling/home?region=us-west-2#AutoScalingGroups:id=trace-collect;view=details;filter=trace) Autoscale Group. There's also a [trace-collect-canary](https://us-west-2.console.aws.amazon.com/ec2/autoscaling/home?region=us-west-2#AutoScalingGroups:id=trace-collect-canary;view=details;filter=trace) ASG with exactly one host; it receives production traffic along with the rest of the collectors, but can be deployed to separately to test code. As of 2015-12-08, there are 12 collectors.

**Statsdsink** runs in the [trace-statsdsink-kinesis](https://us-west-2.console.aws.amazon.com/ec2/autoscaling/home?region=us-west-2#AutoScalingGroups:id=trace-statsdsink-kinesis;view=details;filter=trace) ASG. There's a [trace-statsdsink-kinesis-canary](https://us-west-2.console.aws.amazon.com/ec2/autoscaling/home?region=us-west-2#AutoScalingGroups:id=trace-statsdsink-kinesis-canary;view=details;filter=trace) too, which functions just like the collectors' canary. As of 2015-12-08, there are 12 statsdsink hosts.

**Aggregators** run in the [trace-aggregator-kinesis](https://us-west-2.console.aws.amazon.com/ec2/autoscaling/home?region=us-west-2#AutoScalingGroups:id=trace-aggregator-kinesis;view=details;filter=trace) ASG. There's a [trace-aggregator-kinesis-canary](https://us-west-2.console.aws.amazon.com/ec2/autoscaling/home?region=us-west-2#AutoScalingGroups:id=trace-aggregator-kinesis-canary;view=details;filter=trace) too, which functions just like the collectors' canary. As of 2015-12-08, there are 12 aggregator hosts.

**API and Report** run in single-instance ASGs, [trace-api](https://us-west-2.console.aws.amazon.com/ec2/autoscaling/home?region=us-west-2#AutoScalingGroups:id=trace-api;view=details;filter=trace) and [trace-report](https://us-west-2.console.aws.amazon.com/ec2/autoscaling/home?region=us-west-2#AutoScalingGroups:id=trace-report;view=details;filter=trace). The ASG's only function is to make sure that an instance is replaced if it crashes, for example because of a full disk. There are also one-host canaries for these services, just as for the others.

### Working with ASGs ###
Apart from Barrel, which runs directly on the services' machines, each of Trace's services runs in an Autoscale Group. This has the advantage of automatically replacing broken instances: we've seen a few instances die for mysterious cloudy reasons, and a few die because of full disks. The ASG automatically fixes those sorts of problems.

The downside is that the instances never get DNS records, so the only way to SSH onto them is with their actual private IP address. They have `/etc/hostname` set to human-readable values, so when they report stats (like to statsd or ganglia) they'll *appear* to have usable hostnames like trace-collect-da168403.prod.us-west2.justin.tv, but this won't actually resolve if you try to use it.
