[![GoDoc](http://godoc.internal.justin.tv/code.justin.tv/release/trace?status.svg)](http://godoc.internal.justin.tv/code.justin.tv/release/trace)


[Trace](https://trace.internal.justin.tv/)
===

Overview
---
Trace builds cross-service callgraphs.  It observes relationships between
internal services as they are used to satisfy user requests.  As a
public-facing service gets help from other internal services or databases to
construct a response for the user, each interaction is noted and attributed to
the initial user request.

It uses a unique Transaction ID for each incoming user request, which is
propagated through to each internal service that does work as a result of the
user's request.  Services emit Event records which are collected for analysis.
A set of Events forms a callgraph for the involved distributed system, in the
form of a tree of remote procedure calls.  Each Event includes the Transaction
ID identifying the tree and a Path to the node it describes within the tree.
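
As a rough illustration of how a set of Events can be assembled into a tree,
the sketch below groups events by Transaction ID and places each one at the
node named by its Path.  The `Event` and `Node` types here are hypothetical
simplifications, and representing a Path as a slice of child indices is an
assumption made for the example; the real message layout is defined in the
protobuf description file.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Event is a hypothetical, simplified stand-in for the real protobuf message.
type Event struct {
	TransactionID string   // identifies the tree this event belongs to
	Path          []uint32 // assumed encoding: child indices from the root
	Service       string   // name of the service that observed the event
}

// Node is one remote procedure call in the reconstructed callgraph.
type Node struct {
	Service  string
	Children map[uint32]*Node
}

// BuildTree places every event of the given transaction at the node
// addressed by its Path, creating intermediate nodes as needed.
func BuildTree(txid string, events []Event) *Node {
	root := &Node{Children: map[uint32]*Node{}}
	for _, ev := range events {
		if ev.TransactionID != txid {
			continue // belongs to a different tree
		}
		n := root
		for _, idx := range ev.Path {
			child, ok := n.Children[idx]
			if !ok {
				child = &Node{Children: map[uint32]*Node{}}
				n.Children[idx] = child
			}
			n = child
		}
		n.Service = ev.Service
	}
	return root
}

// Render returns an indented listing of the tree, children in index order.
func Render(n *Node, depth int) string {
	var b strings.Builder
	fmt.Fprintf(&b, "%s%s\n", strings.Repeat("  ", depth), n.Service)
	keys := make([]uint32, 0, len(n.Children))
	for k := range n.Children {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool { return keys[i] < keys[j] })
	for _, k := range keys {
		b.WriteString(Render(n.Children[k], depth+1))
	}
	return b.String()
}

func main() {
	// Events may arrive in any order; the Path alone determines placement.
	events := []Event{
		{"tx1", []uint32{0, 1}, "db"},
		{"tx1", nil, "web"},
		{"tx1", []uint32{0}, "api"},
		{"tx2", nil, "other"},
	}
	fmt.Print(Render(BuildTree("tx1", events), 0))
}
```

Because placement depends only on the Transaction ID and Path, events can be
collected out of order and from many hosts before the tree is assembled.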

Each Event also includes additional information, such as the Kind of event
observed, the current time, and the name of the service, hostname, and process
id where the event was observed.  Events may also include information specific
to the RPC, such as the identity of the network peer, the HTTP method and
status code, or the normalized SQL query executed.  Event message format and
semantics are thoroughly documented in the [protobuf description
file](pbmsg/event.proto).


Architecture and Operations
---

![Data flow diagram](https://git-aws.internal.justin.tv/release/trace/wiki/static/dataflo2.jpg)

Trace's data pipeline is designed with three goals in mind:
- minimal impact on instrumented services
- scalability to millions of events per second without resorting to sampling
- security of data throughout the pipeline, primarily in terms of confidentiality

We achieve these, respectively, by:
- accepting data over UDP via a host-local process
- dividing collection and aggregation of data into separate steps, shuttling the data around with [Kafka](http://kafka.apache.org/)
- running most of the pipeline in a locked-down AWS security group, restricting SSH access to the hosts, and only presenting data that might contain user information to those with a need to know

Instrumented processes emit Trace events over UDP to a local process,
[code.justin.tv/release/villagers/cmd/barrel](https://git-aws.internal.justin.tv/release/villagers/tree/master/cmd/barrel),
listening on localhost:8943. Barrel forwards the events to the collection
processes, [cmd/pbcollect](cmd/pbcollect), which it discovers
[via consul](http://consul.internal.justin.tv/ui/dist/#/us-west2/services/code.justin.tv/release/trace/cmd/pbcollect?filter=col).

The collection processes send events to the trace-events Kafka topic. This
topic is consumed by [cmd/statsdsink](cmd/statsdsink), which sends
server-observed request durations to statsd for creating
[per-repo service dashboards](http://code.justin.tv/video/usher). The
trace-events topic is also consumed by [cmd/aggregate](cmd/aggregate), which
sends complete transactions to the trace-transactions topic after a short
delay.

The trace-transactions Kafka topic is consumed by the
[Trace gRPC API server](https://godoc.internal.justin.tv/code.justin.tv/release/trace/pbmsg#TraceClient),
[cmd/api](cmd/api).

[Ganglia host list](https://ganglia-ec2.internal.justin.tv/?r=hour&cs=&ce=&c=unclassified&h=&tab=m&vn=&hide-hf=false&m=load_one&sh=2&z=small&hc=0&host_regex=%5Etrace-&max_graphs=0&s=by+name)

[Kafka dashboard](http://grafana.prod.us-west2.justin.tv/dashboard/db/trace-kafka)

[Kafka consumer group dashboard](http://grafana.prod.us-west2.justin.tv/dashboard/db/trace-kafka-consumer-group-lag)

[Dashboard for cmd/pbcollect](http://grafana.prod.us-west2.justin.tv/dashboard/db/trace-collector)

[Dashboard for cmd/statsdsink](http://grafana.prod.us-west2.justin.tv/dashboard/db/trace-statsdsink)

[Dashboard for cmd/aggregate](http://grafana.prod.us-west2.justin.tv/dashboard/db/trace-aggregator)
