# 2015-10-15 Go tip GC performance analysis for TMI edge server

Analysis by Rhys Hiltner, based on previous analyses by Rhys Hiltner, John
Rizzo, and Martin Hess.

## History

GC pause time has been a long-standing concern for the TMI edge server
(code.justin.tv/chat/tmi/irc). At its inception during Go 1.2's development
cycle, the Go TMI edge server was able to serve over 500,000 concurrent users
(each with a TCP connection) using 1,500,000 goroutines, with the only
performance concern being GC pause times (which were in excess of 45 seconds).
The Go edge server took about three weeks to write, followed by two months of
GC-related performance tuning.

The mostly-concurrent incremental garbage collector introduced with Go 1.5
reduces the edge server's GC pause times significantly, but they are far from
the Go team's stated goal of 10ms.

## Overview

With go version `go1.5`, GC pause time is around 200ms with a 1.4GB average
heap. With a recent version of tip [`devel +30b9663 Tue Oct 13 21:06:58 2015
+0000`](https://github.com/golang/go/tree/30b9663/) from the 1.6 development
cycle, GC pause time is around 100ms with a 1.1GB average heap. With the same
development version of Go and stack shrinking disabled via
`GODEBUG=gcshrinkstackoff=1`, GC pause time is around 30-70ms with a 1.2GB
average heap.

In the chart below, the blue and red lines represent the average heap size and
average GC pause times of the TMI edge server when running with a development
version of Go. Before 2015-10-14 18:00 UTC, the process running the
development version had stack shrinking enabled (as is the default behavior).
After that time, stack shrinking was disabled via `GODEBUG=gcshrinkstackoff=1`.

![HeapAlloc-vs-Pause](HeapAlloc-vs-Pause.png)

## Performance analysis

The test process is compiled with go version `devel +30b9663 Tue Oct 13
21:06:58 2015 +0000`. The toolchain is compiled with
`GOEXPERIMENT=framepointer` to allow profiling with Linux's perf_events tools.

The `perf` tool from the perf_events suite is used to periodically collect
stacks of on-CPU threads within the Go process. Commands are similar to the
following:

```
sudo perf record -g -F 977 -p 3095 -o go_30b9663_F977_edge_shrinkoff_3000s-5.data -- sleep 3000
```

- The `-g` flag instructs perf to collect complete call stacks. Compiling the
Go toolchain with `GOEXPERIMENT=framepointer` makes this possible, and lets us
identify every function on the stack, not just the currently executing
function.

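- `-F 977` instructs perf to sample at 977 Hz, recording a stack trace for
every 1/977th of a second that a thread is on-CPU. The frequency is
deliberately not a round number such as 100 or 1000 (977 is prime), so that
the sampling does not fall into lockstep with periodic events and mis-sample
them.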

- `-p 3095` tells perf to record the performance of a single process, pid
3095.

- `-o go_30b9663_F977_edge_shrinkoff_3000s-5.data` sets the name of the output
file.

- `--` marks the end of perf's named arguments. What follows, `sleep 3000`, is
the command to profile. Because we've set a specific pid with the `-p` flag,
perf profiles that process instead of `sleep`, but stops profiling as soon as
the `sleep` command exits.
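
To illustrate why a round sampling frequency can mis-sample periodic work, the following Go sketch simulates a hypothetical task that repeats at 1000 Hz and is on-CPU for the first quarter of each period. The task, its period, and its duty cycle are assumptions invented for this illustration; they describe neither the edge server nor perf's internals.

```go
package main

import "fmt"

const (
	second     = int64(1_000_000_000) // nanoseconds per second
	taskPeriod = int64(1_000_000)     // hypothetical task repeats every 1ms (1000 Hz)
	taskBusy   = int64(250_000)       // ...and is on-CPU for the first 0.25ms of each period
)

// busySamples simulates one second of sampling at freqHz and returns how many
// samples land while the hypothetical periodic task is on-CPU.
func busySamples(freqHz int64) int64 {
	var hits int64
	for k := int64(0); k < freqHz; k++ {
		t := k * second / freqHz // time of the k-th sample, in nanoseconds
		if t%taskPeriod < taskBusy {
			hits++
		}
	}
	return hits
}

func main() {
	for _, f := range []int64{1000, 977} {
		hits := busySamples(f)
		fmt.Printf("%4d Hz: %d/%d samples on-CPU (%.1f%%), true share is 25.0%%\n",
			f, hits, f, 100*float64(hits)/float64(f))
	}
}
```

At 1000 Hz every sample lands at the same point in the task's period, so the task appears to be on-CPU all of the time (with a different phase it would appear to be on-CPU none of the time). At 977 Hz the samples drift through the period and the estimate comes out close to the true 25%.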

Once the data is recorded, it can be converted into a [flame
graph](http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html) for
convenient viewing.

This analysis uses version `182b24fb635345d48c91ed1de58a08b620312f3d` of
Brendan Gregg's [Perl-based FlameGraph
code](https://github.com/brendangregg/FlameGraph), with some changes to
support Go function names (which include parentheses when referring to
methods).

```
diff --git a/flamegraph.pl b/flamegraph.pl
index bf00a04..c4a6334 100755
--- a/flamegraph.pl
+++ b/flamegraph.pl
@@ -656,7 +656,7 @@ my $inc = <<INC;
                var r = find_child(e, "rect");
                var t = find_child(e, "text");
                var w = parseFloat(r.attributes["width"].value) -3;
-               var txt = find_child(e, "title").textContent.replace(/\\([^(]*\\)/,"");
+               var txt = find_child(e, "title").textContent.replace(/\\([^(]*\\)\$/,"");
                t.attributes["x"].value = parseFloat(r.attributes["x"].value) +3;
                
                // Smaller than this size won't fit anything
diff --git a/stackcollapse-perf.pl b/stackcollapse-perf.pl
index 5e8f9e2..f3b2e0e 100755
--- a/stackcollapse-perf.pl
+++ b/stackcollapse-perf.pl
@@ -212,7 +212,7 @@ foreach (<>) {
                if ($tidy_generic) {
                        $func =~ s/;/:/g;
                        $func =~ tr/<>//d;
-                       $func =~ s/\(.*//;
+                       $func =~ s/[^\.]\(.*//;
                        # now tidy this horrible thing:
                        # 13a80b608e0a RegExp:[&<>\"\'] (/tmp/perf-7539.map)
                        $func =~ tr/"\'//d;
```

The Go 1.5 GC has two Stop-The-World pauses: one to make sure the sweep from
the previous collection is complete and to enable the write barrier, and
another to terminate the mark phase. The bulk of the pause time is in the mark
termination phase, so the analysis will focus on that phase. In a profile, the
mark termination phase can be identified by the presence of `runtime.gcMark`
on the stack.

Converting kernel addresses to function names requires a copy of the exact
kernel in use with debug symbols still intact.

```
curl -O 'http://ddebs.ubuntu.com/pool/main/l/linux-lts-trusty/linux-image-3.13.0-52-generic-dbgsym_3.13.0-52.86~precise1_amd64.ddeb'
dpkg -x linux-image-3.13.0-52-generic-dbgsym_3.13.0-52.86~precise1_amd64.ddeb linux-image-3.13.0-52-generic-dbgsym_3.13.0-52.86~precise1_amd64
```

Converting the recorded performance data into a flame graph is a multi-step
process. `perf script` will dump the data to stdout in a textual format. Its
`-k` flag tells it the path to the copy of the kernel with debug symbols. The
`-i` flag specifies the input file name. The Go program we're profiling has
debug symbols built in (as Go binaries do by default), so no extra step is
needed to resolve its symbols.

The `FlameGraph/stackcollapse-perf.pl` program converts the samples that `perf
script` outputs as multiline blocks into single lines of semicolon-delimited
function names, each followed by the number of samples that share that stack.
It uniquifies stacks by function name, without considering the exact program
counter address. Be warned: *it can use a ton of memory*.
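
The folding step can be sketched in Go as follows. This is only an illustration of the idea, not a port of `stackcollapse-perf.pl`: it assumes a simplified input with one bare function name per line, innermost frame first, and a blank line between samples, whereas real `perf script` output carries more fields per line. The sample data in `main` is made up.

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strings"
)

// collapse folds multiline sample blocks into "root;...;leaf" keys and counts
// how many samples share each key. It assumes a simplified input: one function
// name per line, innermost frame first, with a blank line between samples.
func collapse(input string) map[string]int {
	counts := make(map[string]int)
	var frames []string
	flush := func() {
		if len(frames) == 0 {
			return
		}
		// Reverse in place so the outermost caller comes first.
		for i, j := 0, len(frames)-1; i < j; i, j = i+1, j-1 {
			frames[i], frames[j] = frames[j], frames[i]
		}
		counts[strings.Join(frames, ";")]++
		frames = frames[:0]
	}
	sc := bufio.NewScanner(strings.NewReader(input))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			flush() // a blank line ends the current sample
			continue
		}
		frames = append(frames, line)
	}
	flush() // the last sample may not be followed by a blank line
	return counts
}

func main() {
	// Made-up samples for illustration; not real profile data.
	input := `runtime.shrinkstack
runtime.markroot
runtime.parfordo
runtime.gcMark

runtime.shrinkstack
runtime.markroot
runtime.parfordo
runtime.gcMark

runtime.freeStackSpans
runtime.gcMark
`
	counts := collapse(input)
	keys := make([]string, 0, len(counts))
	for k := range counts {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Println(k, counts[k])
	}
}
```

Because every distinct stack becomes a map key held in memory until the end of the input, a long recording with many unique stacks makes the memory warning above easy to believe.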

The recorded performance data includes everything that the program was doing
during the measurement time, but we're only interested in the CPU usage during
the mark termination phase. To that end, we filter out stacks that don't
include the `runtime.gcMark` function and then build the flame graph svg.

```
export file="go_30b9663_F977_edge_shrinkoff_3000s-5"
perf script -k linux-image-3.13.0-52-generic-dbgsym_3.13.0-52.86~precise1_amd64/usr/lib/debug/boot/vmlinux-3.13.0-52-generic -i "$file.data" | FlameGraph/stackcollapse-perf.pl > "$file.collapsed"
cat "$file.collapsed" | grep ';runtime\.gcMark[; ]' | FlameGraph/flamegraph.pl > "$file.gcMark.svg"
```
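
The `grep` pattern is anchored on both sides of the function name so that only an exact `runtime.gcMark` frame matches. The Go sketch below applies the same regular expression to a few made-up collapsed lines; the `edge` process name, the `runtime.gcMarkExample` function, and the sample counts are hypothetical and exist only to show which lines pass the filter.

```go
package main

import (
	"fmt"
	"regexp"
)

// gcMarkFrame mirrors the grep pattern above: the frame must be preceded by
// ';' and followed by either ';' (a deeper frame) or ' ' (the sample count).
var gcMarkFrame = regexp.MustCompile(`;runtime\.gcMark[; ]`)

// keep reports whether a collapsed stack line contains a runtime.gcMark frame.
func keep(line string) bool {
	return gcMarkFrame.MatchString(line)
}

func main() {
	// Hypothetical collapsed lines; names and counts are made up.
	lines := []string{
		"edge;runtime.gcMark;runtime.parfordo;runtime.markroot 12",
		"edge;runtime.gcMark 3",
		"edge;runtime.gcMarkExample 5",
		"edge;main.handle 40",
	}
	for _, l := range lines {
		fmt.Printf("%-5v %s\n", keep(l), l)
	}
}
```

The trailing `[; ]` is what keeps a longer function name that merely starts with `runtime.gcMark` out of the flame graph.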

### Running the development version without tuning

The first experiment was to run the TMI edge server in a darklaunch
configuration without any special tuning. The Go environment variables
included `GOMAXPROCS=18` (the host has 40 logical cores), and
`GODEBUG=gctrace=1`.

![Mark termination of untuned runtime](go_30b9663_F977_edge_3000s-1.gcMark.png)

Most of the time that `runtime.gcMark` is on the stack, it is calling
`runtime.parfordo`. This means that the work is split among `GOMAXPROCS`
threads. To the left of `runtime.parfordo`, we see `runtime.gcMark` calling
`runtime.freeStackSpans` in 6.57% (63/959) of samples (the text is truncated
in the image; download and view the SVG to see for yourself). This call
happens on a single thread, so it likely accounts for much more than 6.57% of
the wall clock time.
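
As a rough illustration of how far the wall clock share can diverge from the CPU share, the Go sketch below assumes that every sample outside `runtime.freeStackSpans` is spread evenly across 18 threads (the `GOMAXPROCS` value used in this experiment). That perfect-parallelism assumption is an idealization made for the example, not something the profile establishes, so the result is closer to an upper bound than to a measurement.

```go
package main

import "fmt"

// serialWallShare estimates the share of wall-clock time taken by a
// single-threaded section, given its sample count, the total sample count,
// and the number of threads assumed to split the remaining samples evenly.
func serialWallShare(serial, total, procs int) float64 {
	parallelWall := float64(total-serial) / float64(procs)
	return float64(serial) / (float64(serial) + parallelWall)
}

func main() {
	const serial, total, procs = 63, 959, 18
	fmt.Printf("share of CPU samples:      %.2f%%\n", 100*float64(serial)/float64(total))
	fmt.Printf("idealized wall-clock share: %.1f%%\n", 100*serialWallShare(serial, total, procs))
}
```

Under that assumption, the 63 serial samples would make up more than half of the pause, even though they are under 7% of the CPU samples.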

Under `runtime.markroot`, we see 10.74% of CPU time (103/959 samples) is spent
in `runtime.scang` and 75.29% (722/959 samples) is spent in
`runtime.shrinkstack`. When a goroutine's stack requirements grow, it is
assigned a larger stack and all of its data is copied over. Pointers to the
stack (which can only occur within the stack itself) are rewritten. The
goroutine continues running on the larger stack even if its requirements
shrink, until a garbage collection happens, at which point it may be migrated
to a smaller stack. This migration currently happens during the mark
termination phase and, with the large number of goroutines that the edge
server uses, can take a huge amount of time!

`runtime.gcRemoveStackBarriers` is on the stack in 69/959 samples, with
`runtime.gcRemoveStackBarrier` in 39/959 samples.

The 722/959 samples in `runtime.shrinkstack` include 34 in
`runtime.adjustdefers`, 22 in `runtime.adjustsudogs`, and 165 in
`runtime.gentraceback` (with 81 of those in `runtime.adjustframe`).

The rest of `runtime.shrinkstack` is spent dealing with locks. 250 samples
lead to `runtime.stackalloc`, of which 102 lead to `runtime.lock` and 89 to
`runtime.unlock`. 216 samples lead to `runtime.stackfree`, of which 82 lead to
`runtime.lock` and 65 to `runtime.unlock`.

All in all, `runtime.copystack` (called by `runtime.shrinkstack`) spends
nearly half of its time dealing with locks: 338 of its 695 samples
(102 + 89 + 82 + 65).

We can disable stack shrinking at runtime by starting our program with
`GODEBUG=gcshrinkstackoff=1`. When a goroutine's stack grows, it will remain
at the larger size for the lifetime of the goroutine.
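
The grow-then-shrink behavior can be observed from user code with `runtime.MemStats.StackInuse`. The Go sketch below is an assumption-laden demonstration rather than a reproduction of the edge server: the goroutine count, recursion depth, and frame padding are arbitrary, and the point in the GC cycle at which shrinking happens has changed across Go releases, so the numbers it prints depend on the runtime version in use.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// grow recurses depth times with a 1KB local array per frame, so the
// goroutine's stack has to grow well beyond its initial size.
func grow(depth int) int {
	var pad [1024]byte
	pad[depth%len(pad)] = 1
	if depth == 0 {
		return int(pad[0])
	}
	return grow(depth-1) + int(pad[(depth*7)%len(pad)])
}

// stackInuse reports the bytes currently held in stack spans.
func stackInuse() uint64 {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.StackInuse
}

// measure starts n goroutines that each grow their stack and then block. It
// returns StackInuse before they start, after they have all grown, and after
// two forced collections while they are still blocked.
func measure(n, depth int) (before, grown, afterGC uint64) {
	before = stackInuse()
	var ready, done sync.WaitGroup
	release := make(chan struct{})
	for i := 0; i < n; i++ {
		ready.Add(1)
		done.Add(1)
		go func() {
			defer done.Done()
			grow(depth)
			ready.Done()
			<-release // stay alive, but idle, on the grown stack
		}()
	}
	ready.Wait()
	grown = stackInuse()
	runtime.GC()
	runtime.GC()
	afterGC = stackInuse()
	close(release)
	done.Wait()
	return before, grown, afterGC
}

func main() {
	before, grown, afterGC := measure(1000, 64)
	fmt.Printf("StackInuse before: %d KB\n", before/1024)
	fmt.Printf("StackInuse grown:  %d KB\n", grown/1024)
	fmt.Printf("StackInuse postGC: %d KB\n", afterGC/1024)
}
```

Running the program once normally and once with `GODEBUG=gcshrinkstackoff=1` should show the difference: with shrinking disabled, the post-GC figure is expected to stay close to the grown figure for as long as the goroutines live.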

The great news so far is that, unlike in go1.5, the mark termination phase no
longer includes cycles spent on finalizers.

### Running with stack shrinking disabled

When running with `GOMAXPROCS=18` and `GODEBUG=gctrace=1,gcshrinkstackoff=1`,
we see another improvement in mark termination pause time.

The large CPU time contributors present with this configuration are as
follows. The sample counts indicate how often the named function was on the
stack, not necessarily at the top of the stack:

- `runtime.gcMark -> runtime.freeStackSpans`

This code is not parallelized, and [doesn't even need to happen during the STW
phase](https://github.com/golang/go/blob/30b966307f475b1445816308f8cb2c5813b38232/src/runtime/mgc.go#L1521-L1525).

In the five datasets collected, this accounts for (8/431, 71/872, 40/479, 8/981, 9/889) of CPU cycles.

- `runtime.gcMark -> runtime.parfordo -> runtime.markroot -> runtime.atomicload`

There's an inlined call to `runtime.readgstatus` made by `runtime.markroot`.
The markroot function is used during the concurrent scan phase and during the
STW mark termination phase. Is (costly) atomic memory access required while
the world is stopped?

In the five datasets collected, this accounts for (55/431, 76/872, 89/479, 62/981, 67/889) of CPU cycles.

- `runtime.gcMark -> runtime.parfordo -> runtime.markroot -> runtime.scang -> runtime.scanstack -> runtime.gcRemoveStackBarriers`

Most of the CPU time in `runtime.scanstack` is spent removing stack barriers.
Do the stack barriers need to be removed while the world is stopped, or could
they be removed by the individual goroutines as they resume execution after
the collection completes?

In the five datasets collected, this accounts for (167/431, 507/872, 192/479, 626/981, 548/889) of CPU cycles.

- `runtime.gcMark -> runtime.parfordo -> runtime.markroot -> runtime.scang -> runtime.scanstack -> runtime.gcRemoveStackBarriers -> kernel page_fault`

Page faults don't make a big appearance when stack shrinking is allowed, but
are a huge (and highly variable) contributor to mark termination time when
stack shrinking is disabled.

In the five datasets collected, this accounts for (42/431, 346/872, 104/479, 425/981, 387/889) of CPU cycles.

- `runtime.gcMark -> runtime.parfordo -> runtime.markroot -> runtime.scang -> runtime.scanstack -> runtime.gcRemoveStackBarriers -> runtime.gcRemoveStackBarrier`

Removing each individual stack barrier doesn't cost much compared to the
expense of iterating the list of a goroutine's stack barriers (and handling
the resulting page faults).

In the five datasets collected, this accounts for (43/431, 33/872, 24/479, 39/981, 24/889) of CPU cycles.
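
The raw fractions are easier to compare as percentages. The Go sketch below converts the counts listed above, taken verbatim from this section, and prints the per-dataset share along with the lowest and highest value for each contributor. The type and function names are illustrative only.

```go
package main

import "fmt"

// share is one "samples / total" pair copied from the lists above.
type share struct{ n, total int }

func (s share) percent() float64 { return 100 * float64(s.n) / float64(s.total) }

// spread returns the smallest and largest percentage in a non-empty row.
func spread(row []share) (lo, hi float64) {
	lo, hi = row[0].percent(), row[0].percent()
	for _, s := range row[1:] {
		p := s.percent()
		if p < lo {
			lo = p
		}
		if p > hi {
			hi = p
		}
	}
	return lo, hi
}

func main() {
	rows := []struct {
		name string
		data []share
	}{
		{"freeStackSpans", []share{{8, 431}, {71, 872}, {40, 479}, {8, 981}, {9, 889}}},
		{"atomicload", []share{{55, 431}, {76, 872}, {89, 479}, {62, 981}, {67, 889}}},
		{"gcRemoveStackBarriers", []share{{167, 431}, {507, 872}, {192, 479}, {626, 981}, {548, 889}}},
		{"kernel page_fault", []share{{42, 431}, {346, 872}, {104, 479}, {425, 981}, {387, 889}}},
		{"gcRemoveStackBarrier", []share{{43, 431}, {33, 872}, {24, 479}, {39, 981}, {24, 889}}},
	}
	for _, r := range rows {
		fmt.Printf("%-22s", r.name)
		for _, s := range r.data {
			fmt.Printf(" %5.1f%%", s.percent())
		}
		lo, hi := spread(r.data)
		fmt.Printf("   (min %.1f%%, max %.1f%%)\n", lo, hi)
	}
}
```

Laid out this way, the variability of the page fault contribution stands out: it ranges from under 10% of samples in one dataset to over 40% in others.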

![Mark termination of gcshrinkstackoff=1 runtime](go_30b9663_F977_edge_shrinkoff_3000s-1.gcMark.png)
![Mark termination of gcshrinkstackoff=1 runtime](go_30b9663_F977_edge_shrinkoff_3000s-2.gcMark.png)
![Mark termination of gcshrinkstackoff=1 runtime](go_30b9663_F977_edge_shrinkoff_3000s-3.gcMark.png)
![Mark termination of gcshrinkstackoff=1 runtime](go_30b9663_F977_edge_shrinkoff_3000s-4.gcMark.png)
![Mark termination of gcshrinkstackoff=1 runtime](go_30b9663_F977_edge_shrinkoff_3000s-5.gcMark.png)
