May 28, 2021 | 04:50

Go Performance Tools Cheat Sheet

Go has a lot of tools available for you to understand where your application might be spending CPU time or allocating memory. I don’t use these tools daily so I always end up searching for the same thing every time. This post aims to be a reference document for everything that Go has to provide.

We’ll be using https://gitlab.com/steve-blog/go-performance-tools-cheat-sheet as a demo project and there are 3 implementations of the same thing, one more performant than the other.

Benchmarks

One of the most popular ways to see if you improved something is to use Benchmarks which is built into Go.

In our demo project, there is already benchmarks avaiable and we can run them with a single command.

go test -bench=. -test.benchmem  ./rand/

goos: darwin
goarch: amd64
pkg: gitlab.com/steve-blog/blog/go-performance-tools-cheat-sheet/rand
cpu: Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz
BenchmarkHitCount100-8              3020            367016 ns/op          269861 B/op       3600 allocs/op
BenchmarkHitCount1000-8              326           3737517 ns/op         2696308 B/op      36005 allocs/op
BenchmarkHitCount100000-8              3         370797178 ns/op        269406189 B/op   3600563 allocs/op
BenchmarkHitCount1000000-8             1        3857843580 ns/op        2697160640 B/op 36006111 allocs/op
PASS
ok      gitlab.com/steve-blog/blog/go-performance-tools-cheat-sheet/rand 8.828s

Note: -test.benchmem is an optional flag to show memory allocations

Taking a closer look at what each column means:

BenchmarkHitCount100-8              3020            367016 ns/op             269861 B/op               3600 allocs/op
^------------------^ ^                ^                   ^                      ^                          ^
         |           |                |                   |                      |                          |
       Name   Number of CPUs     Total runs    Nanoseconds per operation   Bytes per operation   Allocations per operations

Comparing Benchmarks

Go created perf which provides benchstat so that you can compare to benchmark outputs together and it will give you the delta between them.

For example, let’s compare the main and best branches.

# Run benchmarks on `main`
git checkout main
go test -bench=. -test.benchmem -count=5 ./rand/ > old.txt

# Run benchmarks on `best
git checkout best
go test -bench=. -test.benchmem -count=5 ./rand/ > new.txt

# Compare the two benchmark results
benchstat old.txt new.txt
name               old time/op    new time/op    delta
HitCount100-8         366µs ± 0%     103µs ± 0%  -71.89%  (p=0.008 n=5+5)
HitCount1000-8       3.66ms ± 0%    1.06ms ± 5%  -71.13%  (p=0.008 n=5+5)
HitCount100000-8      367ms ± 0%     104ms ± 1%  -71.70%  (p=0.008 n=5+5)
HitCount1000000-8     3.66s ± 0%     1.03s ± 1%  -71.84%  (p=0.016 n=4+5)

name               old alloc/op   new alloc/op   delta
HitCount100-8         270kB ± 0%      53kB ± 0%  -80.36%  (p=0.008 n=5+5)
HitCount1000-8       2.70MB ± 0%    0.53MB ± 0%  -80.39%  (p=0.008 n=5+5)
HitCount100000-8      270MB ± 0%      53MB ± 0%  -80.38%  (p=0.008 n=5+5)
HitCount1000000-8    2.70GB ± 0%    0.53GB ± 0%  -80.39%  (p=0.016 n=4+5)

name               old allocs/op  new allocs/op  delta
HitCount100-8         3.60k ± 0%     1.50k ± 0%  -58.33%  (p=0.008 n=5+5)
HitCount1000-8        36.0k ± 0%     15.0k ± 0%  -58.34%  (p=0.008 n=5+5)
HitCount100000-8      3.60M ± 0%     1.50M ± 0%  -58.34%  (p=0.008 n=5+5)
HitCount1000000-8     36.0M ± 0%     15.0M ± 0%  -58.34%  (p=0.008 n=5+5)

Notice that we pass the -count flag to run the benchmarks multiple times so it can get the mean of the runs.

pprof

Go comes with its own profiler where it will give you a better understanding of where the CPU time is being spent on or where the application is allocatingthe memory. Go samples these over some time for example it will look at the CPU/Memory usage every X nanoseconds for X amount of seconds.

Generating Profiles

Benchmarks

You can generate profiles using benchmarks that we have in the demo project.

CPU:

go test -bench=. -cpuprofile cpu.prof ./rand/

Memory:

go test -bench=. -memprofile mem.prof ./rand/

net/http/pprof package

If you are writing a webserver you can import the net/http/pprof and it will expose the /debug/pprof HTTP endpoints on the DefaultServeMux as we are doing in the demo application.

Make sure that your application is not sitting idle and doing work or receiving requests so that the profiler can sample calls, or you might end up with an empty profile since the application is idle.

# CPU profile
curl http://127.0.0.1:8080/debug/pprof/profile > /tmp/cpu.prof
# Heap profile
curl http://127.0.0.1:8080/debug/pprof/heap > /tmp/heap.prof
# Allocations profile
curl http://127.0.0.1:8080/debug/pprof/allocs > /tmp/allocs.prof

If you visit /debug/pprof it will give a list of all the available endpoints and what they mean.

runtime/pprof package

This is similar to the net/http/pprof where add it to your application, but instead of adding for all of the project, you can specify a specific code path where you want to generate the profile. This can be useful when you are only interested in a certain part of your application and you want to sample only that part of the application. To read how to use it check the go reference.

You might also use this to label your application which can help you understand the profile better.

Reading profiles

Now that we know how to generate profiles, let’s see how we can read them to know what our application is doing.

The command that we will be using is go tool pprof.

Callgraph

For example to use the /debug/pprof endpoints that we have registered in our demo application, we can pass in the HTTP endpoint directly.

# Open new browser window with call graph after 30s profiling.
go tool pprof -http :9402 http://127.0.0.1:8080/debug/pprof/profile

Another option is to use curl to download the profile and then use go tool which might be useful to get profiles from production endpoints that aren’t exposed to the public internet.

# Server.
curl http://127.0.0.1:8080/debug/pprof/profile > /tmp/cpu.prof
# Locally after you get it from the server.
go tool pprof -http :9402 /tmp/cpu.prof

Notice that in all the commands we are passing the -http flag, this is optional because by default this opens the CLI interface.

Below you can see our demo application call graph. To better understand what it means you should read Interpreting the Callgraph

Example of a call graph of our demo application

Flame graphs

The callgraph is useful to see what the program is calling and can also help you understand where the application is spending time. An alternative way of understanding where the CPU time or memory allocation is going is to use a Flame Graph.

Using the same demo application let’s run go tool pprof once more:

# From endpoint
go tool pprof -http :9402 http://127.0.0.1:8080/debug/pprof/profile
# From the local profile
go tool pprof -http :9402 /tmp/cpu.prof

However this time we will use the top navigation bar go to View > Flame Graph

navigating to the flame graph

Then you should see something like below:

flame graph example

For you to better understand how to read flame graphs you can check out What Are Flame Graphs and How to Read Them, RubyConfBY 2017

You can also use speedscope which is a language-agnostic application to generate flame graphs from profiles and it’s a bit more interactive than the one provided from Go.

demo of speedscope

Comparing profiles

The go tool pprof also allows you to use to compare profiles to show you the difference between 1 profile and another.

Using our demo project once again we can generate profiles for the main and best branches.

# Run on `main`
curl http://127.0.0.1:8080/debug/pprof/profile > /tmp/main.prof

# Run on `best`
curl http://127.0.0.1:8080/debug/pprof/profile > /tmp/best.prof

# Compare profiles
go tool pprof -http :9402 --diff_base=/tmp/main.prof /tmp/best.prof

demo of profile diff

Traces

One last tool that we need to go through is the CPU tracer. This gives you an accurate representation of what was happening during the program execution. It can show you which cores are sitting idle and which ones are busy. This is great if you are debugging some concurrent code that isn’t performing as expected.

Using our demo application again let’s get the CPU trace using the /debug/pprof/trace endpoint that is added using the net/http/pprof.

curl http://127.0.0.1:8080/debug/pprof/trace > /tmp/best.trace

go tool trace /tmp/best.trace

A more detailed explanation of the tracer can be found over at Gopher Academy Blog

demo of trace

Resources