Golang: Benchmarking and Optimizing Code

I have been working with Golang since 3 years ago, I love it because of its performance and lightweight. Golang also has good built-in tools that absolutely help you in development. One of that is pprof, a tools for profiling data and visualize it. By analyzing profiling data from pprof, we will able to identify which part that consume lots of memory and time. Awesome right!!!

In my previous company 2018, I was facing an OOM issue after using a 3rd party package in my service. There was OK in local and staging, but after deploying to production, BOOM!!! The memory usage is increasing drastically, reached 100%, and down. I have no idea whats going on because there is no error recorded in the log (before reached 100%). The only thing that different between production and staging is the traffic.

After debugging for some hours, I was very lucky to know Golang has pprof. I was able to identify the culprit and fix the code. I was amazed by this tools, I even did a little sharing session after the outage to spread it across my engineer division. And finally I can share in this article as well. Please find the material on https://github.com/w-bt/benchmark_v2

Prerequisite

Golang 1.11 or newer (I am using 1.14 when I created this article)
Benchcmp (go get golang.org/x/tools/cmd/benchcmp)
Any editor

Test Case

Let’s say we have a lot of products data that stored in global variable products. Each product has name and code. Function GenerateProduct() (in data.go) generates products, we can also assume this data coming from database / other service. I’ve created a simple end point to get the product based on its code.

package main

import (
	"log"
	"net/http"
	"regexp"
)

var products map[string]*Product

func init() {
	GenerateProduct()
}

func main() {
	log.Printf("Starting on port 1234")
	http.HandleFunc("/product", handleProduct)
	log.Fatal(http.ListenAndServe("127.0.0.1:1234", nil))
}

func handleProduct(w http.ResponseWriter, r *http.Request) {
	code := r.FormValue("code")
	if match, _ := regexp.MatchString(`^[A-Z]{2}[0-9]{2}$`, code); !match {
		http.Error(w, "code is invalid", http.StatusBadRequest)
		return
	}

	result := findProduct(products, code)

	if result.Code == "" {
		http.Error(w, "Data Not Found", http.StatusBadRequest)
		return
	}

	w.Header().Set("Content-Type", "text/html; charset=utf-8")
	w.Write([]byte(`<font size="10">Product Code : ` + result.Code + ` Name :` + result.Name + `</font>`))
}

func findProduct(Products map[string]*Product, code string) Product {
	for _, item := range Products {
		if code == (*item).Code {
			return *item
		}
	}

	return Product{}
}

Run It

$ go build && ./benchmark

Open browser and hit http://localhost:1234/product?code={code}.

Test

In order to understand the intent of the code, bellow is the happy path test

package main

import (
	"bufio"
	"net/http"
	"net/http/httptest"
	"os"
	"strings"
	"testing"
)

func TestMain(m *testing.M) {
	GenerateProduct()

	code := m.Run()

	os.Exit(code)
}

func TestHandleProduct(t *testing.T) {
	r, _ := http.NewRequest("GET", "/product?code=AA11", nil)
	w := httptest.NewRecorder()
	handleProduct(w, r)
	if !strings.Contains(w.Body.String(), "Product Code") {
		t.Error("An error was not expected")
	}
}

$ go test -cover -race -v
=== RUN   TestHandleProduct
--- PASS: TestHandleProduct (0.00s)
PASS
coverage: 73.3% of statements
ok  	github.com/w-bt/benchmark	2.048s

GO TEST BENCHMARK

Here we go. We need to hit the end point for multiple times to get accurate profiling data. Lets use testing.B instead of testing.T (the B refer to Benchmark). In this benchmark test, please note:

b.ReportAllocs() is necessary to enable malloc statistics
need to reset the response for each call (see func reset(rw *httptest.ResponseRecorder))

func BenchmarkHandleProduct(b *testing.B) {
	b.ReportAllocs()
	r, _ := http.ReadRequest(bufio.NewReader(strings.NewReader("GET /product?code=ZZ99 HTTP/1.0\r\n\r\n")))
	rw := httptest.NewRecorder()
	for i := 0; i < b.N; i++ {
		handleProduct(rw, r)
		rw.Body.Reset()
	}
}

Then execute it for some times (in this case 30000 times). The longer we take, the more accurate result we get. Flag cpuprofile is used to generate profiling data in time unit, flag memprofile is used to generate profiling data in bytes unit, flag o is used to save test binary to other file name.

$ go test -v -bench=. -benchtime=30000x -cpuprofile=prof.cpu -memprofile=prof.mem -o bench.test | tee prof.txt
goos: darwin
goarch: amd64
pkg: github.com/w-bt/benchmark_v2/opt_1
BenchmarkHandleProduct
BenchmarkHandleProduct-12          30000           1295360 ns/op            3354 B/op         49 allocs/op
PASS
ok      github.com/w-bt/benchmark_v2/opt_1      40.408s

This command produces 4 new files:

binary test (bench.test)
cpu profile (prof.cpu)
memory profile (prof.mem)
benchmark result (prof1)

Based on the data above, below is detailed result:

total execution: 30000 times
duration each operation: 1058075 ns/op
each iteration costs: 3354 bytes with 49 allocations

CPU Profiling

Now we are done with profiling, lets take a look the result

$ go tool pprof bench.test prof.cpu
File: bench.test
Type: cpu
Time: Mar 4, 2021 at 11:36am (WIB)
Duration: 31.67s, Total samples = 30.06s (94.92%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 34.55s, 95.89% of 36.03s total
Dropped 104 nodes (cum <= 0.18s)
Showing top 10 nodes out of 29
      flat  flat%   sum%        cum   cum%
    15.21s 42.21% 42.21%     15.21s 42.21%  memeqbody
     7.42s 20.59% 62.81%     31.26s 86.76%  github.com/w-bt/benchmark_v2/opt_1.findProduct
     6.89s 19.12% 81.93%      8.13s 22.56%  runtime.mapiternext
     2.83s  7.85% 89.79%      2.83s  7.85%  aeshashbody
     0.50s  1.39% 91.17%      0.50s  1.39%  runtime.memequal
     0.48s  1.33% 92.51%      0.48s  1.33%  runtime.madvise
     0.45s  1.25% 93.76%      0.45s  1.25%  runtime.nanotime1
     0.35s  0.97% 94.73%      0.35s  0.97%  runtime.add
     0.21s  0.58% 95.31%      0.21s  0.58%  runtime.(*bmap).overflow (inline)
     0.21s  0.58% 95.89%      0.21s  0.58%  runtime.(*maptype).indirectkey
(pprof) top5
Showing nodes accounting for 32850ms, 91.17% of 36030ms total
Dropped 104 nodes (cum <= 180.15ms)
Showing top 5 nodes out of 29
      flat  flat%   sum%        cum   cum%
   15210ms 42.21% 42.21%    15210ms 42.21%  memeqbody
    7420ms 20.59% 62.81%    31260ms 86.76%  github.com/w-bt/benchmark_v2/opt_1.findProduct
    6890ms 19.12% 81.93%     8130ms 22.56%  runtime.mapiternext
    2830ms  7.85% 89.79%     2830ms  7.85%  aeshashbody
     500ms  1.39% 91.17%      500ms  1.39%  runtime.memequal
(pprof) top --cum
Showing nodes accounting for 32.36s, 89.81% of 36.03s total
Dropped 104 nodes (cum <= 0.18s)
Showing top 10 nodes out of 29
      flat  flat%   sum%        cum   cum%
     0.01s 0.028% 0.028%     34.63s 96.11%  github.com/w-bt/benchmark_v2/opt_1.BenchmarkHandleProduct
         0     0% 0.028%     34.63s 96.11%  testing.(*B).launch
         0     0% 0.028%     34.63s 96.11%  testing.(*B).runN
         0     0% 0.028%     34.62s 96.09%  github.com/w-bt/benchmark_v2/opt_1.handleProduct
     7.42s 20.59% 20.62%     31.26s 86.76%  github.com/w-bt/benchmark_v2/opt_1.findProduct
    15.21s 42.21% 62.84%     15.21s 42.21%  memeqbody
     6.89s 19.12% 81.96%      8.13s 22.56%  runtime.mapiternext
         0     0% 81.96%      2.84s  7.88%  net/http.Header.Set (inline)
         0     0% 81.96%      2.84s  7.88%  net/textproto.MIMEHeader.Set (inline)
     2.83s  7.85% 89.81%      2.83s  7.85%  aeshashbody
(pprof)

Detailed informations:

top: show 10 top functions that takes a lot of resources
top5: show 5 top functions that takes a lot of resources
top –cum: show 10 top function that takes a lot of resources based on cum (descending)
flat: the duration of direct operation inside the function (function call doesn’t count as flat)
flat%: flat percentage (flat/total flats)*100
sum%: sum of current flat% and previous flat%, sum% could help you to identify the big rocks quickly
cum: cumulative duration for the function, it is the value of the location plus all its descendants.
cum%: cumulative percentage (cum/total cum)*100

Flat and Sum

Assuming there is a function foo, which is composed of 2 functions and a direct operation.

func foo(){
    a()                                 // step1
    direct operation                    // step2
    b()                                 // step3
}

In example when we hit foo, it takes 4 seconds, and the time distribution are following.

func foo(){
    a()                                 // step1 takes 1s
    direct operation                    // step2 takes 2s
    b()                                 // step3 takes 1s
}

flat would be the time spent on step2. cum would be the total execution time of foo, which contains sub-function call and direct operations. (cum = step1 + step2 + step3)

Now, based on data above, func findProduct takes 26.40s, hmmm.. suspicious isn’t it? lets see more detail:

(pprof) list handleProduct
Total: 36.03s
ROUTINE ======================== github.com/w-bt/benchmark_v2/opt_1.handleProduct in /Users/pt.gojekindonesia/Documents/code/go/src/github.com/w-bt/benchmark_v2/opt_1/main.go
         0     34.62s (flat, cum) 96.09% of Total
         .          .     18:   http.HandleFunc("/product", handleProduct)
         .          .     19:   log.Fatal(http.ListenAndServe("127.0.0.1:1234", nil))
         .          .     20:}
         .          .     21:
         .          .     22:func handleProduct(w http.ResponseWriter, r *http.Request) {
         .       20ms     23:   code := r.FormValue("code")
         .      430ms     24:   if match, _ := regexp.MatchString(`^[A-Z]{2}[0-9]{2}$`, code); !match {
         .          .     25:           http.Error(w, "code is invalid", http.StatusBadRequest)
         .          .     26:           return
         .          .     27:   }
         .          .     28:
         .     31.26s     29:   result := findProduct(products, code)
         .          .     30:
         .          .     31:   if result.Code == "" {
         .          .     32:           http.Error(w, "data not found", http.StatusBadRequest)
         .          .     33:           return
         .          .     34:   }
         .          .     35:
         .      2.84s     36:   w.Header().Set("Content-Type", "text/html; charset=utf-8")
         .       70ms     37:   w.Write([]byte(`<font size="10">Product Code : ` + result.Code + ` Name :` + result.Name + `</font>`))
         .          .     38:}
         .          .     39:
         .          .     40:func findProduct(Products map[string]*Product, code string) *Product {
         .          .     41:   for _, item := range Products {
         .          .     42:           if code == (*item).Code {
(pprof) list findProduct
Total: 36.03s
ROUTINE ======================== github.com/w-bt/benchmark_v2/opt_1.findProduct in /Users/pt.gojekindonesia/Documents/code/go/src/github.com/w-bt/benchmark_v2/opt_1/main.go
     7.42s     31.26s (flat, cum) 86.76% of Total
         .          .     36:   w.Header().Set("Content-Type", "text/html; charset=utf-8")
         .          .     37:   w.Write([]byte(`<font size="10">Product Code : ` + result.Code + ` Name :` + result.Name + `</font>`))
         .          .     38:}
         .          .     39:
         .          .     40:func findProduct(Products map[string]*Product, code string) *Product {
     760ms      8.89s     41:   for _, item := range Products {
     6.66s     22.37s     42:           if code == (*item).Code {
         .          .     43:                   return item
         .          .     44:           }
         .          .     45:   }
         .          .     46:
         .          .     47:   return &Product{}
(pprof)

To see in UI form, use web:

Found it, it is because we do looping all of the products and compare it one by one. Even it has O(n), it doesn’t make sense if we have million or billion data.

Interpreting The CallGraph

Node Color:
- large positive cum values are red.
- cum values close to zero are grey.
Node Font Size:
- larger font size means larger absolute flat values.
- smaller font size means smaller absolute flat values.
Edge Weight:
- thicker edges indicate more resources were used along that path.
- thinner edges indicate fewer resources were used along that path.
Edge Color:
- large positive values are red.
- values close to zero are grey.
Dashed Edges: some locations between the two connected locations were removed.
Solid Edges: one location directly calls the other.
“(inline)” Edge Marker: the call has been inlined into the caller.

Memory Profiling

Similar with CPU Profiling, execute this command

$ go tool pprof bench.test prof.mem
File: bench.test
Type: alloc_space
Time: Mar 14, 2021 at 4:05pm (WIB)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top --cum
Showing nodes accounting for 32MB, 29.99% of 106.70MB total
Dropped 3 nodes (cum <= 0.53MB)
Showing top 10 nodes out of 39
      flat  flat%   sum%        cum   cum%
         0     0%     0%    96.52MB 90.46%  github.com/w-bt/benchmark_v2/opt_1.BenchmarkHandleProduct
    3.50MB  3.28%  3.28%    96.52MB 90.46%  github.com/w-bt/benchmark_v2/opt_1.handleProduct
         0     0%  3.28%    96.52MB 90.46%  testing.(*B).launch
         0     0%  3.28%    96.52MB 90.46%  testing.(*B).runN
         0     0%  3.28%    92.02MB 86.24%  regexp.Compile (inline)
         0     0%  3.28%    92.02MB 86.24%  regexp.MatchString
    5.50MB  5.16%  8.44%    92.02MB 86.24%  regexp.compile
         0     0%  8.44%    31.01MB 29.06%  regexp.compileOnePass
       3MB  2.81% 11.25%       25MB 23.43%  regexp/syntax.Parse
      20MB 18.75% 29.99%       20MB 18.75%  regexp/syntax.(*parser).newRegexp (inline)
(pprof) list handleProduct
Total: 106.70MB
ROUTINE ======================== github.com/w-bt/benchmark_v2/opt_1.handleProduct in /Users/pt.gojekindonesia/Documents/code/go/src/github.com/w-bt/benchmark_v2/opt_1/main.go
    3.50MB    96.52MB (flat, cum) 90.46% of Total
         .          .     19:   log.Fatal(http.ListenAndServe("127.0.0.1:1234", nil))
         .          .     20:}
         .          .     21:
         .          .     22:func handleProduct(w http.ResponseWriter, r *http.Request) {
         .          .     23:   code := r.FormValue("code")
         .    92.02MB     24:   if match, _ := regexp.MatchString(`^[A-Z]{2}[0-9]{2}$`, code); !match {
         .          .     25:           http.Error(w, "code is invalid", http.StatusBadRequest)
         .          .     26:           return
         .          .     27:   }
         .          .     28:
         .          .     29:   result := findProduct(products, code)
         .          .     30:
         .          .     31:   if result.Code == "" {
         .          .     32:           http.Error(w, "data not found", http.StatusBadRequest)
         .          .     33:           return
         .          .     34:   }
         .          .     35:
         .        1MB     36:   w.Header().Set("Content-Type", "text/html; charset=utf-8")
    3.50MB     3.50MB     37:   w.Write([]byte(`<font size="10">Product Code : ` + result.Code + ` Name :` + result.Name + `</font>`))
         .          .     38:}
         .          .     39:
         .          .     40:func findProduct(Products map[string]*Product, code string) *Product {
         .          .     41:   for _, item := range Products {
         .          .     42:           if code == (*item).Code {
(pprof) list MatchString
Total: 106.70MB
ROUTINE ======================== regexp.MatchString in /usr/local/Cellar/go/1.14.2/libexec/src/regexp/regexp.go
         0    92.02MB (flat, cum) 86.24% of Total
         .          .    526:
         .          .    527:// MatchString reports whether the string s
         .          .    528:// contains any match of the regular expression pattern.
         .          .    529:// More complicated queries need to use Compile and the full Regexp interface.
         .          .    530:func MatchString(pattern string, s string) (matched bool, err error) {
         .    92.02MB    531:   re, err := Compile(pattern)
         .          .    532:   if err != nil {
         .          .    533:           return false, err
         .          .    534:   }
         .          .    535:   return re.MatchString(s), nil
         .          .    536:}
(pprof) web handleProduct

Please ignore GenerateProduct (as I mention before, we can assume the data coming from database). Based on data above, regexp.MatchString takes the most memory. It takes 92.02MB!!! After digging deeper, it is make sense because we need to compile everytime we execute it. Imagine if we have more complex pattern, hahaha

Another thing we should consider is http response writer, It takes 3MB. Even it is not a big deal comparing to the regex issue, we can tune it.

Another Way

Use Web Version!!!! Open localhost:8081, it shown beautiful graph, you’ll love it!

$ go tool pprof -http=":8081" [binary] [profile]

OPTIMIZATION

UPDATE FINDPRODUCT()

why looping it if we can lookup the hash. It improves the complexity from O(n) to O(1).

func findProduct(Products map[string]*Product, code string) Product {
	if v, ok := Products[code]; ok {
		return *v
	}

	return Product{}
}

$ go test -v -bench=. -benchtime=30000x -cpuprofile=prof.cpu -memprofile=prof.mem -o bench.test | tee prof.txt
goos: darwin
goarch: amd64
pkg: github.com/w-bt/benchmark/opt_2
BenchmarkHandleProduct
BenchmarkHandleProduct-12          30000              4529 ns/op            3353 B/op         49 allocs/op
PASS
ok      github.com/w-bt/benchmark/opt_2 1.618s

Now we can compare prof1 (before) and prof2 (after) and see the different (delta)

$ benchcmp ../opt_1/prof.txt prof.txt
benchmark                     old ns/op     new ns/op     delta
BenchmarkHandleProduct-12     1295360       4529          -99.65%

benchmark                     old allocs     new allocs     delta
BenchmarkHandleProduct-12     49             49             +0.00%

benchmark                     old bytes     new bytes     delta
BenchmarkHandleProduct-12     3354          3353          -0.03%

In operation time, the delta is -99.65%, so it improve around ~286x.

UPDATE REGEX

The idea is compiling regex first and use it multiple times

codeRegex = regexp.MustCompile(`^[A-Z]{2}[0-9]{2}$`)
// . . .
if match := codeRegex.MatchString(code); !match {
	// . . .
}

$ go test -v -bench=. -benchtime=30000x -cpuprofile=prof.cpu -memprofile=prof.mem -o bench.test | tee prof.txt
goos: darwin
goarch: amd64
pkg: github.com/w-bt/benchmark/opt_3
BenchmarkHandleProduct
BenchmarkHandleProduct-12          30000               386 ns/op             144 B/op          3 allocs/op
PASS
ok      github.com/w-bt/benchmark/opt_3 1.485s

$ benchcmp ../opt_2/prof.txt prof.txt
benchmark                     old ns/op     new ns/op     delta
BenchmarkHandleProduct-12     4529          386           -91.48%

benchmark                     old allocs     new allocs     delta
BenchmarkHandleProduct-12     49             3              -93.88%

benchmark                     old bytes     new bytes     delta
BenchmarkHandleProduct-12     3353          144           -95.71%

It improves all of aspects, operation times, allocations, and bytes consuming around 9x better.

UPDATE CONCATE STRING

The idea is using bytes.Buffer. It allows us to reuse memory without allocate.

var buf       = new(bytes.Buffer)
// . . .
buf.Reset()
buf.WriteString(`<font size="10">Product Code : `)
buf.WriteString(result.Code)
buf.WriteString(` Name :`)
buf.WriteString(result.Name)
buf.WriteString(`</font>`)
w.Write(buf.Bytes())

$ go test -v -bench=. -benchtime=30000x -cpuprofile=prof.cpu -memprofile=prof.mem -o bench.test | tee prof.txt
goos: darwin
goarch: amd64
pkg: github.com/w-bt/benchmark/opt_4
BenchmarkHandleProduct
BenchmarkHandleProduct-12          30000               300 ns/op              16 B/op          1 allocs/op
PASS
ok      github.com/w-bt/benchmark/opt_4 1.221s

$ benchcmp ../opt_3/prof.txt prof.txt
benchmark                     old ns/op     new ns/op     delta
BenchmarkHandleProduct-12     386           300           -22.28%

benchmark                     old allocs     new allocs     delta
BenchmarkHandleProduct-12     3              1              -66.67%

benchmark                     old bytes     new bytes     delta
BenchmarkHandleProduct-12     144           16            -88.89%

Yup, it improves a little for operation times, and improves lots of allocations. It prevents the generation of the unnecessary string object, means it doesn’t generate a new string like in + operator from two or more strings.

CONCLUSION

Golang has a lot of rich toolset, pprof is only one of them. With pprof we can identify memory leaks and improve the performance.

Category: Golang, Programming