Analyze Lua Memory Leak With Systemtap


leak

Use systemtap to analyze the memory leak of lua code

The memory used by lua code is managed by the GC, not calling malloc/free/mmap directly. The luajit GC uses mark-and-sweep algorithm. In simple words, it links all allocated gc objects in a global list. When the memory is in pressure, it would trigger recycle procedures upon that list. Note that because of single-threading design, the GC is interspersed in the flow of lua code (either compiled or interpretered), e.g. after string allocation. GC would try to check for each gc object if it is still in used, e.g. as an upvalue, reside in a stack, etc. The GC would sweep all dangling gc objects at the end, and the corresponding memory would be freed.

In luajit, lj_alloc_malloc and lj_alloc_realloc would allocate gc objects, and lj_alloc_free would free them. So naturally, we could trace the invocations of these functions to check memory leak.

Note that triggering gc is a bit uncertain and the gc steps would be splited in different time slices. It would not free unused memory immediately. Moreover, the lua code may use caches. So it’s necessary to check the source code to distinguish the real memory leak.

systemtap script

I would use stapxx to analyze lua code running in openresty.

lua-leak.sxx

https://gist.github.com/kingluo/1df4736665a3381ce745f26cf17181d8#file-lua-leak-sxx

...
probe process("$^libluajit_path").function("lj_alloc_realloc").return,
      process("$^libluajit_path").function("lj_alloc_malloc").return
{
    if (gc_size_begin == 0) {
        gc_size_begin = get_gc_size()
    }

    gc_size_end = get_gc_size()

    if (tid() == target() && !quit) {
        ptr = returnval()
        bt = get_bt()
        if (ptr && bt != "") {
            if (alloc_cnt >= 100000) {
                report()
            }

            if (ptr2bt[ptr] == "") {
                alloc_cnt++
            }

            size = @entry($nsize)
            ptr2bt[ptr] = bt
            ptr2size[ptr] = size
            bt_stats[bt] <<< size
        }
    }
}

probe process("$^libluajit_path").function("lj_alloc_free")
{
    ptr = pointer_arg(2)
    if (tid() == target() && ptr && !quit) {
        bt = ptr2bt[ptr]
        delete ptr2bt[ptr]

        bytes = ptr2size[ptr]
        delete ptr2size[ptr]

        if (bt != "") {
            alloc_cnt--
            bt_stats[bt] <<< -bytes
            if (@sum(bt_stats[bt]) == 0) {
                delete bt_stats[bt]
            }
        }
    }
}
...

This script collects all memory not freed during the tracing period and corresponding backtraces, which supports both compiled code and interpretered lua code.

Note that it could not check memory allocated via ffi call. If lua code is clean, you could turn to use sample-bt-leaks.sxx to check possible memory leak outside lua code.

There is an interesting point here. Those alloc/free functions are not only called by the lua code, but also the lua vm itself. We could assume the vm itself has no memory leak, so how to exclude these noises?

In fact, the tapset luajit.sxx already does it for us.

The entrance of lua code from C is either lua_resume(), lua_pcall() or lua_call(). The exit of lua code is either lua_yield() or return.

The laujit vm has mainly below states:

  • LJ_VMST_INTERP default state
  • LJ_VMST_C, LJ_VMST_GC already in lua code
  • compiled already denotes you are in lua code, because each compiled trace are atomic execution.

The LJ_VMST_INTERP is a bit tricky, it does not means the lua vm is actually running lua code. When resume or call, it would set cframe, and when yield or return, cframe would be cleared.

Let’s check vm_x86.dasc (laujit x86/x86_64 assembly interpreter):

  |->vm_resume:				// Setup C frame and resume thread.
  ...
  |  lea KBASEa, [esp+CFRAME_RESUME]
  ...
  |  mov L:RB->cframe, KBASEa
  
  ...
  
  |.ffunc coroutine_yield
  ...
  |  xor RD, RD
  |  mov aword L:RB->cframe, RDa
  ...

The tapset would check the vm states and cframe to determine whether vm is excuting lua code, if not, the backtrace is returned as empty.

if (vmstate == -1 && !$*L->cframe) {
    return ""
}

if (vmstate == -1 || vmstate == -2 || vmstate == -3) {
    base = $*L->base
    bt .= luajit_debug_dumpstack(L, 0, $^arg_depth, base, simple)
}

A real case scenario of memory leak

A user found a memory leak issue.

1. Confirm

We should watch the memory usage in a whole first:

watch -n 3 'ps -p 91107 -o cmd,vsz,rss'

If over times, the memory keeps growing, then let’s check if the leak comes from the lua code:

./samples/lj-gc.sxx -D STP_NO_OVERLOAD -D MAXMAPENTRIES=150000 -D MAXACTION=10000 --skip-badvars -x 91107
...
Total GC count: 24407837 bytes

If the GC memory keeps growing, then it possibly comes from lua code.

2. Narrow down the scope

Use exclusion principle to narrow down the leak scope. If you are lucky, you could even solve the issue at this step.

3. Setup reproduce environment

It’s better to abstract a minimal and stable setup to reproduce the memory leak, e.g. constructing a minimal client workload, excluding unrelated dependencies, mocking a simple upstream server, etc.

4. Use systemtap to analyze

# mock the upstream
/usr/local/openresty/nginx/sbin/nginx -p $PWD -c nginx.conf -g "daemon off;"

# run workload
go run leak.go

# When the leak is stable, run systemtap script
./samples/lua-leaks.sxx -D STP_NO_OVERLOAD -D MAXMAPENTRIES=150000 -D MAXACTION=10000 --skip-badvars -arg time=60 -x 91107

The systemtap script outputs:

 T:@/opt/apisix//deps/share/lua/5.1/prometheus.lua:346
@/opt/apisix//deps/share/lua/5.1/prometheus.lua:346
@/opt/apisix//deps/share/lua/5.1/prometheus.lua:540
@/opt/apisix/apisix/plugins/prometheus/exporter.lua:210
@/opt/apisix/apisix/plugin.lua:884
@/opt/apisix/apisix/plugin.lua:961
@/opt/apisix/apisix/init.lua:314
@/opt/apisix/apisix/init.lua:725
=log_by_lua(nginx.conf:410):0
 : 0x0
        total 544320 bytes
 C:json_decode
C:json_protect_conversion
@/opt/apisix//deps/share/lua/5.1/resty/etcd/v3.lua:770
@/opt/apisix/apisix/core/config_etcd.lua:123
@/opt/apisix/apisix/core/config_etcd.lua:298
@/opt/apisix/apisix/core/config_etcd.lua:538
builtin#22
@/opt/apisix/apisix/core/config_etcd.lua:516
 : 0xc
        total 358888 bytes
 builtin#177
=coroutine_api:7
@/opt/apisix//deps/share/lua/5.1/resty/http.lua:495
 : 0xb
        total 323136 bytes

It points out the possible memory leak sources!

With the hints, I check the source code and figure out the real leak reason:

In nginx-lua-prometheus library, each metric instance maintains a lookup table to cache the full metric name of each label values combination. The prometheus plugin would update metrics at the log phase of each request and in turn update the lookup table. Because the workload generates new routes repeatlly, where the route id in random string each time, and old routes gets discarded, so the lookup table is inserted with new items over and over (and those items would not be freeed even if they are no used anymore), which causes memory leak.

Based on the memory leak reason, we could make a bugfix easily.

Bonus topic: performance difference between systemtap and gdb

It’s been a long history of linux tracing. Linux has many tracing sources. But here we only talk about two of them:

  • uprobe/kprobe (most popular ones)
  • ptrace (from the ancient ages)

The relevant front-end tools:

  • systemtap

The systemtap script has awk-like syntax. The script would be compiled into kernel module. It mainly uses kernel events and uprobe/kprobe as breakpoints.

  • gdb

It makes use of ptrace to hook the target program/process.

In simplicity, interrupts from both uprobe/kprobe and ptrace obey similar procedures:

  • Replace PC with INT3 (x86 debug instruction)
  • trigger SIGTRAP, handled by gdb or kernel
  • access the address space and registers of the target
  • recovers PC from the original instrcution and continue

Although uprobe/kprobe and ptrace are similar, but they have completely different efficiency.

gdb handles SIGTRAP in user space, interacts with the target in PEEK-POKE way, which involves a lot of context switchs. gdb aims to be an interactive debug tool (the gdb script is also low efficiency).

But systemtap works in kernel space only, and it runs in assembly code.

Let’s take an simple example to visualize the runtime efficiency.

nginx.conf

error_log /dev/stderr info;
worker_processes 1;

events {}

http {
    server {
        listen 10000;

        location /run {
            content_by_lua_block {
                local t = {}
                for i=1,33 do
                    for i=1,500 do
                        table.insert(t, {i=i})
                    end
                end
            }
        }
    }
}

lua-leak.sxx

#!/usr/bin/env stap++

probe process("$^libluajit_path").function("lj_alloc_realloc").return,
      process("$^libluajit_path").function("lj_alloc_malloc").return
{
}

probe process("$^libluajit_path").function("lj_alloc_free")
{
}

probe timer.s($^arg_time)
{
    exit()
}

leak.gdb

set logging off
b lj_gc.c:875
command 1
    continue
end

b lj_alloc_free
command 2
    continue
end

continue

Test:

# run nginx
/usr/local/openresty/nginx/sbin/nginx -p $PWD -c nginx.conf -g "daemon off;"

# run without systemtap and gdb

time curl http://localhost:9080/leak
ok

real    0m0.019s
user    0m0.006s
sys     0m0.006s

# run systemtap

./samples/lua-leaks.sxx -x 36998

time curl http://localhost:9080/leak
ok

real    0m0.119s
user    0m0.007s
sys     0m0.007s

# run gdb
gdb -iex 'set pagination off' -batch-silent -x leak.gdb -p 36998

time curl http://localhost:9080/leak
ok

real    0m16.866s
user    0m0.006s
sys     0m0.006s

systemtap slows down the program by 10x, but gdb is 1600x! Of course, here the workload is dedicated and trigger breakpoints frequently. In other cases, e.g. CPU flamegraph, the systemtap overhead is small and acceptable.

Conclusion

Systemtap is a useful profiling tool to find out and analyze the issues, e.g. CPU flamegraph, context inspect, memory leak, etc.