Use systemtap to analyze memory leaks in lua code
The memory used by lua code is managed by the GC; it does not call malloc/free/mmap directly.
The luajit GC uses a mark-and-sweep algorithm. In simple words, it links all allocated gc objects into a global list. When memory is under pressure, it triggers a collection pass over that list. Note that because of the single-threaded design, GC work is interspersed in the flow of lua code (either compiled or interpreted), e.g. after a string allocation. The GC checks, for each gc object, whether it is still in use, e.g. referenced as an upvalue, residing on a stack, etc. At the end it sweeps all unreachable gc objects and frees the corresponding memory.
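To make the reachability idea concrete, here is a tiny Lua sketch of my own (not taken from the luajit sources): an object referenced only from a weak-valued table is unreachable from the GC's point of view, so a full collection sweeps it.
-- demo.lua: reachability illustration (plain Lua, runnable with luajit)
-- A weak-valued table does not keep its values alive, so once the inner table
-- has no other reference, a full mark-and-sweep cycle reclaims it.
local holder = setmetatable({}, { __mode = "v" })
holder[1] = { "some gc object" }
collectgarbage("collect")  -- full GC cycle: the inner table is unreachable and gets swept
print(holder[1])           -- prints nil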
In luajit, lj_alloc_malloc and lj_alloc_realloc allocate gc objects, and lj_alloc_free frees them. So naturally, we can trace the invocations of these functions to check for memory leaks.
Note that when the gc is triggered is somewhat nondeterministic, and the gc steps are split across different time slices, so unused memory is not freed immediately. Moreover, the lua code may use caches. So it's necessary to check the source code to distinguish a real memory leak from memory that is merely retained.
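One quick sanity check is to force a full GC cycle and look at the reported heap size: if the number keeps growing across full collections, the memory is still reachable, i.e. a real leak or an unbounded cache rather than garbage that simply has not been collected yet. A minimal sketch, using only the standard collectgarbage API:
-- check_gc.lua: force full GC cycles and report the Lua heap size in KB
collectgarbage("collect")
collectgarbage("collect")  -- run twice so objects kept alive by finalizers are also swept
local kb = collectgarbage("count")
print(string.format("lua heap after full GC: %.1f KB", kb))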
systemtap script
I would use stapxx to analyze lua code running in openresty.
lua-leak.sxx
https://gist.github.com/kingluo/1df4736665a3381ce745f26cf17181d8#file-lua-leak-sxx
...
probe process("$^libluajit_path").function("lj_alloc_realloc").return,
      process("$^libluajit_path").function("lj_alloc_malloc").return
{
    if (gc_size_begin == 0) {
        gc_size_begin = get_gc_size()
    }
    gc_size_end = get_gc_size()
    if (tid() == target() && !quit) {
        ptr = returnval()
        bt = get_bt()
        if (ptr && bt != "") {
            if (alloc_cnt >= 100000) {
                report()
            }
            if (ptr2bt[ptr] == "") {
                alloc_cnt++
            }
            # record the lua-level backtrace and size of each live allocation
            size = @entry($nsize)
            ptr2bt[ptr] = bt
            ptr2size[ptr] = size
            bt_stats[bt] <<< size
        }
    }
}
probe process("$^libluajit_path").function("lj_alloc_free")
{
    ptr = pointer_arg(2)
    if (tid() == target() && ptr && !quit) {
        # the allocation is freed: remove it from the live set
        bt = ptr2bt[ptr]
        delete ptr2bt[ptr]
        bytes = ptr2size[ptr]
        delete ptr2size[ptr]
        if (bt != "") {
            alloc_cnt--
            bt_stats[bt] <<< -bytes
            if (@sum(bt_stats[bt]) == 0) {
                delete bt_stats[bt]
            }
        }
    }
}
...
This script collects all memory not freed during the tracing period along with the corresponding backtraces, and it supports both compiled code and interpreted lua code.
Note that it cannot track memory allocated via ffi calls. If the lua code is clean, you could turn to sample-bt-leaks.sxx to check for possible memory leaks outside the lua code.
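For example, memory obtained from libc through the FFI never goes through lj_alloc_*, so the probes above cannot see it (the snippet below is my own illustration):
-- ffi_alloc.lua: an allocation invisible to the lj_alloc_* probes
local ffi = require "ffi"
ffi.cdef[[
void *malloc(size_t size);
void free(void *ptr);
]]
local buf = ffi.C.malloc(1024 * 1024)  -- goes straight to libc, bypassing the luajit allocator
-- if this pointer is lost without calling ffi.C.free(buf), only a malloc-level
-- tool such as sample-bt-leaks.sxx can catch the leak
ffi.C.free(buf)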
There is an interesting point here. Those alloc/free functions are not only called by the lua code, but also by the lua vm itself. We can assume the vm itself has no memory leak, so how do we exclude this noise?
In fact, the tapset luajit.sxx already does it for us.
The entry into lua code from C is either lua_resume(), lua_pcall() or lua_call().
The exit from lua code is either lua_yield() or a return.
The luajit vm mainly has the following states:
- LJ_VMST_INTERP: the default state
- LJ_VMST_C, LJ_VMST_GC: already in lua code
- compiled: already denotes you are in lua code, because each compiled trace is executed atomically
The LJ_VMST_INTERP state is a bit tricky: it does not mean the lua vm is actually running lua code. On resume or call, the vm sets cframe; on yield or return, cframe is cleared.
Let's check vm_x86.dasc (the luajit x86/x86_64 assembly interpreter):
|->vm_resume: // Setup C frame and resume thread.
...
| lea KBASEa, [esp+CFRAME_RESUME]
...
| mov L:RB->cframe, KBASEa
...
|.ffunc coroutine_yield
...
| xor RD, RD
| mov aword L:RB->cframe, RDa
...
The tapset checks the vm states and cframe to determine whether the vm is executing lua code; if not, the backtrace is returned as empty.
A real-world memory leak scenario
A user found a memory leak issue.
1. Confirm
First, we should watch the overall memory usage of the process over time.
If the memory keeps growing over time, then let's check whether the leak comes from the lua code:
./samples/lj-gc.sxx -D STP_NO_OVERLOAD -D MAXMAPENTRIES=150000 -D MAXACTION=10000 --skip-badvars -x 91107
...
Total GC count: 24407837 bytes
If the GC-managed memory keeps growing, the leak likely comes from the lua code.
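As a complement to lj-gc.sxx, you can also log the GC heap size from inside the process. A minimal sketch for openresty, assuming it runs in an init_worker_by_lua_block (the 10-second interval is arbitrary):
-- gc_watch: periodically log the Lua GC heap size to the nginx error log,
-- so that steady growth is visible without attaching an external tool
local function log_gc(premature)
    if premature then
        return
    end
    ngx.log(ngx.WARN, "lua GC size: ", collectgarbage("count"), " KB")
end

local ok, err = ngx.timer.every(10, log_gc)
if not ok then
    ngx.log(ngx.ERR, "failed to create gc watch timer: ", err)
end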
2. Narrow down the scope
Use the principle of exclusion to narrow down the scope of the leak, e.g. disable suspect plugins or code paths one at a time. If you are lucky, you could even solve the issue at this step.
3. Set up a reproduction environment
It’s better to abstract a minimal and stable setup to reproduce the memory leak, e.g. constructing a minimal client workload, excluding unrelated dependencies, mocking a simple upstream server, etc.
4. Use systemtap to analyze
The systemtap script (lua-leak.sxx) outputs:
T:@/opt/apisix//deps/share/lua/5.1/prometheus.lua:346
@/opt/apisix//deps/share/lua/5.1/prometheus.lua:346
@/opt/apisix//deps/share/lua/5.1/prometheus.lua:540
@/opt/apisix/apisix/plugins/prometheus/exporter.lua:210
@/opt/apisix/apisix/plugin.lua:884
@/opt/apisix/apisix/plugin.lua:961
@/opt/apisix/apisix/init.lua:314
@/opt/apisix/apisix/init.lua:725
=log_by_lua(nginx.conf:410):0
: 0x0
total 544320 bytes
C:json_decode
C:json_protect_conversion
@/opt/apisix//deps/share/lua/5.1/resty/etcd/v3.lua:770
@/opt/apisix/apisix/core/config_etcd.lua:123
@/opt/apisix/apisix/core/config_etcd.lua:298
@/opt/apisix/apisix/core/config_etcd.lua:538
builtin#22
@/opt/apisix/apisix/core/config_etcd.lua:516
: 0xc
total 358888 bytes
builtin#177
=coroutine_api:7
@/opt/apisix//deps/share/lua/5.1/resty/http.lua:495
: 0xb
total 323136 bytes
It points out the possible memory leak sources!
With these hints, I checked the source code and figured out the real cause of the leak:
In the nginx-lua-prometheus library, each metric instance maintains a lookup table that caches the full metric name for each combination of label values. The prometheus plugin updates metrics in the log phase of each request, and in turn updates the lookup table. Because the workload repeatedly generates new routes, each with a random route id, and old routes get discarded, the lookup table is filled with new items over and over (and those items are never freed even when they are no longer used), which causes the memory leak.
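The pattern looks roughly like the sketch below. This is my own simplification, not the actual nginx-lua-prometheus code, and the names are made up:
-- A module-level cache keyed by the label-values combination.
-- With high-cardinality labels (e.g. a random route id), new keys are
-- inserted on every new route and nothing is ever evicted.
local lookup = {}

local function full_metric_name(name, label_values)
    local key = name .. "{" .. table.concat(label_values, ",") .. "}"
    local full_name = lookup[key]
    if not full_name then
        full_name = key          -- stand-in for the real (expensive) name construction
        lookup[key] = full_name  -- cached for the worker's lifetime: this is the leak
    end
    return full_name
end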
Knowing the cause of the leak, we can make a bugfix easily.
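For example (not necessarily the fix that was actually merged), bounding the cache with lua-resty-lrucache caps its memory usage, at the cost of occasionally rebuilding evicted names:
-- Sketch: replace the unbounded table with an LRU cache of fixed capacity
local lrucache = require "resty.lrucache"

local lookup, err = lrucache.new(1000)  -- keep at most 1000 entries; the size is illustrative
if not lookup then
    error("failed to create lru cache: " .. (err or "unknown"))
end

local function full_metric_name(name, label_values)
    local key = name .. "{" .. table.concat(label_values, ",") .. "}"
    local full_name = lookup:get(key)
    if not full_name then
        full_name = key             -- stand-in for the real name construction
        lookup:set(key, full_name)  -- old, unused entries are evicted automatically
    end
    return full_name
end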
Bonus topic: performance difference between systemtap and gdb
Linux tracing has a long history, and Linux offers many tracing sources. Here we only talk about two of them:
- uprobe/kprobe (most popular ones)
- ptrace (from the ancient ages)
The relevant front-end tools:
- systemtap
The systemtap script has an awk-like syntax. The script is compiled into a kernel module. It mainly uses kernel events and uprobe/kprobe as breakpoints.
- gdb
It makes use of ptrace to hook the target program/process.
Simply put, interrupts from both uprobe/kprobe and ptrace follow a similar procedure:
- replace the instruction at PC with INT3 (the x86 debug instruction)
- trigger SIGTRAP, handled by gdb or the kernel
- access the address space and registers of the target
- restore the original instruction at PC and continue
Although uprobe/kprobe and ptrace are similar, they have completely different efficiency.
gdb handles SIGTRAP in user space and interacts with the target via PEEK/POKE operations, which involves a lot of context switches. gdb is designed to be an interactive debugging tool (gdb scripting is also inefficient).
Systemtap, on the other hand, does its work in kernel space: the probe handlers run as native code compiled into a kernel module.
Let's take a simple example to visualize the difference in runtime efficiency.
nginx.conf
error_log /dev/stderr info;
worker_processes 1;
events {}
http {
    server {
        listen 10000;
        location /run {
            content_by_lua_block {
                local t = {}
                for i=1,33 do
                    for i=1,500 do
                        table.insert(t, {i=i})
                    end
                end
            }
        }
    }
}
lua-leak.sxx (stripped down to empty probe bodies, so only the probe overhead itself is measured)
#!/usr/bin/env stap++
probe process("$^libluajit_path").function("lj_alloc_realloc").return,
process("$^libluajit_path").function("lj_alloc_malloc").return
{
}
probe process("$^libluajit_path").function("lj_alloc_free")
{
}
probe timer.s($^arg_time)
{
exit()
}
leak.gdb
set logging off
b lj_gc.c:875
command 1
continue
end
b lj_alloc_free
command 2
continue
end
continue
Test:
systemtap slows down the program by 10x, but gdb by 1600x! Of course, this workload is deliberately constructed to trigger the breakpoints frequently. In other cases, e.g. sampling for a CPU flamegraph, the systemtap overhead is small and acceptable.
Conclusion
Systemtap is a useful profiling tool for finding and analyzing issues, e.g. CPU flamegraphs, context inspection, memory leaks, etc.