Nginx Blocking

clock

What’s blocking?

Nginx has one master process and multiple worker processes, and each worker process is single threading.

It’s well known that NGINX uses an asynchronous, event‑driven approach to handling connections. This means that instead of creating another dedicated process or thread for each request (like servers with a traditional architecture), it handles multiple connections and requests in one worker process. To achieve this, NGINX works with sockets in a non‑blocking mode and uses efficient methods such as epoll and kqueue.

However, once the connection is accepted in one worker process, it could not be moved to other processes, even when that process is busy. Not like golang, it has no work stealing. And, only socket is asynchronous, the cpu-intensive tasks and file io tasks is still synchronous.

Every phase handler for an http request runs in a single thread. OpenResty allows you to run lua code in each phase. The rewrite, access, content phases, as well as the timer (implemented as faked connection) are running lua code within an individual lua coroutine. Within the coroutine, you could use cosocket to communicate with the world. The cosocket is 100% nonblocking out of the box. When you call cosocket API and it needs to wait for more data, it would yield the coroutine, register corresponding event in epoll and let the event-loop handles other events.

As the name suggests, coroutines use the CPU in cooperative way. When you block a long time within one coroutine, other coroutines have no way to get executed, because all coroutines reside in the one thread, which results in uneven request handling and long request delay.

When I work in Kugou Music, my team encounters blocking issue, and I wrote a simple tool to diagnose it. This tool makes use of LD_PRELOAD to hook the lua_resume and check the function execution time, and print the url and entry lua function for long execution.

In fact, this tool is not good, because it could not print the backtrace and does not work for compiled lua code.

Locate the blocking source

The best tool to check blocking should be systemtap.

The blocking comes from two things:

  • cpu-intensive tasks
  • blocking system calls

So just check the execution time, if the time exceed a threshold, print the backtraces collected during that time.

You could find the example code here:

https://gist.github.com/kingluo/50b8589069f5f11a78d2136c1155d613

workload.lua

The workload simulates both cpu-intensive tasks (random string generation and MD5 calculation) and calls blocking system calls (blocking socket to access httpbin).

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
local http = require("socket.http")
local cjson = require("cjson")

math.randomseed(os.time())

local _M = {}

local function random_str(length)
    local res = ""
    for i = 1, length do
        res = res .. string.char(math.random(97, 122))
    end
    return res
end

local function run_md5()
    local cnt = 0
    local buf = random_str(81920)
    for _= 1,1000 do
        cnt = cnt + #ngx.md5(buf)
    end
    return cnt
end

function _M.run()
    local cnt = run_md5()
    ngx.say(cnt)
    local r, c, h = http.request {
        method = "GET",
        url = "http://httpbin.org/delay/1"
    }
    assert(r == 1)
    ngx.say("status=", c)
    ngx.say(cjson.encode(h))
end

return _M

nginx.conf

error_log /dev/stderr info;
worker_processes 1;

events {}

http {
    server {
        listen 10000;

        location /run {
            content_by_lua_block {
                require("workload").run()
            }
        }
    }
}

blocking-bt.sxx

It tracks the lua_resume (the entry of coroutine) and syscalls.

...
probe process("$^libluajit_path").function("lua_resume").return
{
    if (tid() == target()) {
        tracing = 0
        elapsed = gettimeofday_ms() - start_time

        if (elapsed >= $^arg_min_elapsed :default(200)) {
            printf("--- lua_resume block %d ms\n", elapsed)
            banner = 0
            foreach (bt in bts- limit $^arg_limit :default(3)) {
                cnt = @count(bts[bt])
                if (cnt <= $^arg_min_bt :default(2)) {
                    break
                }
                if (banner == 0) {
                    printf(">>> Hot CPU backtraces\n")
                    banner = 1
                }
                printf("%s\t%d\n", bt, cnt)
            }

            banner = 0
            foreach (lbt in syscall_bts- limit $^arg_syscall_limit :default(2)) {
                total_us = @sum(syscall_bts[lbt])
                if (total_us >= 1000) {
                    if (banner == 0) {
                        printf(">>> Time-consuming syscall backtraces\n")
                        banner = 1
                    }
                    printf("%s\t%d us\n", lbt, total_us)
                }
            }
            printf("\n")
        }

        delete bts
        delete syscall_bts
    }
}

probe kernel.trace("sys_enter") {
    if (tid() == target() && tracing) {
        syscall_start_time = gettimeofday_us()
    }
}

probe kernel.trace("sys_exit") {
    if (tid() == target() && tracing) {
        elapsed = gettimeofday_us() - syscall_start_time
        if (elapsed >= $^arg_syscall_min_elapsed_us :default(400)) {
            syscall_bts[lua_bt()] <<< elapsed
        }
    }
}

Test

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
# run openresty
/usr/local/openresty/nginx/sbin/nginx -p $PWD -c nginx.conf -g "daemon off;"

# run systemtap script
./samples/blocking-bt.sxx -D STP_NO_OVERLOAD --arg time=60 --skip-badvars -x 144081

# trigger the workload
time curl localhost:10000/run
32000
status=200
{"access-control-allow-credentials":"true","connection":"close","content-type":"application\/json","server":"gunicorn\/19.9.0","access-control-allow-origin":"*","date":"Thu, 29 Sep 2022 10:30:42 GMT","content-length":"287"}

real    0m3.090s
user    0m0.006s
sys     0m0.006s

# then you could see the systemtap output
--- lua_resume block 2424 ms
>>> Hot CPU backtraces
__memcpy_avx_unaligned_erms
@./workload.lua:8
@./workload.lua:18
@./workload.lua:26
=content_by_lua(nginx.conf:17):2
        367
ngx_md5_body
@./workload.lua:16
@./workload.lua:26
=content_by_lua(nginx.conf:17):2
        169
propagatemark
@./workload.lua:8
@./workload.lua:18
@./workload.lua:26
=content_by_lua(nginx.conf:17):2
        38
>>> Time-consuming syscall backtraces
0x0
C:meth_connect
@/usr/local/share/lua/5.1/socket/http.lua:130
@/usr/local/share/lua/5.1/socket/http.lua:354
C:protected_
@./workload.lua:28
=content_by_lua(nginx.conf:17):2
        235173 us
0x0
C:meth_receive
@/usr/local/share/lua/5.1/socket/http.lua:160
@/usr/local/share/lua/5.1/socket/http.lua:362
C:protected_
@./workload.lua:28
=content_by_lua(nginx.conf:17):2
        1472811 us

We get the expected blocking source: the blocking time is about 3 seconds. From the backtrace, we know the blocking comes from MD5 calculation, random string generation and connect()/receive() syscalls.

How to slove blocking?

Since the main thread is blocking sensitive, obviously, we should delegate blocking jobs to other threads.

The nginx provides the thread pool, which is originally used for serving static files more efficiently. I wrap them in lua, so that you could use nonblocking API to execute blocking stuff.

I contribute my works to OpenResty official, that is ngx.run_worker_thread() API:

https://github.com/openresty/lua-nginx-module#ngxrun_worker_thread

Each thread in the thread pool would be associated an individual luajit vm. ngx.run_worker_thread() would send the task to one thread to get executed, yield the coroutine, and when the task get done, it would resume the coroutine.

This API sloves blocking headaches for many companys.

However, it’s far from perfect:

  • the lua vm allocation is random, for each execution, you get a brand new environment, so no way to cache resident resources, e.g. keepalive tcp connection. Every time, you need to init everything.
  • the throughput is limited by the number of the thread pool, and context switch overhead can not be ignored.
  • only small parts (string, hashing, shared dict) of ngx APIs could be used in the separate luajit vm

Could we make a perfect way? So that we could both relax the main thread and make use of ecosystem from other mainstream programming languages?

The answer is lua-resty-nonblocking-ffi library. It provides an efficient, generic and noblocking API to do hybrid programming in openresty with mainstream languages (Go, Python, Java, etc.).

Features:

  • simple but extensible interface, supports any C ABI compliant language
  • once and for all, no need to write C/Lua codes to do coupling anymore
  • high performance, 3~4 times faster than unix domain socket way
  • shim for python/java, once and for all
  • any serialization message format you like

This library has been used successfully by several commercial companies for over two years. My customers come from different domains, including AI, IOT, Edge computing.

I would open source this library in near future, please stay tuned.

Conclusion

The blocking is nginx specific issue. Be aware of it, otherwise, you would not benefit high performance from OpenResty.

For the time being, use ngx.run_worker_thread() to offload your blocking tasks. In future, lua-resty-nonblocking-ffi is your best choice.