OpenResty: record the lua execution time per request in ngx.ctx via LD_PRELOAD

storm_riders

TL;DR

How do we record the lua execution time (excluding the time spent on async IO) per request in ngx.ctx, in a non-intrusive and zero-cost way?

The lua execution time is the sum of the following metrics:

  • lua_resume
    • phases
      • REWRITE
      • ACCESS
      • CONTENT
      • BALANCER
  • lua_pcall
    • filters
      • header_filter
      • body_filter

https://github.com/kingluo/ngx-lua-request-time

The issue with measuring request time

Nginx has two variables to record request time:

  • $request_time

request processing time in seconds with a milliseconds resolution (1.3.9, 1.2.6); time elapsed since the first bytes were read from the client

  • $upstream_response_time

keeps time spent on receiving the response from the upstream server; the time is kept in seconds with millisecond resolution. Times of several responses are separated by commas and colons like addresses in the $upstream_addr variable.

Nginx uses a multi-process, single-threaded model. The main thread of each worker process must do all its processing asynchronously, otherwise performance suffers.

In OpenResty, our business logic is written mainly in lua code. If the code makes full use of the async APIs, e.g. cosocket and run_worker_thread, then it meets the requirements of high-performance processing. In reality, however, the code may be huge and complex, the downstream/upstream network sometimes goes wrong, and the downstream client or the upstream server may stall on socket reads/writes.

So how do we distinguish how much time is spent in lua code from how much is spent handling network r/w, and tell whether the lua code is blocking? Could we simply use $request_time - $upstream_response_time? Unfortunately, the answer is no!

Let’s have a look at the request flow of nginx first:

openresty_request_flow

The most common misunderstanding of nginx is thinking that it reads the whole response body from the upstream before sending it to the downstream (and likewise for the request body), so that $upstream_response_time would be the pure upstream traffic time-cost. That is completely wrong! In fact, traffic is piped between upstream and downstream dynamically and incrementally.

Moreover, nginx will not read the request body from the downstream by default. Instead, when the headers are read, it triggers the phase handlers to handle this request, and after the balancer phase chooses an upstream node, it starts to read the request body and send it to the upstream gradually.

So let’s clarify those variables again:

  1. $request_time is the whole request duration, including the traffic time-cost between nginx and the upstream/downstream. It changes each time you read it, because the duration is calculated from the current (cached) timestamp at the moment you read it. It is actually the time-cost from step 1 to step 6.
  2. $upstream_response_time is not the pure time-cost between nginx and the upstream; it is actually the time-cost from step 2 to step 6.

If proxy buffering is enabled, it is even more complicated. The upstream may finish early, before the slow downstream is done. Either way, $upstream_response_time contains time-cost where upstream and downstream traffic overlap, which may be part of the downstream traffic or all of it.

proxy_buffering on (by default)

In most use cases, the upstream is in the local network, so its link is faster than the downstream one. We therefore want to buffer as much as possible so that the upstream can free its resources and handle the next request.

Nginx will try to read the response body from the upstream as much as possible into the memory buffers or even disk files, in parallel to writing to the downstream by draining the buffers.

If the downstream is too slow to keep up with the upstream, nginx will close the upstream connection first. In this case, $upstream_response_time covers the upstream traffic and only part of the downstream traffic.

We can set up a demo to visualize this process.

mock_get.go

Send a GET request, but never read the response.

package main

import (
	"net"
)

func main() {
	conn, err := net.Dial("tcp", "localhost:9080")
	if err != nil {
		panic(err)
	}

	request := "GET /anything/a HTTP/1.1\r\nHost: httpbin.org\r\n\r\n"

	_, err = conn.Write([]byte(request))
	if err != nil {
		panic(err)
	}

	ch := make(chan struct{})
	<-ch
}

mock_get_big_respnose.go

A mock upstream that sends a big response (4MB body, large enough to exceed the sum of the nginx buffers and the kernel socket buffers).

package main

import (
	"fmt"
	"net"
	"os"
)

const (
	CONN_HOST = "localhost"
	CONN_PORT = "3333"
	CONN_TYPE = "tcp"
)

func main() {
	l, err := net.Listen(CONN_TYPE, CONN_HOST+":"+CONN_PORT)
	if err != nil {
		fmt.Println("Error listening:", err.Error())
		os.Exit(1)
	}
	defer l.Close()
	fmt.Println("Listening on " + CONN_HOST + ":" + CONN_PORT)
	for {
		conn, err := l.Accept()
		if err != nil {
			fmt.Println("Error accepting: ", err.Error())
			os.Exit(1)
		}
		go handleRequest(conn)
	}
}

// Handles incoming request
func handleRequest(conn net.Conn) {
	buf := make([]byte, 1024*1024*4)
	_, err := conn.Read(buf)
	if err != nil {
		fmt.Println("Error reading:", err.Error())
	}
	for i := range buf {
		buf[i] = 'A'
	}
	headers := "HTTP/1.1 200 OK\r\nContent-Length: 4194304\r\n\r\n"
	conn.Write([]byte(headers))
	conn.Write(buf)
	conn.Close()
}

nginx.conf

daemon off;
error_log /dev/stderr info;

events {}

http {
    upstream backend {
        server 127.0.0.1:3333;
        keepalive 320;
        keepalive_requests 1000;
        keepalive_timeout 60s;
    }

    log_format main escape=default '$remote_addr - $remote_user [$time_local] $http_host "$request" $status $body_bytes_sent $request_time "$http_referer" "$http_user_agent" $upstream_addr $upstream_status $upstream_response_time';

    access_log logs/access.log main;

    send_timeout 10s;

    server {
        listen 20000;
        location / {
            proxy_pass http://backend;
            proxy_http_version 1.1;
            proxy_set_header Connection "";
        }
    }
}
# in first terminal
go run mock_get_big_respnose.go

# in second terminal
nginx -p $PWD -c nginx.conf

# in third terminal
go run mock_get.go

Watch logs/access.log:

127.0.0.1 - - [20/Apr/2023:18:39:13 +0800] httpbin.org "GET /anything/a HTTP/1.1" 200 2555860 10.016 "-" "-" 127.0.0.1:3333 200 0.015

You can see that $upstream_response_time is 0.015 seconds, while $request_time is 10.016 seconds.

TCP flow between the upstream and nginx:

proxy_buffering_request_flow

$upstream_response_time covers the time-cost from step 1 to step 2 only.

Check the nginx source code:

static void
ngx_http_upstream_process_request(ngx_http_request_t *r,
    ngx_http_upstream_t *u)
{
...
        // the upstream body reading is finished.
        // close the upstream request
        if (p->upstream_done || p->upstream_eof || p->upstream_error) {
            ngx_log_debug1(NGX_LOG_DEBUG_HTTP, r->connection->log, 0,
                           "http upstream exit: %p", p->out);

            if (p->upstream_done
                || (p->upstream_eof && p->length == -1))
            {
                ngx_http_upstream_finalize_request(r, u, 0);
                return;
            }
...
        }
...
}

static void
ngx_http_upstream_finalize_request(ngx_http_request_t *r,
    ngx_http_upstream_t *u, ngx_int_t rc)
{
...
    // close the associated downstream request
    ngx_http_finalize_request(r, rc);
}

void
ngx_http_finalize_request(ngx_http_request_t *r, ngx_int_t rc)
{
...
    // if the downstream writing is still blocking,
    // setup a postponed handler to continue writing.
    if (r->buffered || c->buffered || r->postponed) {

        if (ngx_http_set_write_handler(r) != NGX_OK) {
            ngx_http_terminate_request(r, 0);
        }

        return;
    }
...
}

proxy_buffering off

Modify nginx.conf:

...
        location / {
            proxy_pass http://backend;
            proxy_http_version 1.1;
            proxy_set_header Connection "";
            proxy_buffering off;
        }
...

Watch logs/access.log:

127.0.0.1 - - [20/Apr/2023:18:48:46 +0800] httpbin.org "GET /anything/a HTTP/1.1" 200 4157440 10.022 "-" "-" 127.0.0.1:3333 200 10.021

You can see that $upstream_response_time is 10.021 seconds and $request_time is 10.022 seconds, which are almost the same. Now $upstream_response_time fully covers the downstream traffic, i.e. it includes step 3.

Solution

We need a metric that measures the execution time spent purely in lua code, excluding other factors, so that we can verify that our business logic is fully async.

We have a few solutions:

  • systemtap
    • heavyweight: you need to install debug symbols, which is sometimes not feasible, e.g. no symbols for a custom kernel
    • non-negligible overhead, at least a 5% performance impact
    • not container friendly
    • cannot write to userspace, i.e. cannot set a lua variable
  • record and calculate in lua code manually
    • it is hard and tricky to handle lua errors and coroutine yields
    • code pollution

Could we have a non-intrusive and zero-cost way to record the lua execution time?

Maybe LD_PRELOAD is a better choice.

https://github.com/kingluo/ngx-lua-request-time

demo

Use the sleep function from libc to block for 1 second, and log ngx.ctx.openresty_request_time_us to confirm that the recorded lua time reflects that 1-second block.

nginx.conf

server {
    error_log /dev/stderr info;

    listen 20000;

    location / {
        access_by_lua_block {
            local ffi = require("ffi")
            ffi.cdef[[unsigned int sleep(unsigned int seconds);]]
            ffi.C.sleep(1);
        }
        proxy_pass https://httpbin.org;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }

    log_by_lua_block {
        ngx.log(ngx.WARN, "ngx.ctx.openresty_request_time_us: ", ngx.ctx.openresty_request_time_us)
    }
}

Test:

NGX_LUA_REQUEST_TIME_VAR_NAME=openresty_request_time_us \
LD_PRELOAD=/opt/ngx-lua-request-time/ngx_lua_request_time.so \
openresty -p $PWD -c nginx.conf

curl http://localhost:20000/get

Output:

2023/04/16 09:59:43 [warn] 2868937#2868937: *3 [lua] log_by_lua(server.conf:17):2: ngx.ctx.openresty_request_time_us: 1000120 while logging request, client: 127.0.0.1, server: , request: "GET /get HTTP/1.1", upstream: "https://34.193.132.77:443/get", host: "localhost:20000"

LD_PRELOAD can also be enabled globally via /etc/ld.so.preload, so it is also non-intrusive for docker usage.

ld.so.preload

/opt/ngx-lua-request-time/ngx_lua_request_time.so
docker run --rm --name openresty \
-v $PWD/server.conf:/opt/bitnami/openresty/nginx/conf/server_blocks/my_server_block.conf:ro \
-v $PWD/../../ld.so.preload:/etc/ld.so.preload:ro \
-v $PWD/../../ngx_lua_request_time.so:/opt/ngx-lua-request-time/ngx_lua_request_time.so:ro \
-e NGX_LUA_REQUEST_TIME_VAR_NAME=openresty_request_time_us \
-p 20000:20000 \
bitnami/openresty:latest

design

It’s worth noting that to hook a function via LD_PRELOAD, the following conditions need to be satisfied (a minimal interposer sketch follows the list):

  • the hooked C symbol is public, and its callers do not reach it via a fixed offset; calls from within the same ELF file may locate the symbol by offset even though it is public, and such calls cannot be intercepted
  • the function's input arguments and return value contain enough information for your analysis
  • it is possible to filter out unneeded contexts cheaply, because a low-level function is used by many high-level code paths, and some of them are noise
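
For illustration, here is a minimal, self-contained interposer in the same spirit (it is not part of the repo; the file name and the choice of hooking sleep() are just for the example). It overrides a public libc symbol, resolves the real implementation with dlsym(RTLD_NEXT, ...), and reports how long the call blocked:

// hook_sleep.c -- minimal LD_PRELOAD interposer sketch (illustrative only).
// Build: gcc -shared -fPIC -o hook_sleep.so hook_sleep.c -ldl
// Use:   LD_PRELOAD=$PWD/hook_sleep.so some_program
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

typedef unsigned int (*sleep_fn)(unsigned int);

unsigned int sleep(unsigned int seconds) {
    // Resolve the next (real) definition of sleep in the link map, once.
    static sleep_fn real_sleep;
    if (!real_sleep) {
        real_sleep = (sleep_fn) dlsym(RTLD_NEXT, "sleep");
    }

    struct timespec t1, t2;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    unsigned int ret = real_sleep(seconds);

    clock_gettime(CLOCK_MONOTONIC, &t2);
    long us = (t2.tv_sec - t1.tv_sec) * 1000000L + (t2.tv_nsec - t1.tv_nsec) / 1000;
    fprintf(stderr, "sleep(%u) blocked for %ld us\n", seconds, us);
    return ret;
}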

OpenResty lets you inject handlers into different phases of the request life cycle. All async phases (that is, phases that run your lua code in a coroutine which can yield for async IO and be resumed later) go through the lua_resume() function, and all nested coroutine resumes are flattened by OpenResty: for example, when you create a new coroutine in an async phase handler to access an LDAP server via cosocket, OpenResty exits the entry coroutine (the handler itself) and resumes the new coroutine from the top level. So the durations measured inside lua_resume() never overlap with each other.

Note that it excludes the time spent on async IO! It only contains the pure lua code execution time.

int lua_resume(lua_State *L, int narg) {
    if (!is_enabled()) {
        return lua_resume_f(L, narg);
    }

    ngx_http_request_t *r = ngx_http_lua_get_request_f(L);
    if (!r) {
        return lua_resume_f(L, narg);
    }

    char* err;
    int phase = ngx_http_lua_ffi_get_phase_f(r, &err);
    if (phase == NGX_ERROR ||
        (phase != PHASE_SET &&
         phase != PHASE_REWRITE &&
         phase != PHASE_ACCESS &&
         phase != PHASE_CONTENT)) {
        return lua_resume_f(L, narg);
    }

    struct timespec tv1;
    clock_gettime(CLOCK_MONOTONIC, &tv1);

    int ret = lua_resume_f(L, narg);

    struct timespec tv2;
    clock_gettime(CLOCK_MONOTONIC, &tv2);

    record_time(&tv1, &tv2, L);

    return ret;
}
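
record_time() is not shown above; its core is just a monotonic-clock delta. A plausible helper is sketched below (the name elapsed_us is purely illustrative). How the accumulated total ends up in ngx.ctx under the name given by NGX_LUA_REQUEST_TIME_VAR_NAME is repo-specific and not reproduced here.

#include <stdint.h>
#include <time.h>

// Illustrative helper: microsecond delta between two CLOCK_MONOTONIC readings.
// A record_time() implementation would add this delta to a per-request total
// that is later exposed as ngx.ctx.<NGX_LUA_REQUEST_TIME_VAR_NAME>.
static int64_t elapsed_us(const struct timespec *t1, const struct timespec *t2) {
    return (int64_t) (t2->tv_sec - t1->tv_sec) * 1000000
           + (t2->tv_nsec - t1->tv_nsec) / 1000;
}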

All sync phases, e.g. the balancer phase and the header/body filter phases, use lua_pcall(). But lua_pcall() is a very low-level function, so to minimize the overhead you need a cheap way to filter out the noise.

int lua_pcall(lua_State *L, int nargs, int nresults, int errfunc) {
    if (!is_enabled()) {
        return lua_pcall_f(L, nargs, nresults, errfunc);
    }

    // All these phases use ngx_http_lua_traceback as the pcall error handler,
    // so it is a good discriminator to filter out other `lua_pcall` invocations.
    // And this check is cheap!
    if (lua_tocfunction_f(L, 1) != ngx_http_lua_traceback_f) {
        return lua_pcall_f(L, nargs, nresults, errfunc);
    }
    ...
}

In our testing, the CPU overhead is small enough to be ignored.
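
The *_f pointers used in the hooks (lua_resume_f, lua_pcall_f, ngx_http_lua_get_request_f, etc.) have to be resolved before the first request. One plausible way, sketched below, is a library constructor that looks up the real LuaJIT entry points with dlsym(RTLD_NEXT) and the lua-nginx-module symbols with dlsym(RTLD_DEFAULT), and enables the hooks only when NGX_LUA_REQUEST_TIME_VAR_NAME is set. Whether the repo wires things up exactly this way is an assumption; the lookup of the ngx_http_lua_* names assumes the nginx binary exports them in its dynamic symbol table (OpenResty links nginx so that lua-resty-core can resolve the ngx_http_lua_ffi_* symbols via FFI).

// Hypothetical initialization sketch for the preloaded library (not verbatim repo
// code). Concrete nginx/Lua types are replaced with void* to keep it self-contained.
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdlib.h>

typedef int (*lua_cfn)(void *L);                          /* lua_CFunction */
typedef int (*lua_resume_fn)(void *L, int narg);
typedef int (*lua_pcall_fn)(void *L, int nargs, int nresults, int errfunc);
typedef lua_cfn (*lua_tocfunction_fn)(void *L, int idx);
typedef void *(*get_request_fn)(void *L);                 /* returns ngx_http_request_t* */
typedef int (*get_phase_fn)(void *r, char **err);

static lua_resume_fn      lua_resume_f;
static lua_pcall_fn       lua_pcall_f;
static lua_tocfunction_fn lua_tocfunction_f;
static get_request_fn     ngx_http_lua_get_request_f;
static get_phase_fn       ngx_http_lua_ffi_get_phase_f;
static lua_cfn            ngx_http_lua_traceback_f;
static int                enabled;

static int is_enabled(void) { return enabled; }

__attribute__((constructor))
static void init(void) {
    // Hook only when the target ngx.ctx variable name is configured.
    enabled = getenv("NGX_LUA_REQUEST_TIME_VAR_NAME") != NULL;

    // The real LuaJIT entry points come from the next object in the link map
    // (libluajit), so RTLD_NEXT skips our own overriding definitions.
    lua_resume_f      = (lua_resume_fn) dlsym(RTLD_NEXT, "lua_resume");
    lua_pcall_f       = (lua_pcall_fn) dlsym(RTLD_NEXT, "lua_pcall");
    lua_tocfunction_f = (lua_tocfunction_fn) dlsym(RTLD_NEXT, "lua_tocfunction");

    // lua-nginx-module symbols are looked up in the running process; this assumes
    // the nginx binary exports them in its dynamic symbol table.
    ngx_http_lua_get_request_f   = (get_request_fn) dlsym(RTLD_DEFAULT, "ngx_http_lua_get_request");
    ngx_http_lua_ffi_get_phase_f = (get_phase_fn) dlsym(RTLD_DEFAULT, "ngx_http_lua_ffi_get_phase");
    ngx_http_lua_traceback_f     = (lua_cfn) dlsym(RTLD_DEFAULT, "ngx_http_lua_traceback");
}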