OpenResty: record the Lua execution time per request in ngx.ctx via LD_PRELOAD

storm_riders

TL;DR

How can I log the Lua execution time of each request (excluding time spent on asynchronous IO) in ngx.ctx in a non-intrusive, zero-cost way?

The Lua execution time is the sum of the following metrics:

  • lua_resume
    • phases
      • REWRITE
      • ACCESS
      • CONTENT
      • BALANCER
  • lua_pcall
    • filters
      • header_filter
      • body_filter

https://github.com/kingluo/ngx-lua-request-time

The issue with measuring request time

Nginx has two variables to record request time:

  • $request_time

request processing time in seconds with a milliseconds resolution (1.3.9, 1.2.6); time elapsed since the first bytes were read from the client

  • $upstream_response_time

keeps time spent on receiving the response from the upstream server; the time is kept in seconds with millisecond resolution. Times of several responses are separated by commas and colons like addresses in the $upstream_addr variable.

Nginx uses a multi-process, single-threaded model. The main thread of each worker process must do everything asynchronously; otherwise, performance suffers.

In OpenResty, our business logic is built mainly in Lua code. If the code makes full use of the async APIs, e.g. cosocket and run_worker_thread, then it meets the requirements of high-performance processing. In reality, however, the code may be huge and complex, the downstream/upstream network sometimes misbehaves, and the downstream client or the upstream server may stall on socket reads/writes.

So how do we distinguish how much time is spent in Lua code from how much is spent handling network reads/writes? And how do we tell whether the Lua code is blocking?

Could we use $request_time - $upstream_response_time? Unfortunately, the answer is no!

Let’s first take a look at Nginx’s request life cycle:

[Figure: openresty_request_flow]

The most common misunderstanding about nginx is the assumption that it reads the whole response body from the upstream before sending it to the downstream (and likewise for the request body), so that $upstream_response_time would be the pure upstream traffic cost. In fact, it depends on the configuration. By default, the request and response bodies are piped between the downstream and the upstream dynamically and incrementally.

Moreover, nginx does not wait for the request body from the downstream before running the subsequent phases, e.g. the rewrite phase. Instead, once the headers are read, it triggers the phase handlers for the request, and only after the balancer phase has chosen an upstream node does it start to read the request body and forward it to the upstream gradually.

Also note that after reading the request headers, Nginx does not read the request body until the content phase runs, unless reading is explicitly triggered in the Lua code.

So let’s clarify the time-cost relationship again:

proxy_request_buffering | proxy_buffering | $upstream_response_time                                  | $request_time - $upstream_response_time
on                      | on              | upstream_time_cost                                       | downstream_time_cost + lua_time_cost
off                     | on              | upstream_time_cost + downstream_request_body_time_cost   | downstream_response_body_time_cost + lua_time_cost
on                      | off             | upstream_time_cost + downstream_response_body_time_cost  | downstream_request_body_time_cost + lua_time_cost
off                     | off             | upstream_time_cost + downstream_time_cost                | lua_time_cost

You can see that the meaning of $upstream_response_time changes depending on the proxy_request_buffering and proxy_buffering settings. For example, with both buffering directives off, a slow client that takes 10 seconds to download the response inflates $upstream_response_time by those 10 seconds even if the upstream itself responded within milliseconds (exactly what the demo below shows).

So the motivation is: how can we measure the Lua time cost without being affected by these directive settings?

Important notes:

  1. Interestingly, you may find that $request_time is smaller than $upstream_response_time! How come?

Here is an example:

127.0.0.1 - - [15/Apr/2023:20:05:32 +0800] 127.0.0.1:9080 "GET /get HTTP/1.1" 200 463 0.001 "-" "curl/7.68.0" 172.30.74.11:80 200 0.002 "http://127.0.0.1:9080"

$request_time is 0.001, while $upstream_response_time is 0.002!

This is caused by the two metrics using different timing sources.

$upstream_response_time is calculated with clock_gettime(CLOCK_MONOTONIC_COARSE), which by default can lag up to 4 milliseconds behind, whereas $request_time is calculated with gettimeofday(). So $upstream_response_time may end up larger than $request_time. (A small C experiment after these notes illustrates the difference.)

  2. $request_time is generated in real time, so different values may be obtained at different points in the request.
  3. The access log is written after the Lua log phase handler completes, so there should be no blocking tasks in that Lua code; otherwise the $request_time in the access log will be larger than the actual value.
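
To see the difference between the two clocks yourself, here is a small standalone C experiment (not from the original post; it only uses standard Linux interfaces). CLOCK_MONOTONIC_COARSE advances at the kernel tick granularity, typically 1-4 ms, while gettimeofday() is much finer:

#include <stdio.h>
#include <sys/time.h>
#include <time.h>

int main(void) {
    struct timespec res;

    /* resolution of the coarse clock: usually one kernel tick, i.e. 1-4 ms */
    clock_getres(CLOCK_MONOTONIC_COARSE, &res);
    printf("CLOCK_MONOTONIC_COARSE resolution: %ld ns\n", res.tv_nsec);

    /* sample both clocks back to back: the coarse one advances in steps, so
     * short intervals measured with it may disagree with gettimeofday() */
    for (int i = 0; i < 5; i++) {
        struct timespec coarse;
        struct timeval tv;

        clock_gettime(CLOCK_MONOTONIC_COARSE, &coarse);
        gettimeofday(&tv, NULL);
        printf("coarse=%ld.%09ld  gettimeofday=%ld.%06ld\n",
               (long) coarse.tv_sec, coarse.tv_nsec,
               (long) tv.tv_sec, (long) tv.tv_usec);
    }

    return 0;
}

Compile it with cc and run it; on a typical HZ=250 kernel the first line reports 4000000 ns, matching the "up to 4 milliseconds" above.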

proxy_buffering on (by default)

In most use cases, the upstream is on the local network, so it is faster than the downstream. We therefore want to buffer as much as possible, so that the upstream can release its resources early and handle the next request.

Nginx tries to read as much of the response body from the upstream as possible into memory buffers (or even disk files), while in parallel writing to the downstream by draining those buffers.

If the downstream is too slow to keep up with the upstream, nginx closes the upstream connection first. In this case, $upstream_response_time covers the upstream traffic plus only part of the downstream traffic.

We could set up a demo to visualize this process.

mock_get.go

Send a GET request, but never read the response.

package main

import (
	"net"
)

func main() {
	conn, err := net.Dial("tcp", "localhost:9080")
	if err != nil {
		panic(err)
	}

	request := "GET /anything/a HTTP/1.1\r\nHost: httpbin.org\r\n\r\n"

	_, err = conn.Write([]byte(request))
	if err != nil {
		panic(err)
	}

	ch := make(chan struct{})
	<-ch
}

mock_get_big_response.go

A mock upstream server that sends a big response (4 MB body, large enough to exceed the sum of the nginx buffers and the kernel socket buffers).

package main

import (
	"fmt"
	"net"
	"os"
)

const (
	CONN_HOST = "localhost"
	CONN_PORT = "3333"
	CONN_TYPE = "tcp"
)

func main() {
	l, err := net.Listen(CONN_TYPE, CONN_HOST+":"+CONN_PORT)
	if err != nil {
		fmt.Println("Error listening:", err.Error())
		os.Exit(1)
	}
	defer l.Close()
	fmt.Println("Listening on " + CONN_HOST + ":" + CONN_PORT)
	for {
		conn, err := l.Accept()
		if err != nil {
			fmt.Println("Error accepting: ", err.Error())
			os.Exit(1)
		}
		go handleRequest(conn)
	}
}

// Handles incoming request
func handleRequest(conn net.Conn) {
	buf := make([]byte, 1024*1024*4)
	_, err := conn.Read(buf)
	if err != nil {
		fmt.Println("Error reading:", err.Error())
	}
	for i := range buf {
		buf[i] = 'A'
	}
	headers := "HTTP/1.1 200 OK\r\nContent-Length: 4194304\r\n\r\n"
	conn.Write([]byte(headers))
	conn.Write(buf)
	conn.Close()
}

nginx.conf

daemon off;
error_log /dev/stderr info;

events {}

http {
    upstream backend {
        server 127.0.0.1:3333;
        keepalive 320;
        keepalive_requests 1000;
        keepalive_timeout 60s;
    }

    log_format main escape=default '$remote_addr - $remote_user [$time_local] $http_host "$request" $status $body_bytes_sent $request_time "$http_referer" "$http_user_agent" $upstream_addr $upstream_status $upstream_response_time';

    access_log logs/access.log main;

    send_timeout 10s;

    server {
        listen 20000;
        location / {
            proxy_pass http://backend;
            proxy_http_version 1.1;
            proxy_set_header Connection "";
        }
    }
}
# in first terminal
go run mock_get_big_response.go

# in second terminal
nginx -p $PWD -c nginx.conf

# in third terminal
go run mock_get.go

Watch logs/access.log:

127.0.0.1 - - [20/Apr/2023:18:39:13 +0800] httpbin.org "GET /anything/a HTTP/1.1" 200 2555860 10.016 "-" "-" 127.0.0.1:3333 200 0.015

You can see that $upstream_response_time is 0.015 secs, while $request_time is 10.016 secs. The 10 seconds come from send_timeout: the mock client never reads the response, so nginx aborts the downstream write when the 10-second send_timeout fires.

TCP flow between the upstream and nginx:

[Figure: proxy_buffering_request_flow]

$upstream_response_time covers only the time from step 1 to step 2.

Check the nginx source code:

static void
ngx_http_upstream_process_request(ngx_http_request_t *r,
    ngx_http_upstream_t *u)
{
...
        //The upstream body reading is finished.
        //Close the upstream request
        if (p->upstream_done || p->upstream_eof || p->upstream_error) {
            ngx_log_debug1(NGX_LOG_DEBUG_HTTP, r->connection->log, 0,
                           "http upstream exit: %p", p->out);

            if (p->upstream_done
                || (p->upstream_eof && p->length == -1))
            {
                ngx_http_upstream_finalize_request(r, u, 0);
                return;
            }
...
        }
...
}

static void
ngx_http_upstream_finalize_request(ngx_http_request_t *r,
    ngx_http_upstream_t *u, ngx_int_t rc)
{
...
    //Close the associated downstream request
    ngx_http_finalize_request(r, rc);
}

void
ngx_http_finalize_request(ngx_http_request_t *r, ngx_int_t rc)
{
...
    //If the downstream writing is still blocking,
    //set up a postponed handler to continue writing.
    if (r->buffered || c->buffered || r->postponed) {

        if (ngx_http_set_write_handler(r) != NGX_OK) {
            ngx_http_terminate_request(r, 0);
        }

        return;
    }
...
}

proxy_buffering off

Modify nginx.conf:

...
        location / {
            proxy_pass http://backend;
            proxy_http_version 1.1;
            proxy_set_header Connection "";
            proxy_buffering off;
        }
...

Watch logs/access.log:

127.0.0.1 - - [20/Apr/2023:18:48:46 +0800] httpbin.org "GET /anything/a HTTP/1.1" 200 4157440 10.022 "-" "-" 127.0.0.1:3333 200 10.021

You can see that $upstream_response_time is 10.021 secs and $request_time is 10.022 secs, which are almost the same. Now $upstream_response_time covers the downstream traffic as well, i.e. it includes step 3.

Solution

We need a metric that captures the execution time of the Lua code alone, excluding all other factors, so that we can verify our business logic is fully async.

We have a few solutions:

  • systemtap
    • heavyweight: you need to install debug symbols, which is sometimes not feasible, e.g. a custom kernel without symbols
    • non-negligible overhead, at least a 5% performance impact
    • not container-friendly
    • cannot write to userspace, i.e. cannot set a Lua variable
  • record and calculate in Lua code manually
    • it is hard and tricky to handle Lua errors and coroutine yields
    • code pollution

Could we have a non-intrusive and zero-cost way to record the Lua execution time?

Maybe LD_PRELOAD is a better choice.

https://github.com/kingluo/ngx-lua-request-time
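
For readers who have not used LD_PRELOAD before, here is a minimal, self-contained sketch of how symbol interposition works in general. It wraps libc's sleep() purely as an illustration; it is not part of ngx-lua-request-time:

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

/* Because the preloaded .so is searched before libc, this definition of
 * sleep() shadows the real one; dlsym(RTLD_NEXT, ...) returns the next
 * definition in search order, i.e. libc's, so we can still delegate to it. */
unsigned int sleep(unsigned int seconds) {
    static unsigned int (*real_sleep)(unsigned int);

    if (real_sleep == NULL) {
        real_sleep = (unsigned int (*)(unsigned int)) dlsym(RTLD_NEXT, "sleep");
    }

    fprintf(stderr, "sleep(%u) intercepted\n", seconds);
    return real_sleep(seconds);
}

Build it with cc -shared -fPIC -o hook.so hook.c -ldl and load it into any dynamically linked program via LD_PRELOAD=./hook.so; every call that reaches libc's sleep() through the dynamic linker gets intercepted. ngx-lua-request-time applies the same trick to lua_resume() and lua_pcall().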

demo

Use libc's sleep function to block for 1 second, and log ngx.ctx.openresty_request_time_us to confirm that the blocked second is counted.

nginx.conf

server {
    error_log /dev/stderr info;

    listen 20000;

    location / {
        access_by_lua_block {
            local ffi = require("ffi")
            ffi.cdef[[unsigned int sleep(unsigned int seconds);]]
            ffi.C.sleep(1);
        }
        proxy_pass https://httpbin.org;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }

    log_by_lua_block {
        ngx.log(ngx.WARN, "ngx.ctx.openresty_request_time_us: ", ngx.ctx.openresty_request_time_us)
    }
}

Test:

NGX_LUA_REQUEST_TIME_VAR_NAME=openresty_request_time_us \
LD_PRELOAD=/opt/ngx-lua-request-time/ngx_lua_request_time.so \
openresty -p $PWD -c nginx.conf

curl http://localhost:20000/get

Output:

2023/04/16 09:59:43 [warn] 2868937#2868937: *3 [lua] log_by_lua(server.conf:17):2: ngx.ctx.openresty_request_time_us: 1000120 while logging request, client: 127.0.0.1, server: , request: "GET /get HTTP/1.1", upstream: "https://34.193.132.77:443/get", host: "localhost:20000"

LD_PRELOAD can also be enabled globally via /etc/ld.so.preload, so the approach stays non-intrusive for Docker usage as well.

ld.so.preload

/opt/ngx-lua-request-time/ngx_lua_request_time.so
docker run --rm --name openresty \
-v $PWD/server.conf:/opt/bitnami/openresty/nginx/conf/server_blocks/my_server_block.conf:ro \
-v $PWD/../../ld.so.preload:/etc/ld.so.preload:ro \
-v $PWD/../../ngx_lua_request_time.so:/opt/ngx-lua-request-time/ngx_lua_request_time.so:ro \
-e NGX_LUA_REQUEST_TIME_VAR_NAME=openresty_request_time_us \
-p 20000:20000 \
bitnami/openresty:latest

design

Note that hooking a function via LD_PRELOAD only works when the following conditions are satisfied:

  • The hooked C symbol is public and its callers resolve it through dynamic linking; if other code in the same ELF file reaches it via a fixed offset instead of the symbol table, the hook is bypassed even though the symbol is public.
  • The function's input arguments and return value contain enough information for your analysis.
  • It must be possible to filter out unneeded contexts cheaply, because a low-level function is used by many high-level code paths and some of them are noise.
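
The code below references lua_resume_f, ngx_http_lua_get_request_f, and similar pointers without showing where they come from. Here is a minimal sketch of how the preloaded .so could resolve them; this is an assumption about the mechanism rather than the library's verbatim source, and it relies on LuaJIT being linked as a shared library (so RTLD_NEXT yields the real lua_resume) and on nginx exporting its dynamic symbols, which OpenResty does so that LuaJIT FFI can find the lua-nginx-module helpers:

#define _GNU_SOURCE
#include <dlfcn.h>

typedef struct lua_State lua_State;                     /* opaque, from LuaJIT */
typedef struct ngx_http_request_s ngx_http_request_t;   /* opaque, from nginx  */

static int (*lua_resume_f)(lua_State *L, int narg);
static ngx_http_request_t *(*ngx_http_lua_get_request_f)(lua_State *L);
static int (*ngx_http_lua_ffi_get_phase_f)(ngx_http_request_t *r, char **err);

__attribute__((constructor))
static void resolve_symbols(void) {
    /* the next lua_resume after this .so in search order, i.e. LuaJIT's */
    lua_resume_f = (int (*)(lua_State *, int)) dlsym(RTLD_NEXT, "lua_resume");

    /* public lua-nginx-module symbols, looked up across the whole process */
    ngx_http_lua_get_request_f = (ngx_http_request_t *(*)(lua_State *))
        dlsym(RTLD_DEFAULT, "ngx_http_lua_get_request");
    ngx_http_lua_ffi_get_phase_f = (int (*)(ngx_http_request_t *, char **))
        dlsym(RTLD_DEFAULT, "ngx_http_lua_ffi_get_phase");
}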

OpenResty lets you inject handlers into different phases of the request life cycle. All async phases (i.e. phases that run your Lua code in a coroutine, which can yield for async IO and be resumed later) go through the lua_resume() function, and OpenResty flattens recursive coroutine resumes: when you create a new coroutine in an async phase handler, e.g. to access an LDAP server via cosocket, it exits the entry coroutine (the handler itself) and resumes the new coroutine from there. So the durations measured inside lua_resume() never overlap with each other.

Note that it excludes the time spent on async IO! It only contains the pure Lua code execution time.

int lua_resume(lua_State *L, int narg) {
    if (!is_enabled()) {
        return lua_resume_f(L, narg);
    }

    ngx_http_request_t *r = ngx_http_lua_get_request_f(L);
    if (!r) {
        return lua_resume_f(L, narg);
    }

    char* err;
    int phase = ngx_http_lua_ffi_get_phase_f(r, &err);
    if (phase == NGX_ERROR ||
		(phase != PHASE_SET &&
			phase != PHASE_REWRITE &&
			phase != PHASE_ACCESS &&
			phase != PHASE_CONTENT)) {
        return lua_resume_f(L, narg);
    }

    struct timespec tv1;
    clock_gettime(CLOCK_MONOTONIC, &tv1);

    int ret = lua_resume_f(L, narg);

    struct timespec tv2;
    clock_gettime(CLOCK_MONOTONIC, &tv2);

    record_time(&tv1, &tv2, L);

    return ret;
}
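
record_time() itself is not shown above. Roughly, it has to convert the two timestamps into an elapsed duration and add it to a per-request total; the sketch below shows only that arithmetic. accumulate_for_request() is a hypothetical helper, and it is the actual library that exposes the final sum in ngx.ctx under the name configured by NGX_LUA_REQUEST_TIME_VAR_NAME:

static void record_time(struct timespec *tv1, struct timespec *tv2, lua_State *L) {
    /* elapsed wall-clock time of this resume, in microseconds */
    long us = (tv2->tv_sec - tv1->tv_sec) * 1000000L
            + (tv2->tv_nsec - tv1->tv_nsec) / 1000L;

    ngx_http_request_t *r = ngx_http_lua_get_request_f(L);
    if (r != NULL) {
        /* hypothetical: keep a running sum keyed by the request, to be
         * written into ngx.ctx before the log phase runs */
        accumulate_for_request(r, us);
    }
}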

All sync phases, e.g. the balancer phase and the header/body filter phases, use lua_pcall() instead. But lua_pcall() is a very low-level function, so to minimize the overhead you need a cheap way to filter out the noise.

int lua_pcall(lua_State *L, int nargs, int nresults, int errfunc) {
    if (!is_enabled()) {
        return lua_pcall_f(L, nargs, nresults, errfunc);
    }

    // all phases use ngx_http_lua_traceback as the pcall error handler,
    // so that's a great point used to filter out the other `lua_pcall` invocations.
    // And, this checking is cheap!
    if (lua_tocfunction_f(L, 1) != ngx_http_lua_traceback_f) {
        return lua_pcall_f(L, nargs, nresults, errfunc);
    }
    ...
}
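
The elided body presumably mirrors the lua_resume hook above: sample the clock around the real call and record the delta. A sketch of that remainder (an assumption, not the library's verbatim code):

    // sketch of the elided part, following the same pattern as lua_resume
    struct timespec tv1;
    clock_gettime(CLOCK_MONOTONIC, &tv1);

    int ret = lua_pcall_f(L, nargs, nresults, errfunc);

    struct timespec tv2;
    clock_gettime(CLOCK_MONOTONIC, &tv2);

    record_time(&tv1, &tv2, L);

    return ret;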

Testing shows that the CPU overhead is small enough to be ignored.