TL;DR

How can I log the Lua execution time of each request (excluding time spent on asynchronous IO) in ngx.ctx, in a non-intrusive, zero-cost way?

The Lua execution time is the sum of the following metrics:

- lua_resume
  - phases
    - REWRITE
    - ACCESS
    - CONTENT
    - BALANCER
- lua_pcall
  - filters
    - header_filter
    - body_filter

https://github.com/kingluo/ngx-lua-request-time
The issue with measuring request time
Nginx has two variables to record request time:

- $request_time: request processing time in seconds with a milliseconds resolution (1.3.9, 1.2.6); time elapsed since the first bytes were read from the client
- $upstream_response_time: keeps time spent on receiving the response from the upstream server; the time is kept in seconds with millisecond resolution. Times of several responses are separated by commas and colons like addresses in the $upstream_addr variable.
Nginx uses a multi-process, single-threaded model. The main thread of each worker process must process everything asynchronously; otherwise, performance suffers.
In OpenResty, our business logic is written mainly in Lua. If the code makes full use of the async APIs, e.g. cosocket or run_worker_thread, then it meets the requirements of high-performance processing. In reality, however, the code may be huge and complex, the downstream/upstream network sometimes goes wrong, and the downstream client or the upstream server may stall on socket reads/writes.

So how do we distinguish how much time is spent in Lua code from how much is spent handling network reads/writes? And is the Lua code blocking or not?
Could we use $request_time - $upstream_response_time? Unfortunately, the answer is no!
Let’s first take a look at Nginx’s request life cycle:
The most common misunderstanding about nginx is to think that nginx reads the whole response body from the upstream before it sends the body to the downstream (and likewise for the request body), so that $upstream_response_time would be the pure upstream traffic cost.

But actually, it depends on the configuration. By default, the request and response bodies are piped between the downstream and the upstream dynamically and incrementally.

Moreover, nginx does not read the request body from the downstream before running the subsequent phases, e.g. the rewrite phase. Instead, once the request headers are read, the phase handlers are triggered, and only after the balancer phase has chosen an upstream node does nginx start to read the request body and send it to the upstream gradually. Also note that after reading the request headers, nginx does not read the request body until the content phase starts or the read is explicitly triggered in Lua code.
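For example, an OpenResty content handler that needs the request body must trigger the read itself. A minimal illustrative fragment (the location path is hypothetical):

```
location /echo {
    content_by_lua_block {
        -- Until this call (or the regular content phase, e.g. proxy_pass),
        -- nginx has only consumed the request headers.
        ngx.req.read_body()
        ngx.say(ngx.req.get_body_data() or "")
    }
}
```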
So let’s clarify the time-cost relationship again:
| proxy_request_buffering | proxy_buffering | $upstream_response_time | $request_time - $upstream_response_time |
|---|---|---|---|
| on | on | upstream_time_cost | downstream_time_cost + lua_time_cost |
| off | on | upstream_time_cost + downstream_request_body_time_cost | downstream_response_body_time_cost + lua_time_cost |
| on | off | upstream_time_cost + downstream_response_body_time_cost | downstream_request_body_time_cost + lua_time_cost |
| off | off | upstream_time_cost + downstream_time_cost | lua_time_cost |

You can see that the meaning of $upstream_response_time changes depending on the proxy_request_buffering and proxy_buffering settings.
So the motivation is: how do we measure the Lua time cost without being affected by these directive settings?
Important notes:

- Interestingly, you may find that $request_time is smaller than $upstream_response_time! How come? Here is an example:

  127.0.0.1 - - [15/Apr/2023:20:05:32 +0800] 127.0.0.1:9080 "GET /get HTTP/1.1" 200 463 0.001 "-" "curl/7.68.0" 172.30.74.11:80 200 0.002 "http://127.0.0.1:9080"

  $request_time is 0.001, while $upstream_response_time is 0.002!

  This is caused by the two metrics using different timing sources: $upstream_response_time is calculated with clock_gettime(CLOCK_MONOTONIC_COARSE), which by default may lag up to 4 milliseconds behind, while $request_time is calculated with gettimeofday(). So $upstream_response_time may end up larger than $request_time.
- $request_time is generated in real time, so different values may be obtained at different points in the request life cycle.
- The access log is written after the Lua log phase handler completes, so there should be no blocking tasks in the Lua code; otherwise, the $request_time in the access log will be larger than the actual value.
proxy_buffering on (by default)
In most use cases, the upstream is on the local network, so its network rate is faster than the downstream's. We then want to buffer as much as possible, so that the upstream can free its resources to handle the next request. Nginx reads the response body from the upstream as fast as possible into memory buffers (or even disk files), while in parallel writing to the downstream by draining those buffers.
If the downstream is too slow to keep up with the upstream, nginx closes the upstream connection first. In this case, $upstream_response_time is the time cost of the upstream traffic plus only part of the downstream traffic.
We could set up a demo to visualize this process.
mock_get.go: send a GET request, but do not read the response.
mock_get_big_respnose.go: send a big response (a 4MB body, to ensure it exceeds the sum of the nginx buffers and the kernel socket buffer).
nginx.conf:

```
daemon off;
error_log /dev/stderr info;
events {}
http {
    upstream backend {
        server 127.0.0.1:3333;
        keepalive 320;
        keepalive_requests 1000;
        keepalive_timeout 60s;
    }
    log_format main escape=default '$remote_addr - $remote_user [$time_local] $http_host "$request" $status $body_bytes_sent $request_time "$http_referer" "$http_user_agent" $upstream_addr $upstream_status $upstream_response_time';
    access_log logs/access.log main;
    send_timeout 10s;
    server {
        listen 20000;
        location / {
            proxy_pass http://backend;
            proxy_http_version 1.1;
            proxy_set_header Connection "";
        }
    }
}
```
Watch logs/access.log:

```
127.0.0.1 - - [20/Apr/2023:18:39:13 +0800] httpbin.org "GET /anything/a HTTP/1.1" 200 2555860 10.016 "-" "-" 127.0.0.1:3333 200 0.015
```

You can see that $upstream_response_time is 0.015 secs, while $request_time is 10.016 secs.
TCP flow between the upstream and nginx (diagram omitted): $upstream_response_time is the time cost from step 1 (nginx starts receiving the response) to step 2 (nginx has read the whole response into its buffers) only; it does not cover step 3 (draining the buffers to the downstream).

You can verify this in the nginx source code: u->state->response_time is finalized in ngx_http_upstream_finalize_request() (src/http/ngx_http_upstream.c).
proxy_buffering off

Modify nginx.conf:

```
...
location / {
    proxy_pass http://backend;
    proxy_http_version 1.1;
    proxy_set_header Connection "";
    proxy_buffering off;
}
...
```
Watch logs/access.log:

```
127.0.0.1 - - [20/Apr/2023:18:48:46 +0800] httpbin.org "GET /anything/a HTTP/1.1" 200 4157440 10.022 "-" "-" 127.0.0.1:3333 200 10.021
```

You can see that $upstream_response_time is 10.021 secs and $request_time is 10.022 secs, which are almost the same. Now $upstream_response_time fully covers the downstream traffic, i.e. it includes step 3.
Solution
We need a metric that measures the execution time of the Lua code alone, excluding all other factors, so that we can verify our business logic is fully async.
We have a few options:

- systemtap
  - heavyweight: you need to install debug symbols, which is sometimes not feasible, e.g. there are no symbols for a custom kernel
  - non-negligible overhead: at least a 5% performance impact
  - not container-friendly
  - cannot write to userspace, i.e. cannot set a Lua variable
- record and calculate in Lua code manually
  - it's hard and tricky to handle Lua errors and coroutine yields
  - code pollution
Could we have a non-intrusive and zero-cost way to record the Lua execution time?
Maybe LD_PRELOAD is a better choice.
https://github.com/kingluo/ngx-lua-request-time
demo
Use the sleep function from libc to do a blocking sleep for 1 second, and log ngx.ctx.openresty_request_time_us to confirm that 1 second of blocking is recorded.
nginx.conf:

```
server {
    error_log /dev/stderr info;
    listen 20000;
    location / {
        access_by_lua_block {
            local ffi = require("ffi")
            ffi.cdef[[unsigned int sleep(unsigned int seconds);]]
            ffi.C.sleep(1);
        }
        proxy_pass https://httpbin.org;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }
    log_by_lua_block {
        ngx.log(ngx.WARN, "ngx.ctx.openresty_request_time_us: ", ngx.ctx.openresty_request_time_us)
    }
}
```
Test with e.g. `curl http://localhost:20000/get`.
Output:

```
2023/04/16 09:59:43 [warn] 2868937#2868937: *3 [lua] log_by_lua(server.conf:17):2: ngx.ctx.openresty_request_time_us: 1000120 while logging request, client: 127.0.0.1, server: , request: "GET /get HTTP/1.1", upstream: "https://34.193.132.77:443/get", host: "localhost:20000"
```
LD_PRELOAD can also be enabled globally via /etc/ld.so.preload, so it is non-intrusive for docker usage as well.

ld.so.preload:

```
/opt/ngx-lua-request-time/ngx_lua_request_time.so
```
design
It's worth noting that for an LD_PRELOAD hook to work, the following conditions must be satisfied:

- The hooked C symbol is public, and its callers do not link to it via offset, i.e. the other symbols in the same ELF file must not use an offset to locate the target symbol, even though it's public.
- The function's input arguments and return value contain enough information for your analysis.
- It's possible to filter out unneeded contexts, because some low-level functions are used by many high-level logics, and some of those calls are noise.
OpenResty lets you inject handlers into different phases of the request life cycle.

All async phases (that is, phases that run your Lua code in a coroutine, which can yield for async IO and resume later) go through the lua_resume() function, and all recursive coroutine resumes are flattened by OpenResty: when you create a new coroutine from an async phase handler, e.g. to access an LDAP server via cosocket, OpenResty exits the entry coroutine (the handler itself) and resumes the new coroutine from there. So durations calculated in lua_resume() never overlap with each other.

Note that this excludes the time spent on async IO! It contains only the pure Lua code execution time.
All sync phases, e.g. the balancer phase and the header/body filter phases, use lua_pcall(). But this is a very low-level function, so to minimize the overhead you need a cheap way to filter out the noise.
After testing, the CPU overhead is small enough to be ignored.