TL;DR
How do we record the per-request lua execution time (excluding time spent on async IO) in ngx.ctx, in a non-intrusive and zero-cost way?
The lua execution time is the sum of the following metrics:
- lua_resume
  - phases
    - REWRITE
    - ACCESS
    - CONTENT
- lua_pcall
  - phases
    - BALANCER
  - filters
    - header_filter
    - body_filter
https://github.com/kingluo/ngx-lua-request-time
The issue of measuring request time
Nginx has two variables to record request time:
- $request_time: request processing time in seconds with a milliseconds resolution (1.3.9, 1.2.6); time elapsed since the first bytes were read from the client
- $upstream_response_time: keeps time spent on receiving the response from the upstream server; the time is kept in seconds with millisecond resolution. Times of several responses are separated by commas and colons like addresses in the $upstream_addr variable.
Nginx uses a multi-process, single-threaded model. The main thread of each worker process must process everything asynchronously; otherwise, performance suffers.
In OpenResty, our business logic is built mainly in lua code. If the code makes full use of the async APIs, e.g. cosocket and run_worker_thread, then it meets the requirements of high-performance processing. In reality, however, the code may be huge and complex, sometimes the downstream/upstream network goes wrong, and sometimes the downstream client or the upstream server stalls on socket reads/writes.
So how do we tell how much time is spent on lua code, how much on handling network r/w, and whether the lua code blocks? Could we just use $request_time - $upstream_response_time? Unfortunately, the answer is no!
Let's have a look at the request flow of nginx first:
[Figure: nginx request flow, step1 through step6, from reading the first bytes of the request from the client to sending the last bytes of the response]
The biggest misunderstanding about nginx is thinking that it reads the whole response body from the upstream before sending it to the downstream (and likewise for the request body), so that $upstream_response_time would be the pure upstream traffic time-cost.
That is completely wrong! In fact, the traffic piping between upstream and downstream is done dynamically and incrementally.
Moreover, nginx will not read the request body from the downstream by default. Instead, when the headers are read, it triggers the phase handlers to handle this request, and after the balancer phase chooses an upstream node, it starts to read the request body and send it to the upstream gradually.
So let’s clarify those variables again:
- $request_time is the whole request duration, including the traffic time-cost between nginx and the upstream/downstream. It changes each time you read it, because each read calculates the duration from the current (cached) timestamp. It is actually the time-cost from step1 to step6.
- $upstream_response_time is actually the time-cost from step2 to step6:
  - if proxy_request_buffering is on, $upstream_response_time is the time-cost between openresty and the upstream.
  - if proxy_request_buffering is off, $upstream_response_time is the time-cost between openresty and the upstream, plus the time-cost of reading the request body from the downstream.
If we enable proxy buffering, it's more complicated. The upstream may finish early, before the slow downstream is done. Either way, $upstream_response_time contains the time-cost overlap of the upstream and downstream traffic, maybe partially, maybe entirely.
proxy_buffering on (by default)
In most use cases, the upstream is on the local network, so its network rate is faster than the downstream's. We then want to buffer as much as possible so that the upstream can free its resources to handle the next request.
Nginx will try to read the response body from the upstream as much as possible into the memory buffers or even disk files, in parallel to writing to the downstream by draining the buffers.
If the downstream is too slow to keep up with the upstream, nginx will close the upstream side first. In this case, $upstream_response_time is the time-cost of the upstream traffic plus part of the downstream traffic.
We can set up a demo to visualize this process.
mock_get.go
Send a GET request, but do not read the response.
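The Go source is not reproduced here; a minimal C sketch of the same behavior (assuming nginx listens on 127.0.0.1:20000, as in the nginx.conf below) might look like this:

/* mock_get: send a GET through nginx, then never read the response,
 * so the downstream side of the proxy stalls. A C sketch of what
 * mock_get.go does; the real file is a Go program. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(20000);                 /* nginx listen port */
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);
    if (connect(fd, (struct sockaddr *) &addr, sizeof(addr)) < 0) {
        perror("connect");
        return 1;
    }
    const char *req = "GET /anything/a HTTP/1.1\r\n"
                      "Host: httpbin.org\r\n\r\n";
    write(fd, req, strlen(req));
    /* Deliberately never read(); hold the connection open until nginx
     * gives up (send_timeout 10s in nginx.conf below). */
    sleep(60);
    close(fd);
    return 0;
}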
mock_get_big_respnose.go
Send a big response (4MB body; make sure it exceeds the sum of the nginx buffers and the kernel socket buffers).
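Likewise, a minimal C sketch of the mock backend (assumed to listen on 127.0.0.1:3333, matching the upstream block below):

/* mock_get_big_respnose: an HTTP server on port 3333 that answers any
 * request with a 4MB body. A C sketch of the Go mock. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <signal.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

#define BODY_SIZE (4 * 1024 * 1024)   /* bigger than buffers + sockbuf */

int main(void) {
    signal(SIGPIPE, SIG_IGN);         /* survive downstream resets */

    int srv = socket(AF_INET, SOCK_STREAM, 0);
    int on = 1;
    setsockopt(srv, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(3333);      /* upstream backend port */
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    bind(srv, (struct sockaddr *) &addr, sizeof(addr));
    listen(srv, 16);

    char *body = malloc(BODY_SIZE);
    memset(body, 'x', BODY_SIZE);

    for (;;) {
        int fd = accept(srv, NULL, NULL);
        char buf[4096];
        read(fd, buf, sizeof(buf));   /* consume the request */
        char hdr[128];
        int n = snprintf(hdr, sizeof(hdr),
                         "HTTP/1.1 200 OK\r\nContent-Length: %d\r\n\r\n",
                         BODY_SIZE);
        write(fd, hdr, n);
        for (size_t off = 0; off < BODY_SIZE; ) {   /* write the 4MB body */
            ssize_t w = write(fd, body + off, BODY_SIZE - off);
            if (w <= 0) break;
            off += (size_t) w;
        }
        close(fd);
    }
}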
nginx.conf
daemon off;
error_log /dev/stderr info;
events {}
http {
    upstream backend {
        server 127.0.0.1:3333;
        keepalive 320;
        keepalive_requests 1000;
        keepalive_timeout 60s;
    }
    log_format main escape=default '$remote_addr - $remote_user [$time_local] $http_host "$request" $status $body_bytes_sent $request_time "$http_referer" "$http_user_agent" $upstream_addr $upstream_status $upstream_response_time';
    access_log logs/access.log main;
    send_timeout 10s;
    server {
        listen 20000;
        location / {
            proxy_pass http://backend;
            proxy_http_version 1.1;
            proxy_set_header Connection "";
        }
    }
}
Watch logs/access.log:
127.0.0.1 - - [20/Apr/2023:18:39:13 +0800] httpbin.org "GET /anything/a HTTP/1.1" 200 2555860 10.016 "-" "-" 127.0.0.1:3333 200 0.015
You can see that $upstream_response_time is 0.015 secs, while $request_time is 10.016 secs. (The 10 seconds come from send_timeout 10s: the mock client never reads the response, so nginx gives up writing to the downstream after 10 seconds.)
TCP flow between the upstream and nginx:
[Figure: TCP flow between the upstream and nginx]
$upstream_response_time is the time-cost from step1 to step2 only.
Check the nginx source code:
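Here is where nginx freezes the value, abridged from ngx_http_upstream_finalize_request() in src/http/ngx_http_upstream.c (the exact code varies across nginx versions):

/* Abridged: the response time is fixed when the upstream part of the
 * request is finalized, even if buffered data is still being drained
 * to the downstream. */
static void
ngx_http_upstream_finalize_request(ngx_http_request_t *r,
    ngx_http_upstream_t *u, ngx_int_t rc)
{
    /* ... */

    if (u->state && u->state->response_time == (ngx_msec_t) -1) {
        u->state->response_time = ngx_current_msec - u->start_time;
    }

    /* ... */
}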
proxy_buffering off
Modify nginx.conf:
...
location / {
    proxy_pass http://backend;
    proxy_http_version 1.1;
    proxy_set_header Connection "";
    proxy_buffering off;
}
...
Watch logs/access.log:
127.0.0.1 - - [20/Apr/2023:18:48:46 +0800] httpbin.org "GET /anything/a HTTP/1.1" 200 4157440 10.022 "-" "-" 127.0.0.1:3333 200 10.021
You can see that $upstream_response_time is 10.021 secs and $request_time is 10.022 secs, which are almost the same.
Now $upstream_response_time fully covers the downstream traffic, i.e. it includes step3.
Solution
We need a metric that measures the execution time of the lua code alone, excluding all other factors, so that we can ensure our business logic is fully async.
We have a few solutions:
- systemtap
  - heavyweight: you need to install debug symbols, which is sometimes not feasible, e.g. when symbols are lacking for a custom kernel
  - non-negligible overhead, at least a 5% performance impact
  - not container friendly
  - cannot write to userspace, i.e. cannot set a lua variable
- record and calculate in lua code manually
  - it's hard and tricky to handle lua errors and coroutine yields
  - code pollution
Could we have a non-intrusive and zero-cost way to record the lua execution time?
Maybe LD_PRELOAD is a better choice.
https://github.com/kingluo/ngx-lua-request-time
demo
Use the sleep function from libc to block for 1 second in the access phase, and log ngx.ctx.openresty_request_time_us to confirm the recorded lua time reflects that 1-second block.
nginx.conf
server {
    error_log /dev/stderr info;
    listen 20000;
    location / {
        access_by_lua_block {
            local ffi = require("ffi")
            ffi.cdef[[unsigned int sleep(unsigned int seconds);]]
            ffi.C.sleep(1)
        }
        proxy_pass https://httpbin.org;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }
    log_by_lua_block {
        ngx.log(ngx.WARN, "ngx.ctx.openresty_request_time_us: ", ngx.ctx.openresty_request_time_us)
    }
}
Test:
curl http://localhost:20000/get
Output:
2023/04/16 09:59:43 [warn] 2868937#2868937: *3 [lua] log_by_lua(server.conf:17):2: ngx.ctx.openresty_request_time_us: 1000120 while logging request, client: 127.0.0.1, server: , request: "GET /get HTTP/1.1", upstream: "https://34.193.132.77:443/get", host: "localhost:20000"
LD_PRELOAD can also be enabled globally via /etc/ld.so.preload, so it's non-intrusive for docker usage too.
ld.so.preload
/opt/ngx-lua-request-time/ngx_lua_request_time.so
design
It's worth noting that to use LD_PRELOAD, the following conditions must be satisfied:
- the hooked C symbol is public and is reached through dynamic symbol resolution; if other symbols in the same ELF file locate the target symbol via offset (even though it's public), those calls cannot be intercepted
- the function's input arguments and return value contain enough information for your analysis
- it's possible to filter out unneeded contexts, because a low-level function is used by many high-level logics, and some of them are noise
OpenResty lets you inject handlers into different phases of the request life cycle.
All async phases (that is, phases that run your lua code in a coroutine, which can yield for async IO and be resumed later) use the lua_resume() function, and all recursive coroutine resumes are flattened by OpenResty: when you create a new coroutine from an async phase handler, e.g. to access an LDAP server via cosocket, OpenResty exits the entry coroutine (the handler itself) and resumes the new coroutine from there. So the durations measured in lua_resume() do not overlap with each other.
Note that this excludes the time spent on async IO! It contains only the pure lua code execution time.
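A stripped-down sketch of the hook, assuming the LuaJIT/Lua 5.1 ABI; the real implementation in the repo above also attributes the accumulated time to the current request and exposes it as ngx.ctx.openresty_request_time_us:

/* Hook lua_resume() via LD_PRELOAD: time each resume with a monotonic
 * clock and accumulate the duration. Build with:
 *   gcc -shared -fPIC -o hook.so hook.c -ldl */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <time.h>

typedef struct lua_State lua_State;               /* opaque, as in lua.h */
typedef int (*lua_resume_t)(lua_State *, int);

static lua_resume_t real_lua_resume;
static long lua_time_us;                          /* accumulated lua time */

static long now_us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000L + ts.tv_nsec / 1000;
}

int lua_resume(lua_State *L, int narg)            /* LuaJIT/Lua 5.1 ABI */
{
    if (real_lua_resume == NULL) {
        real_lua_resume = (lua_resume_t) dlsym(RTLD_NEXT, "lua_resume");
    }

    long start = now_us();
    int rc = real_lua_resume(L, narg);
    lua_time_us += now_us() - start;              /* the coroutine yields
                                                     before async IO, so this
                                                     interval is lua-only */
    return rc;
}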
All sync phases, e.g. the balancer phase and the header/body filter phases, use lua_pcall() instead. But lua_pcall() is a very low-level function, so to minimize the overhead, you need a cheap method to filter out the noise.
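The lua_pcall() hook looks much the same; should_measure() below is a hypothetical stand-in for that cheap filter, not the repo's actual check:

/* Hook lua_pcall() the same way. lua_pcall() is called from many
 * places, so filter cheaply before taking timestamps. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <time.h>
#include <stdbool.h>

typedef struct lua_State lua_State;
typedef int (*lua_pcall_t)(lua_State *, int, int, int);

static lua_pcall_t real_lua_pcall;
static long lua_time_us;

static long now_us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000L + ts.tv_nsec / 1000;
}

/* hypothetical cheap filter: skip timing outside request processing */
static bool should_measure(lua_State *L)
{
    (void) L;
    return true;   /* placeholder */
}

int lua_pcall(lua_State *L, int nargs, int nresults, int errfunc)
{
    if (real_lua_pcall == NULL) {
        real_lua_pcall = (lua_pcall_t) dlsym(RTLD_NEXT, "lua_pcall");
    }

    if (!should_measure(L)) {
        return real_lua_pcall(L, nargs, nresults, errfunc);
    }

    long start = now_us();
    int rc = real_lua_pcall(L, nargs, nresults, errfunc);
    lua_time_us += now_us() - start;
    return rc;
}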
In our tests, the CPU overhead is small enough to be ignored.