Why string interning?
String interning is a common optimization in most programming languages.
As Wikipedia puts it:
In computer science, string interning is a method of storing only one copy of each distinct string value, which must be immutable. Interning strings makes some string processing tasks more time- or space-efficient at the cost of requiring more time when the string is created or interned. The distinct values are stored in a string intern pool.
It has the following advantages in LuaJIT:
- strings can be compared directly via pointers
- strings work efficiently as table keys, because the hash is calculated once at allocation time and the pointer guarantees uniqueness
- memory is saved for duplicated strings
Drawbacks
Meanwhile, it brings extra time-consuming work:
- hashing every newly created string
- on collision, running memcmp against all strings in the same hash slot
- rehashing when collisions are high and/or the load factor is low
Whether or not you use a string as a table key or compare it, LuaJIT interns every string it allocates. So it is not suitable for use cases like the following:
You want to cache strings in OpenResty, e.g. keep the most frequently used Redis key-values in memory.
In this case, a lot of CPU is burned in memcmp and rehashing, especially when the values are big and similar to each other.
When I worked at Kugou Music, we stored a lot of file segments in a Redis cluster (each segment is 1~4 MB on average). When we needed to cache them in memory or shared memory, string interning got in our way.
Some would ask: why not store the data in FFI cdata? In fact, you cannot avoid strings, especially in OpenResty, because almost all data comes from cosocket, and cosocket is implemented against the standard Lua API, which only supports strings; no FFI is involved there. That is, no matter where the data comes from, even from C via FFI, you have to convert it to a string before you can communicate with the outside world.
In contrast, most mainstream programming languages only intern string literals or constant expressions.
For example, Python only interns strings such as "variables, constants, and function names" and "dictionary keys". It distinguishes interned strings from non-interned ones, and you can intern a string explicitly (sys.intern).
Check the source code
First, it hashes the string with hash_sparse. This function samples only a few bytes of the string to compute a quick hash, which determines the slot in the hash array. It is fast, but the collision rate is high.
Second, it compares the new string against each string in that hash slot (a linked list), trying to find the target.
If there are too many collisions at this step, it switches to hash_dense to rehash all strings in that hash slot and marks them as dense (if a new string hits a dense hash slot in the first step, it is hashed with hash_dense right away). hash_dense iterates over all characters of the string to calculate the hash value, which is more expensive than hash_sparse.
Finally, if the same string is found in the pool, it is returned directly. Otherwise, a new string object is allocated. At this step, the hash array may also be enlarged, when the number of strings exceeds the array size.
Now we can see the potentially time-consuming places:
- if you have many similar strings, hash_sparse produces the same hash value for them, which causes high hash collisions, a lot of expensive memcmp calls, and a switch to the expensive hash_dense
- when the number of strings increases, LuaJIT reallocates the hash array (twice the old size), and keep in mind that it rehashes all existing strings in the table!
So avoid similar and big strings, and warm up the cache at startup.
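The resize behaviour in the second bullet can be simulated with a toy model. The doubling-at-100%-load threshold here is an assumption for illustration, not necessarily LuaJIT's exact policy; the point is that each doubling rehashes every existing string, so warming up the cache at startup moves those O(n) spikes out of the request path.

```c
#include <assert.h>
#include <stdio.h>

int main(void) {
    int size = 2, count = 0, rehashed = 0;
    for (int i = 0; i < 1000; i++) {
        if (count + 1 > size) {  /* load factor would exceed 100% */
            rehashed += count;   /* every existing entry is rehashed */
            size *= 2;           /* array doubles */
        }
        count++;
    }
    /* Total rehash work is amortized O(n), but each doubling is an
     * O(current count) latency spike at insertion time. */
    printf("inserts=%d rehash-ops=%d final-size=%d\n", count, rehashed, size);
    /* prints: inserts=1000 rehash-ops=1022 final-size=1024 */
    assert(rehashed < 2 * count);
    return 0;
}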
Analyze the overhead of string interning
How expensive is memcmp?
On a single CPU core (Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz), comparing 1 MB strings that differ only in the last character, memcmp could only run about 90 times per second at most.
How expensive is hash collision?
We construct a workload to simulate the cache use case (set-only, no get).
Each HTTP request inserts a key-value pair into a Lua table. The key is an integer and the value is a 1 MB string. Note that although the key is not a string, LuaJIT still interns every value string.
The HTTP client chooses either unique random strings or similar strings.
We use a systemtap script to compare the lj_str_new time cost between the two.
nginx.conf
https://gist.github.com/kingluo/ca30b8f3b76d2af95f98dde93b91d049#file-nginx-conf
error_log /dev/stderr info;
worker_processes 1;
events {}
http {
    server {
        client_max_body_size 2M;
        client_body_buffer_size 2M;
        listen 10000;
        location /run {
            content_by_lua_block {
                require("workload").run()
            }
        }
        location /clean {
            content_by_lua_block {
                require("workload").clean()
            }
        }
    }
}
workload.lua
https://gist.github.com/kingluo/ca30b8f3b76d2af95f98dde93b91d049#file-workload-lua
bench.go
https://gist.github.com/kingluo/ca30b8f3b76d2af95f98dde93b91d049#file-bench-go
lj_str_new.sxx
https://gist.github.com/kingluo/ca30b8f3b76d2af95f98dde93b91d049#file-lj_str_new-sxx
...
probe process("$^libluajit_path").function("lj_str_new") {
if (tid() == target()) {
start_time = gettimeofday_us()
}
}
probe process("$^libluajit_path").function("lj_str_new").return {
if (tid() == target()) {
if (start_time) {
latencies <<< gettimeofday_us() - start_time
}
}
}
...
Test
Let’s check the scenario of similar strings first.
The average cost of lj_str_new is almost 1 ms!
Now, check the scenario of unique strings.
Unique random strings cause fewer hash collisions and therefore fewer memcmp calls, so the result is much better.
Could we do dynamic string interning?
We could intern a string only when it is:
- used as a table key
- compared with another string
In fact, the Lua community has explored this idea too: http://lua-users.org/wiki/SpeedingUpStrings
However, this requires modifying a lot of code in LuaJIT.
- Interpreter
LuaJIT uses DynASM to generate the interpreter for each CPU architecture, so you would need to change every interpreter file.
Take vm_x86.dasc as an example: you can see that it uses the string addresses to do the comparison directly.
- Table key
Note that here strV gets the address of the string object.
- GC
At the sweep stage of GC (gc_sweepstr), it iterates over the string pool to remove all dangling objects.
If we do not intern some strings, those strings must be stored elsewhere and handled by the GC as well.
Moreover, we would need an extra pointer to the canonical GC object, and a flag to check whether the object is interned.
- Tracing and compilation
This is the most difficult part; check lj_obj_equal and lj_record_ins.
There may be other minor places to change.
All in all, LuaJIT assumes and hardcodes string interning everywhere. It is a tough job to optimize.
Conclusion
String interning in LuaJIT is rough and straightforward. You must be careful to avoid similar strings and hash collisions to get good performance. It would be great if LuaJIT supported dynamic string interning in the future.