Why string interning?
String interning is a common optimization in most programming languages.
As Wikipedia puts it:
In computer science, string interning is a method of storing only one copy of each distinct string value, which must be immutable. Interning strings makes some string processing tasks more time- or space-efficient at the cost of requiring more time when the string is created or interned. The distinct values are stored in a string intern pool.
It has the following advantages in LuaJIT:
- strings can be compared directly via pointers
- strings work efficiently as table keys, because the hash is calculated once at allocation time and the pointer guarantees uniqueness
- memory is saved for duplicated strings
Drawbacks
Meanwhile, it brings extra time-consuming work:
- hashing every newly created string
- on collision, running memcmp against all strings in the same hash slot
- rehashing when collisions are high and/or the load factor is low
Whether or not you use a string as a table key or compare it, LuaJIT interns every string it allocates. So it is not suitable for use cases like the following:
You want to cache strings in OpenResty, e.g. keep the most frequently used Redis key-values in memory.
In this case, a lot of CPU is burned in memcmp and rehashing, especially when the values are big and similar to each other.
When I worked at Kugou Music, we stored a lot of file segments in a Redis cluster (each segment is 1~4 MB on average). When we needed to cache them in memory or shared memory, string interning got in our way.
Some would ask: why not store the data in FFI cdata? In fact, you cannot avoid strings, especially in OpenResty, because almost all data comes from cosocket, and cosocket is implemented against the standard Lua API, which only supports strings; no FFI is involved there. That is, no matter where the data comes from, even from C via FFI, you have to convert it to a string before you can communicate with the outside world.
In contrast, most mainstream programming languages only intern string literals or constant expressions.
For example, Python only interns strings such as "variables, constants, and function names" and "dictionary keys". It distinguishes interned strings from non-interned ones, and you can intern a string explicitly (sys.intern).
Check the source code
First, it hashes the string with hash_sparse. This function samples only a few bytes of the string to compute a quick hash, which determines the slot in the hash array. It is fast, but the collision rate is high.
Second, it compares the new string against each string in that hash slot (a linked list), trying to find the target.
If there are too many collisions at this step, it switches to hash_dense to rehash all strings in that hash slot and marks them as dense (if a new string hits a dense hash slot in the first step, it is hashed with hash_dense right away). hash_dense iterates over all characters of the string to calculate the hash value, which is more expensive than hash_sparse.
Finally, if the same string is found in the pool, it is returned directly. Otherwise, a new string object is allocated. At this step, the hash array may also be enlarged, when the number of strings exceeds the array size.
Now we can see the potentially time-consuming places:
- if you have many similar strings, hash_sparse produces the same hash value for them, which causes high hash collisions, a lot of expensive memcmp calls, and a switch to the expensive hash_dense
- when the number of strings increases, LuaJIT reallocates the hash array (twice the old size), and keep in mind that it rehashes all existing strings in the table!
So avoid similar and big strings, and warm up the cache at startup.
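The resize behaviour in the second bullet can be simulated with a toy model. The doubling-at-100%-load threshold here is an assumption for illustration, not necessarily LuaJIT's exact policy; the point is that each doubling rehashes every existing string, so warming up the cache at startup moves those O(n) spikes out of the request path.

```c
#include <assert.h>
#include <stdio.h>

int main(void) {
    int size = 2, count = 0, rehashed = 0;
    for (int i = 0; i < 1000; i++) {
        if (count + 1 > size) {  /* load factor would exceed 100% */
            rehashed += count;   /* every existing entry is rehashed */
            size *= 2;           /* array doubles */
        }
        count++;
    }
    /* Total rehash work is amortized O(n), but each doubling is an
     * O(current count) latency spike at insertion time. */
    printf("inserts=%d rehash-ops=%d final-size=%d\n", count, rehashed, size);
    /* prints: inserts=1000 rehash-ops=1022 final-size=1024 */
    assert(rehashed < 2 * count);
    return 0;
}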
Analyze the overhead of string interning
How expensive is memcmp?
On a single CPU core (Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz), comparing 1 MB strings that differ only in the last character, memcmp could only run about 90 times per second at most.
How expensive is hash collision?
We construct a workload to simulate the cache use case (set-only, no get).
Each HTTP request inserts a key-value pair into a Lua table. The key is an integer and the value is a 1 MB string. Note that although the key is not a string, LuaJIT still interns every value string.
The HTTP client chooses either unique random strings or similar strings.
We use a systemtap script to compare the lj_str_new time cost between the two.
nginx.conf
https://gist.github.com/kingluo/ca30b8f3b76d2af95f98dde93b91d049#file-nginx-conf
error_log /dev/stderr info;
worker_processes 1;
events {}
http {
    server {
        client_max_body_size 2M;
        client_body_buffer_size 2M;
        listen 10000;
        location /run {
            content_by_lua_block {
                require("workload").run()
            }
        }
        location /clean {
            content_by_lua_block {
                require("workload").clean()
            }
        }
    }
}
workload.lua
https://gist.github.com/kingluo/ca30b8f3b76d2af95f98dde93b91d049#file-workload-lua
bench.go
https://gist.github.com/kingluo/ca30b8f3b76d2af95f98dde93b91d049#file-bench-go
lj_str_new.sxx
https://gist.github.com/kingluo/ca30b8f3b76d2af95f98dde93b91d049#file-lj_str_new-sxx
...
probe process("$^libluajit_path").function("lj_str_new") {
if (tid() == target()) {
start_time = gettimeofday_us()
}
}
probe process("$^libluajit_path").function("lj_str_new").return {
if (tid() == target()) {
if (start_time) {
latencies <<< gettimeofday_us() - start_time
}
}
}
...
Test
Let’s check the scenario of similar strings first.
The average cost of lj_str_new is almost 1 ms!
Now, check the scenario of unique strings.
Unique random strings cause fewer hash collisions and therefore fewer memcmp calls, so the result is much better.
Could we do dynamic string interning?
We could intern a string only when it is:
- used as a table key
- compared with another string
In fact, the Lua community has explored this idea too: http://lua-users.org/wiki/SpeedingUpStrings
However, this requires modifying a lot of code in LuaJIT.
- Interpreter
LuaJIT uses DynASM to generate the interpreter for each CPU architecture, so you would need to change every interpreter file.
Take vm_x86.dasc as an example: you can see that it uses the string addresses to do the comparison directly.
- Table key
Note that here strV gets the address of the string object.
- GC
At the sweep stage of GC (gc_sweepstr), it iterates over the string pool to remove all dangling objects.
If we do not intern some strings, those strings must be stored elsewhere and handled by the GC as well.
Moreover, we would need an extra pointer to the canonical GC object, and a flag to check whether the object is interned.
- Tracing and compilation
This is the most difficult part; check lj_obj_equal and lj_record_ins.
There may be other minor places to change.
All in all, LuaJIT assumes and hardcodes string interning everywhere. It is a tough job to optimize.
Conclusion
String interning in LuaJIT is rough and straightforward. You must be careful to avoid similar strings and hash collisions to get good performance. It would be great if LuaJIT supported dynamic string interning in the future.