Learn Async Rust From Assembly Code

hydra

Asynchronous programming, or async for short, is a concurrent programming model supported by an increasing number of programming languages. It lets you run a large number of concurrent tasks on a small number of OS threads, while preserving much of the look and feel of ordinary synchronous programming through the async/await syntax.

How does Rust implement asynchronous programming?

Useful references:

https://rust-lang.github.io/async-book/01_getting_started/01_chapter.html

https://cfsamson.github.io/books-futures-explained/introduction.html

OS threading

The most common and easiest approach is OS threading.

But its biggest disadvantage is resource consumption, especially context-switching overhead.

Green thread (stackful coroutine)

Just like Go, Rust provided green threads before it hit 1.0.

Callback/Promise

Just like Netty and JavaScript, callback-based dispatching is a mainstream way to implement async.

But callback hell, fragmented program logic and data sharing are nightmares.

Async/Await (stackless coroutine)

Just like Python and JavaScript, async/await is a popular approach to async.

Advantages:

  • The program logic is linear, which is in line with human thinking habits
  • No need to manage stack growth and shrinkage
  • No need to save CPU state
  • Reuses the generator mechanism

Future trait

The Future trait is at the center of asynchronous programming in Rust.

A Future is an asynchronous computation that can produce a value.

pub trait Future {
    type Output;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output>;
}

So all async runtime implementations, e.g. tokio, take Future as the scheduling unit:

// Function tokio::spawn
pub fn spawn<T>(future: T) -> JoinHandle<T::Output>
where
    T: Future + Send + 'static,
    T::Output: Send + 'static,

Future itself supports Promise-style async programming, aka combinators.

In fact, before async/await was born, the futures 0.1 crate used combinators.

Example:

let future = Connection::connect(conn_str).and_then(|conn| {
    conn.query("somerequest").map(|row|{
        SomeStruct::from(row)
    }).collect::<Vec<SomeStruct>>()
});

let rows: Result<Vec<SomeStruct>, SomeLibraryError> = block_on(future);

Moreover, you can wrap futures into a higher-level future to implement new logic.

For example, write a Select top-level Future to run two sub-Futures simultaneously:

use futures::future::Either;
use futures::FutureExt; // for poll_unpin
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

pub struct Select<A, B> {
    inner: Option<(A, B)>,
}

impl<A, B> Future for Select<A, B>
where
    A: Future + Unpin,
    B: Future + Unpin,
{
    type Output = Either<(A::Output, B), (B::Output, A)>;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        let (mut a, mut b) = self.inner.take().expect("cannot poll Select twice");

        if let Poll::Ready(val) = a.poll_unpin(cx) {
            return Poll::Ready(Either::Left((val, b)));
        }

        if let Poll::Ready(val) = b.poll_unpin(cx) {
            return Poll::Ready(Either::Right((val, a)));
        }

        self.inner = Some((a, b));
        Poll::Pending
    }
}

What does async/await look like in assembly code?

An async function is just a friendlier presentation of a Future. Combined with .await, you can compose different Futures with ease.

Each async function is compiled into a special generator, which implements the Future trait.

State machine

In the compiler, generators are currently compiled as state machines. Each yield expression will correspond to a different state that stores all live variables over that suspension point. Resumption of a generator will dispatch on the current state and then execute internally until a yield is reached, at which point all state is saved off in the generator and a value is returned.

Let’s take a simple example:

use futures::executor::block_on;

async fn foo() -> i32 {
    return 32
}

async fn bar() -> i32 {
    return 32
}

async fn foobar() {
    let str = "foo".to_string();
    let val = foo().await;
    let str2 = "bar".to_string();
    let val2 = bar().await;
    println!("hello, world! &str: {:p}, &val: {:p} &str2: {:p}, &val2: {:p}", &str, &val, &str2, &val2);
}

fn main() {
    let future = foobar();
    block_on(future);
}

We have a top-level async function foobar(), which calls async functions foo() and bar().

Let’s use the Rust Playground to compile the code into assembly in debug mode.

Check the assembly code of the foobar() generator:

playground::foobar::{{closure}}:
	subq	$488, %rsp
	movq	%rsi, 176(%rsp)
	movq	%rdi, 192(%rsp)
	movq	%rsi, 416(%rsp)
	movq	192(%rsp), %rax
	movzbl	28(%rax), %eax
	movq	%rax, 184(%rsp)
	movq	184(%rsp), %rax
	leaq	.LJTI38_0(%rip), %rcx
	movslq	(%rcx,%rax,4), %rax
	addq	%rcx, %rax
	jmpq	*%rax
	ud2

State data structure

The first argument %rdi is the pointer to the generator itself, which contains the state data; it is saved on the stack at 192(%rsp).

This generator type has a number of states (represented here as enum variants) corresponding to the conceptual states of the generator; every variable that is live across a suspension point is stored in the variant for that state.

What’s the structure of the state data? Well, let’s check it in the LLVM IR:

%"[async fn body@src/main.rs:7:23: 9:2]" = type { i8 }
%"[async fn body@src/main.rs:3:23: 5:2]" = type { i8 }
%"[async fn body@src/main.rs:11:19: 17:2]::Suspend1" =
type { %"alloc::string::String", i32, [4 x i8], %"[async fn body@src/main.rs:7:23: 9:2]", [7 x i8], %"alloc::string::String" }
%"[async fn body@src/main.rs:11:19: 17:2]::Suspend0" =
type { %"alloc::string::String", [8 x i8], %"[async fn body@src/main.rs:3:23: 5:2]", [7 x i8] }

Suspend0 is the enum variant that lives across the first yield point foo().await, and Suspend1 is the variant that lives across the second yield point bar().await.

In my program, all local variables live till the end, so Suspend1 is a superset of Suspend0 and it’s enough to check Suspend1. Note that an embedded generator is stored inside its parent generator; because the two embedded generators only live before their respective yield points and are of the same size, they reuse the same field of the foobar generator.

In the assembly code, you can see the mapping from fields to variables:

  • +0 : %"alloc::string::String" -> str
  • +24 : i32 -> val
  • +28 : [4 x i8] -> state discriminator (first byte)
  • +32 : async fn body -> embedded generator for foo() or bar(), 1 byte
  • +33 : [7 x i8] -> padding
  • +40 : %"alloc::string::String" -> str2

Note that val2 is only assigned after the last yield point, so it lives on the stack but not in the state.

From the program output, you can confirm the address relationship too:

hello, world! &str: 0x7ffc9e68c9b8, &val: 0x7ffc9e68c9d0 &str2: 0x7ffc9e68c9e0, &val2: 0x7ffc9e68c6ec

State resume/yield

.LJTI38_0 is the label of the jump table that stores the offsets of the state resume entries.

.LJTI38_0:
	.long	.LBB38_2-.LJTI38_0 ; first resume entry
	.long	.LBB38_3-.LJTI38_0 ; resumed after completion
	.long	.LBB38_4-.LJTI38_0 ; resumed after panicking
	.long	.LBB38_5-.LJTI38_0 ; yield point1: foo().await
	.long	.LBB38_6-.LJTI38_0 ; yield point2: bar().await

The assembly code uses %rip relative addressing to locate the entry address:

; get discriminator and save it in %rax
movq	192(%rsp), %rax
movzbl	28(%rax), %eax
...
; get .LJTI38_0 address
leaq	.LJTI38_0(%rip), %rcx
; get entry offset relative to .LJTI38_0, each offset value is in 32 bit size
; i.e. offset = *(%rcx + %rax * 4)
movslq	(%rcx,%rax,4), %rax
; get entry absolute address = .LJTI38_0 + offset
addq	%rcx, %rax

Let’s check .LBB38_2:

.LBB38_2:
	movq	176(%rsp), %rax
	movq	%rax, 408(%rsp)
	movq	192(%rsp), %rdi
	leaq	.L__unnamed_12(%rip), %rsi
	movl	$3, %edx
	callq	<str as alloc::string::ToString>::to_string
	jmp	.LBB38_13
...
.LBB38_13:
	callq	playground::foo
	movb	%al, 159(%rsp)
	jmp	.LBB38_16

You can see that to_string() stores its result at the address loaded from 192(%rsp), which is the address of the str field (offset +0 of the generator).

Then it calls foo() to obtain its generator, which is subsequently polled.

.LBB38_16:
	...
	callq	<F as core::future::into_future::IntoFuture>::into_future
	movb	%al, 158(%rsp)
	jmp	.LBB38_17

.LBB38_17:
	...
	jmp	.LBB38_8

.LBB38_18:
	movq	408(%rsp), %rdi
	callq	core::future::get_context
	movq	%rax, 144(%rsp)
	jmp	.LBB38_19

.LBB38_19:
	...
	; foo().await
	callq	playground::foo::{{closure}}
	movl	%edx, 136(%rsp)
	movl	%eax, 140(%rsp)
	jmp	.LBB38_20

.LBB38_20:
	movl	136(%rsp), %eax
	movl	140(%rsp), %ecx
	movl	%ecx, 224(%rsp)
	movl	%eax, 228(%rsp)
	movl	224(%rsp), %eax
	; check poll result
	; 0: ready
	; 1: pending
	cmpq	$0, %rax
	; if pending, yield
	jne	.LBB38_22
	; if ready, continue next statement
	; i.e. `let str2 = "bar".to_string();`
	movl	228(%rsp), %ecx
	movl	%ecx, 444(%rsp)
	movq	192(%rsp), %rax
	; set foo() result to `val`
	movl	%ecx, 24(%rax)
	; compute the address of the `str2` field (+40)
	movq	192(%rsp), %rdi
	addq	$40, %rdi
	leaq	.L__unnamed_14(%rip), %rsi
	movl	$3, %edx
	callq	<str as alloc::string::ToString>::to_string
	; continue next statement
	jmp	.LBB38_23

.LBB38_22:
	...
	; foo() not finished yet,
	; set discriminator to 3, i.e. `.LBB38_5`,
	; so that next resume will re-run foo().
	movq	192(%rsp), %rax
	movb	$3, 28(%rax)
	...
	retq

After foobar() completes, it sets the discriminator to 1.

.LBB38_41:
	...
	movq	192(%rsp), %rax
	movb	$1, 28(%rax)
	...
	retq

Discriminator 1 means the next resume will jump to .LBB38_3, which panics the program.

.LBB38_3:
	xorl	%eax, %eax
	testb	$1, %al
	jne	.LBB38_3
	jmp	.LBB38_10
...
.LBB38_10:
	leaq	str.0(%rip), %rdi
	leaq	.L__unnamed_13(%rip), %rdx
	movq	core::panicking::panic@GOTPCREL(%rip), %rax
	movl	$35, %esi
	callq	*%rax
	ud2
...
str.0:
	.ascii	"`async fn` resumed after completion"

Multi-variant layouts for generators

As mentioned, each state of the generator has its own variant structure layout. For non-overlapping fields of two consecutive states, memory is reused.

For an async function, the compiler needs to determine whether each local variable is a stack variable or a state field, via liveness analysis on the MIR.

  • stack variables are not saved in the state
  • some state fields may no longer exist after their state

Useful references:

https://tmandry.gitlab.io/blog/posts/optimizing-await-2/

https://github.com/rust-lang/rust/pull/59897

Let’s talk about some common cases.

Scoped

Normally, variables live until the end of a function, even if you do not use them after declaration. So, for temporary variables that you don’t want saved in the state, you need to scope them.

For example, if you don’t use str after printing it, you need to scope it:

async fn foobar() {
    {
        let str = "foo".to_string();
        println!("{}", str);
    }
    let val = foo().await;
    let str2 = "bar".to_string();
    let val2 = bar().await;
    println!("hello, world! &val: {:p} &str2: {:p}, &val2: {:p}",  &val, &str2, &val2);
}

Then str is no longer one of the fields of the state.

Suspend0 = type { [8 x i8], %"[async fn body@src/main.rs:3:23: 5:2]", [3 x i8] }
Suspend1 = type { i32, [4 x i8], %"[async fn body@src/main.rs:7:23: 9:2]", [7 x i8], %"alloc::string::String" }

This is especially meaningful for a thread-safe generator, i.e. an async function that will be scheduled by a multi-threaded executor.

Traits like Send and Sync are automatically implemented for a Generator depending on the captured variables of the environment. Unlike closures, generators also depend on variables live across suspension points. This means that although the ambient environment may be Send or Sync, the generator itself may not be due to internal variables live across yield points being not-Send or not-Sync.

For example, MutexGuard is not Send, so a future holding one across a yield point cannot be used with tokio’s multi-threaded scheduler.

You need to scope it:

{
    let s = some_mutex.lock();
    // use `s` here
}
// we don't own the lock anymore.
yield;

Move

When you move a variable out, it is no longer saved in the state.

Interestingly, drop is actually a move, so when you drop a variable manually, it’s excluded from the state.

let str = "foo".to_string();
drop(str);
// str is gone
let val = foo().await;

Copy

So far, the Rust compiler keeps copied variables in the state even if they are not actually used later.

You could test it:

async fn foobar() {
    let str = "foo".to_string();
    let val = foo().await;
    drop(val); // val is Copy: drop consumes a copy, val itself remains usable
    let str2 = "bar".to_string();
    let val2 = bar().await;
    println!("hello, world! &val: {:p} &str2: {:p}, &val2: {:p}",  &val, &str2, &val2);
}

Yes, drop copies values whose types implement Copy, although it’s a no-op.

This effectively does nothing for types which implement Copy, e.g. integers. Such values are copied and then moved into the function, so the value persists after this function call.

From the LLVM IR, we can see that val is still there:

Suspend1 = type { %"alloc::string::String", i32, [4 x i8], %"[async fn body@src/main.rs:7:23: 9:2]", [7 x i8], %"alloc::string::String" }
Suspend0 = type { %"alloc::string::String", [8 x i8], %"[async fn body@src/main.rs:3:23: 5:2]", [7 x i8] }

Shadowed variables

As is known, shadowed variables are not dropped until they go out of scope.

async fn foobar() {
    let str = "foo".to_string();
    let str = 233;
    let val = foo().await;
    let str2 = "bar".to_string();
    let val2 = bar().await;
    println!("hello, world! &str: {:p}, &val: {:p} &str2: {:p}, &val2: {:p}", &str, &val, &str2, &val2);
}

From the LLVM IR, we can see that the shadowed value "foo" still occupies one field of the state:

Suspend0 = type { %"alloc::string::String", i32, [12 x i8], %"[async fn body@src/main.rs:3:23: 5:2]", [7 x i8] }
Suspend1 = type { %"alloc::string::String", i32, i32, [8 x i8], %"[async fn body@src/main.rs:7:23: 9:2]", [7 x i8], %"alloc::string::String" }

What does Pin look like in assembly code?

Pin is a kind of smart pointer that imposes a constraint: the compiler must not move its pointee or swap its content. This makes self-referential structs feasible, which is a common case in async functions, because we most likely hold references to local variables across the function body.

Pin does not appear in the assembly code; even Deref and field projections are optimized out.