Rust ramblings: monomorphisation, references, async, and exceptions

1604 words; about 9 minutes.
Category: Rust.

I accidentally tripped over a nice optimisation while refactoring a generic exception handler in Rust, one which made the handler zero-cost. Come follow my journey into some dark corners.

Reducing the size of monomorphised functions

As is good and proper, one first writes simple—and thus hopefully correct—code before optimising it. Here's the basic implementation of a wrapper function for a web framework which catches panics, logs them, and returns a 500 error:

async fn panic_500<'a, Fut: Future<Output = Output> + Send, App: Send>(
    ctx: App, wrapped: impl FnOnce(App) -> Fut + Send,
) -> Output {
    let result = AssertUnwindSafe(wrapped(ctx)).catch_unwind().await;
    match result {
        Ok(output) => output,
        Err(panic) => {
            if let Some(message) = panic.downcast_ref::<&str>() {
                error!(?panic, "panic occurred: {}", message);
            } else if let Some(message) = panic.downcast_ref::<String>() {
                error!(?panic, "panic occurred: {}", message);
            } else {
                error!(?panic, "Panic occurred.");
            }
            Output::status(500, "Server Error")
        }
    }
}
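
An aside on those downcasts: the payload of a panic has a different type depending on how the panic was raised, which is why the handler checks for both &str and String. A plain panic!("literal") stores a &'static str, a formatted panic!("... {}", x) allocates a String, and std::panic::panic_any can stash anything else. Here's a minimal, standard-library-only sketch of that behaviour (payload_kind and the two panicking helpers are made-up names for illustration, not part of the real code):

use std::any::Any;
use std::panic;

// Hypothetical helper: classify a caught panic payload the same way the
// wrapper's downcasts do.
fn payload_kind(payload: &(dyn Any + Send)) -> &'static str {
    if payload.downcast_ref::<&str>().is_some() {
        "&str"
    } else if payload.downcast_ref::<String>().is_some() {
        "String"
    } else {
        "something else"
    }
}

fn panics_with_literal() {
    panic!("boom"); // payload is a &'static str
}

fn panics_with_format() {
    panic!("boom: {}", 42); // formatting allocates, so the payload is a String
}

fn main() {
    // Silence the default panic hook so the two deliberate panics below
    // don't spam stderr.
    panic::set_hook(Box::new(|_| {}));

    let literal = panic::catch_unwind(panics_with_literal).unwrap_err();
    let formatted = panic::catch_unwind(panics_with_format).unwrap_err();

    assert_eq!(payload_kind(&*literal), "&str");
    assert_eq!(payload_kind(&*formatted), "String");
}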

The wrapper works fine, but because it is a generic function, the entire body is monomorphised, which causes substantial code bloat since there is a panic_500 wrapper around every service endpoint. Although this function is relatively short, the error! macro generates quite a lot of string-formatting boilerplate, so it stood out in cargo bloat reports. The standard approach for reducing the space overhead of any nontrivial generic function is to move the non-generic parts into an inner function, in this case log_error:

async fn panic_500<'a, Fut: Future<Output = Output> + Send, App: Send>(
    ctx: App, wrapped: impl FnOnce(App) -> Fut + Send,
) -> Output {
    fn log_error(panic: Box<dyn Any + Send>) -> Output {
        if let Some(message) = panic.downcast_ref::<&str>() {
            error!(?panic, "panic occurred: {}", message);
        } else if let Some(message) = panic.downcast_ref::<String>() {
            error!(?panic, "panic occurred: {}", message);
        } else {
            error!(?panic, "Panic occurred.");
        }
        Output::status(500, "Server Error")
    }

    let result = AssertUnwindSafe(wrapped(ctx)).catch_unwind().await;
    match result {
        Ok(output) => output,
        Err(panic) => log_error(panic),
    }
}

This is essentially the final version I use, and such a trivial transformation is routine enough that it wouldn't itself warrant a blog post. However, Clippy gives a diagnostic which sows doubt:

warning: this argument is passed by value, but not consumed in the function body
|
|     fn log_error(panic: Box<dyn Any + Send>) -> Output {
|                         ^^^^^^^^^^^^^^^^^^^ help: consider taking a reference instead: `&Box<dyn Any + Send>`
|
= note: `#[warn(clippy::needless_pass_by_value)]` implied by `#[warn(clippy::pedantic)]`
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#needless_pass_by_value

One might thus be tempted to apply the suggested change. Additionally, a function taking a reference to a smart pointer is a code smell: since the reference is immutable, the smartness is moot, and the function is more reusable if it takes a reference to the inner data instead. (So &Box<T> becomes &T, &String becomes &str, and so on.) Further, the inner function doesn't need its argument to be Send, so we can remove that trait bound too. Applying these mechanical transformations results in something like this:

async fn panic_500<'a, Fut: Future<Output = Output> + Send, App: Send>(
    ctx: App, wrapped: impl FnOnce(App) -> Fut + Send,
) -> Output {
    fn log_error(panic: &dyn Any) -> Output { // <- changed signature
        if let Some(message) = panic.downcast_ref::<&str>() {
            error!(?panic, "panic occurred: {}", message);
        } else if let Some(message) = panic.downcast_ref::<String>() {
            error!(?panic, "panic occurred: {}", message);
        } else {
            error!(?panic, "Panic occurred.");
        }
        Output::status(500, "Server Error")
    }

    let result = AssertUnwindSafe(wrapped(ctx)).catch_unwind().await;
    match result {
        Ok(output) => output,
        Err(panic) => log_error(&panic), // <- changed to pass-by-reference
    }
}
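
A quick illustration of the reusability point, as an aside (flexible and rigid are made-up names): a function borrowing the inner data can be handed any owner of that data via deref coercion, whereas a function borrowing the smart pointer accepts only that exact smart pointer.

// `flexible` borrows the inner data; `rigid` borrows the smart pointer.
fn flexible(s: &str) -> usize {
    s.len()
}

fn rigid(s: &Box<String>) -> usize {
    s.len()
}

fn main() {
    let owned = String::from("hello");
    let boxed: Box<String> = Box::new(owned.clone());

    // &str accepts a literal, a String, or even a Box<String>...
    flexible("literal");
    flexible(&owned);
    flexible(&boxed);

    // ...whereas &Box<String> accepts only the boxed form.
    rigid(&boxed);
}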

But should you do this? It depends.

Remember, the object of the exercise is to move as much code as is sensible from the outer generic function to the inner non-generic function. By applying these transformations, we've moved work back into the outer function:

  • Pass-by-reference leaves ownership of panic in the outer function, and so it has the responsibility of dropping it after calling log_error. The drop is "just" a call to core::ptr::drop_in_place but still adds a handful of extra instructions to perform a subroutine call and cleanup.

  • Turning a smartpointer into a reference involves a call to its Deref::deref impl. As it happens, Box<dyn Any + Send> is bitwise identical to &dyn Any and requires no other work to convert so it's a no-op, but this may not be the case for other smartpointers.

  • Because the outer function has to do work after calling log_error, it can't tail-call it, so that's yet more overhead for a subroutine call and cleanup instead of just a straight branch (see the sketch after this list).

  • Since the outer function is small, it is likely to get inlined into callers, so these extra instructions are multiplied not by the number of monomorphised instances, but by the number of call sites, of which there are at least as many.
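
To make the first and third bullets concrete, here's a sketch of the two calling shapes (consume, borrow, and the two caller functions are hypothetical names, not from the real code):

use std::any::Any;

// Pass-by-value: the callee takes ownership, so the callee drops the Box.
fn consume(_panic: Box<dyn Any + Send>) {
    // ... log, then drop on the way out ...
}

// Pass-by-reference: the caller keeps ownership.
fn borrow(_panic: &dyn Any) {
    // ... log only ...
}

fn caller_by_value(panic: Box<dyn Any + Send>) {
    // Nothing left to do after the call, so this can compile down to a jump.
    consume(panic)
}

fn caller_by_reference(panic: Box<dyn Any + Send>) {
    borrow(&*panic);
    // `panic` is still owned here, so its drop (a call to
    // core::ptr::drop_in_place plus deallocation) runs after `borrow`
    // returns, and no tail call is possible.
}

fn main() {
    caller_by_value(Box::new("by value"));
    caller_by_reference(Box::new("by reference"));
}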

On the flip side, these extra instructions amount to only a few tens of bytes, versus the few thousand already saved by creating an inner function, so one could argue that tail-call-by-value is very much the kind of premature optimisation we're warned about. I counter that there's no downside. However, it's not exactly a hill I'd die on, as life's too short to save a few bytes in a (FX: checks notes) 7MB executable.

Lifetimes and types and borrows, oh my!

The micro-optimisation of tail calling into an inner function accepting an owned value is cute, but it turns out that it's sometimes not possible in Rust. Here is another function which sends a file back to the HTTP client:

async fn send_file(path: impl AsRef<Path> + Send) -> Result<Output> {
    todo!("unimportant, but results in ~12kB of x86 code");
}

…which is generic and thus sees immediate space savings from being mechanically transformed into:

async fn send_file(path: impl AsRef<Path> + Send) -> Result<Output> {
    async fn inner(path: &Path) -> Result<Output> {
        todo!("unimportant, but results in ~12kB of x86 code");
    }

    let path = path.as_ref();
    inner(path).await
}

Again, this is pretty much what I actually use in the real-world version of this code. But there's that niggling &Path there. Can we make it owned somehow so that we can tail-call? The initial hunch is "no", and it turns out that this hunch is correct, but maybe not for the expected reasons.

The async fn sugar is hiding a lot of stuff here. Hand-desugaring results in something like this:

fn send_file2<'a>(path: &'a Path) -> impl Future<Output = Result<Output>> + 'a {
    async {
        todo!("unimportant, but results in ~12kB of x86 code");
    }
}

fn send_file(path: impl AsRef<Path> + Send) -> impl Future<Output = Result<Output>> {
    async move {
        let path2 = path.as_ref();
        send_file2(path2).await
    }
}

An async block—massive handwave alert—returns a future, and is essentially a closure which provides the future's poll function. Variables captured in an async block must outlive the future. In this case we're capturing path, which is then passed to send_file2 by reference, so there is no tail call.
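
To put a rough shape on that handwave, the async move block behaves something like the hand-written future below: it owns its captures and re-borrows them on every poll, which is why the captures must live at least as long as the future. (SendFile is a made-up name, the Output is simplified to (), and this is nothing like the compiler's real state machine.)

use std::future::Future;
use std::path::Path;
use std::pin::Pin;
use std::task::{Context, Poll};

// The async move block in send_file roughly becomes a value owning its capture...
struct SendFile<P: AsRef<Path>> {
    path: P,
}

// ...whose poll method plays the role of the closure body.
impl<P: AsRef<Path>> Future for SendFile<P> {
    type Output = (); // the real code returns Result<Output>; simplified here

    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<Self::Output> {
        // Each poll re-borrows `path` from the future's own storage, so any
        // &Path handed onwards can never outlive the future itself.
        let _path: &Path = self.path.as_ref();
        Poll::Ready(())
    }
}

// (Awaiting a SendFile value needs an executor, which is out of scope for
// this sketch.)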

async { foo.await } can usually be replaced with just foo, much like || bar() can be replaced by bar. path2 is Copy, so it is effectively owned by the function (although of course its referent is not) and can be passed by value. So we just move the as_ref call outside of the async block, apply that transformation to what's left, and we're done:

fn send_file(path: impl AsRef<Path> + Send) -> impl Future<Output = Result<Output>> {
    let path2 = path.as_ref();
    send_file2(path2)
}

…only to be informed by the compiler that the "borrowed value does not live long enough". Fair enough; we can throw explicit lifetimes at the problem and hope for the best:

fn send_file<'a>(path: impl AsRef<Path> + Send + 'a) -> impl Future<Output = Result<Output>> + 'a {
    let path2 = path.as_ref();
    send_file2(path2)
}

Sadly, this sort of thing will never work, no matter how much fiddling you do. path.as_ref() desugars to AsRef::as_ref(&path), and that function has the signature fn as_ref(&self) -> &T. So path2 borrows the local path and lives only for the function's scope, not for the 'a lifetime of path's type; and since path is dropped when send_file returns, the reference cannot escape into the returned future.
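
To make the lifetimes visible, here is the same trait with its elided lifetime written out, plus a caller sketch. (AsRefExpanded and send_file_sketch are made-up names; the expanded trait is shown only for illustration and is otherwise identical to std's AsRef.)

use std::path::Path;

// std's AsRef with its elided lifetime spelled out: the output borrows
// `self`, and nothing longer.
trait AsRefExpanded<T: ?Sized> {
    fn as_ref<'s>(&'s self) -> &'s T;
}

fn send_file_sketch<'a, P: AsRefExpanded<Path> + 'a>(path: P) {
    // The only as_ref available borrows the local `path` for some 's limited
    // to this function body; the 'a bound on P says nothing about how long
    // the *value* lives.
    let path2: &Path = path.as_ref();
    let _ = path2;
    // `path` is dropped here, so `path2` could never be stored in a future
    // returned from this function.
}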

All you can do at this point is curse Rust's type system and standard library for being broken here, undo the desugaring to produce the original, working, readable code, and get on with your day. Sometimes optimisations aren't worth the effort, or even possible.

Exception bifurcation

Back to panic_500. There is a second surprise bonus of this inner-function approach, which I discovered when panic_500 fell off the cargo bloat radar. A Rust panic uses the (C++) exception-handling mechanism, which—massive handwave again—duplicates the code paths at the panic and generates separate code for the panic and non-panic paths. The panic path never executes the non-panic code and vice versa, so the never-executed code gets marked as unreachable and optimised away. What does panic_500 look like if we delete all of the code which only gets executed on panic?

async fn panic_500<'a, Fut: Future<Output = Output> + Send, App: Send>(
    ctx: App, wrapped: impl FnOnce(App) -> Fut + Send,
) -> Output {
    wrapped(ctx).await
}

As noted above, async { wrapped(ctx).await } transforms into wrapped(ctx), and such a simple function call often gets inlined. In my case it does, and panic_500 completely disappears from the output executable. Zero-cost Rust strikes again!

One might therefore ask whether we needed to move the error reporting into its own log_error function after all. The answer is yes: although it is optimised away in the normal code path, the error-reporting code still appears in the exception-handling path, where it would otherwise be monomorphised into every instance, bloating the binary. This bloat may even be more insidious, since instrumentation may fail to look at exceptional code, so you will not realise it's there.