vda19999 Apr 26 2023 at 20:35

On the difference between regular functions and Lambdas

Medium

11 min

2.7K

The point of this article is to explore Lambda functions, their dirrerences from regular functions and how they are implemented, based on C++, Python and Java programming languages.

Throughout this article I will be using godbolt.org to compile code and see machine code or byte code.

C++

What is a function? One could say that a function is a group of statements, prefixed with a function declaration that defines the input and output types. But in terms of machine code, we could say that a function is a sequence of assembly instructions and operands, prefixed with a label with function name and complying with the calling convention (that is, “returns” result and accepts parameters in a certain way and preserves cerain registers).

Besides regular functions, C++ has lambda functions that can be defined inside regular functions. Similarly to ordinary functions, they accept arguments and return values, but, in contrast to regular functions, they can “capture”, either by value or by reference, variables from the scope where they are defined. Capturing means that the lambda function can access (read or write) variables in the outer scope or their copies. Let`s see some examples and which assembly code they are compiled to. I will use gcc 12.2 with -Os flag (the less assembly will be generated, the easier to read).

#include <vector>
#include <cstdio>

// extern so that it won`t be inlined
extern void for_each(const std::vector<int>& vec, void (*func) (int));

void example(const std::vector<int>& vec) {
    for_each(vec, [] (int n) {
        printf("%d\n", n);
    });
}

.LC0:
        .string "%d\n"
example(std::vector<int, std::allocator<int> > const&)::{lambda(int)#1}::_FUN(int):
        movl    %edi, %esi
        xorl    %eax, %eax
        movl    $.LC0, %edi
        jmp     printf
example(std::vector<int, std::allocator<int> > const&):
        movl    $example(std::vector<int, std::allocator<int> > const&)::{lambda(int)#1}::_FUN(int), %esi
        jmp     for_each(std::vector<int, std::allocator<int> > const&, void (*)(int))
// note that using jmp instead of call is an optimization; the called 
// function will return to the callee of the current function, skipping `example`

So far, we haven`t seen anything interesting: the lambda function that we declared inside example was simply declared as a normal function in the assembly.

Let`s see a more interesting example, where our lambda would capture a variable from the outside.

#include <vector>
#include <cstdio>

extern void for_each(const std::vector<int>& vec, void (*func) (int));

void example(const std::vector<int>& vec, int threshold) {
    for_each(vec, [threshold] (int n) {
        if (n > threshold) {
            printf("%d\n", n);
        }
    });
}

However, this won`t compile:

<source>: In function 'void example(const std::vector<int>&, int)':
<source>:7:19: error: cannot convert 'example(const std::vector<int>&, int)::<lambda(int)>' to 'void (*)(int)'
    7 |     for_each(vec, [threshold] (int n) {
      |                   ^~~~~~~~~~~~~~~~~~~~~
      |                   |
      |                   example(const std::vector<int>&, int)::<lambda(int)>
    8 |         if (n > threshold) {
      |         ~~~~~~~~~~~~~~~~~~~~
    9 |             printf("%d\n", n);
      |             ~~~~~~~~~~~~~~~~~~
   10 |         }
      |         ~          
   11 |     });
      |     ~              
<source>:4:58: note:   initializing argument 2 of 'void for_each(const std::vector<int>&, void (*)(int))'
    4 | extern void for_each(const std::vector<int>& vec, void (*func) (int));
      |                                                   ~~~~~~~^~~~~~~~~~~
Compiler returned: 1

It turns out that a lambda function declared inside another function will have a more complex type than just a normal function, in this case, example(const std::vector<int>&, int)::<lambda(int)> — the type will mention the outer function.

This means two important things. Firstly, the lambda function we wrote can`t be compiled to an assembly function and is not a function in the sense of the definition given above. Secondly, the lambda function can either be used inside the scope where it was defined or passed to a template function (a template function is a template, which is used by the compiler to generate function definitions based on the the substitutions of template arguments), which means that a new implementation of the template function will be generated by the compiler for each lambda function we declare. Let`s see an example:

#include <vector>
#include <cstdio>

template <typename Callable>
void __attribute__ ((noinline)) for_each(const std::vector<int>& vec, Callable func) {
    for (const auto& elem : vec) {
        func(elem);
    }
}

void example(const std::vector<int>& vec, int threshold) {
    for_each(vec, [threshold] (int n) {
        if (n > threshold) {
            printf("%d\n", n);
        }
    });
}

.LC0:
        .string "%d\n"
void for_each<example(std::vector<int, std::allocator<int> > const&, int)::{lambda(int)#1}>(std::vector<int, std::allocator<int> > const&, example(std::vector<int, std::allocator<int> > const&, int)::{lambda(int)#1}):
        pushq   %r12
        pushq   %rbp
        movl    %esi, %ebp
        pushq   %rbx
        movq    8(%rdi), %r12
        movq    (%rdi), %rbx
.L2:
        cmpq    %rbx, %r12
        je      .L7
        movl    (%rbx), %esi
        cmpl    %ebp, %esi
        jle     .L3
        movl    $.LC0, %edi
        xorl    %eax, %eax
        call    printf
.L3:
        addq    $4, %rbx
        jmp     .L2
.L7:
        popq    %rbx
        popq    %rbp
        popq    %r12
        ret
example(std::vector<int, std::allocator<int> > const&, int):
        jmp     void for_each<example(std::vector<int, std::allocator<int> > const&, int)::{lambda(int)#1}>(std::vector<int, std::allocator<int> > const&, example(std::vector<int, std::allocator<int> > const&, int)::{lambda(int)#1})

As we can see, an implementation of for_each was generated and the lambda function was inlined into it — and for_each turned actually to be a function accepting a vector and an integer.

This means that the implementation of lambda functions in C++ is based on templates — and in most cases, it will be inlined.

Let`s look at a slightly more complex example:

#include <vector>
#include <cstdio>

template <typename Callable>
void __attribute__ ((noinline)) for_each(const std::vector<int>& vec, Callable func) {
    for (const auto& elem : vec) {
        func(elem);
    }
}

auto get_filtering_lambda(int threshold) {
    return [threshold] (int n) {
        if (n > threshold) {
            printf("%d\n", n);
        }
    };
}

void example(const std::vector<int>& vec, int threshold) {
    for_each(vec, get_filtering_lambda(threshold));
}

What assembly would we expect to see here? As we remember, for_each does not actually accept a function, rather, it accepts the values and references that the lambda captures. This means that get_filtering_lambda only needs to provide these captured values:

.LC0:
        .string "%d\n"
void for_each<get_filtering_lambda(int)::{lambda(int)#1}>(std::vector<int, std::allocator<int> > const&, get_filtering_lambda(int)::{lambda(int)#1}):
        pushq   %r12
        pushq   %rbp
        movl    %esi, %ebp
        pushq   %rbx
        movq    8(%rdi), %r12
        movq    (%rdi), %rbx
.L2:
        cmpq    %rbx, %r12
        je      .L7
        movl    (%rbx), %esi
        cmpl    %ebp, %esi
        jle     .L3
        movl    $.LC0, %edi
        xorl    %eax, %eax
        call    printf
.L3:
        addq    $4, %rbx
        jmp     .L2
.L7:
        popq    %rbx
        popq    %rbp
        popq    %r12
        ret
get_filtering_lambda(int):
        movl    %edi, %eax
        ret
example(std::vector<int, std::allocator<int> > const&, int):
        jmp     void for_each<get_filtering_lambda(int)::{lambda(int)#1}>(std::vector<int, std::allocator<int> > const&, get_filtering_lambda(int)::{lambda(int)#1})

Note that the template implementation of lambdas is not the only possible one. The other possible one would rely on virtual classes. We could translate a lambda function into a regular function, that accepts the arguments that the lambda accepts and the values that the lambda captures. Then the function can be packed with the captured values into an obect and passed to a function.

class IIntConsumer {
public:
    virtual ~IIntConsumer() = 0;

    virtual void consume(int) = 0;
};

void for_reach(const vector<int>& vec, IIntConsumer* func) {
    for (const auto& elem : vec) {
        func->consume(elem);
    }
}

For our lambda function from the second example above, the compiler would generate a class, extending IIntConsumer and a function of two arguments, then create an object of that class and pass it to for_each.

void lambda_translation(int a, int captured0) {
    if (a > captured0) {
        printf("%d", a);
    }
}

class LambdaHolder: public IIntConsumer {
public:
    LambdaHolder(int a, int b, int captured0) : captured0(captured0) {}
    
    void consume(int a) override {
        lambda_translation(a, captured0);
    }
    
private:
    int captured0;
};

The advantage of such an approach is that the implementation of for_each and other functuions accepting lambdas would require just one implementation, which would reduce the size of the resulting binaries (and therefore, potentially reduice iTLB misses) and compilation time. However, calling virtual methods is more expensive than normal functions — and they can`t be inlined — therefore, in most cases, performance would be worse.

Python

Let`s start by writing some code and see how it is compiled into Python bytecode. I will use CPython 3.7. Python lambdas are one-line functions, so they aren`t very expressive — however, a similar concept to C++ lambdas are functions declared inside other functions — they can “capture” variables from the outside.

Before examining Python bytecode, let`s note three things. Firstly, the numbers in the left column correspond to the lines of the original source code (lines are enumerated starting from 1). This way, we can easily see which bytecode each line results in. Secondly, python bytecode is organised differently than machine code resulting from C, C++ or any other compiled language. In the latter case, the binary header will have the address of the instruction that should be the first one executed when the binary is executed. This instruction would be the beginning of the _start function of the original program (most source programs don`t define it; this function is defined in C standard library and it calls the familiar main after some preparation). On the other hand, in Python, the byte code will be executed from the first instruction and execute until return is reached. Finally, note that Python bytecode is executed on a stack machine, while an i386/amd64 processor is a register-based machine.

def func_generator(a):
    def inner(b):
        return a < b

    return inner

print(func_generator(2)(3))

1           0 LOAD_CONST               0 (<code object func_generator at 0x55ccef231910, file "example.py", line 1>)
              2 LOAD_CONST               1 ('func_generator')
              4 MAKE_FUNCTION            0
              6 STORE_NAME               0 (func_generator)

  7           8 LOAD_NAME                1 (print)
             10 LOAD_NAME                0 (func_generator)
             12 LOAD_CONST               2 (2)
             14 CALL_FUNCTION            1
             16 LOAD_CONST               3 (3)
             18 CALL_FUNCTION            1
             20 CALL_FUNCTION            1
             22 POP_TOP
             24 LOAD_CONST               4 (None)
             26 RETURN_VALUE

Disassembly of <code object func_generator at 0x55ccef231910, file "example.py", line 1>:
  2           0 LOAD_CLOSURE             0 (a)
              2 BUILD_TUPLE              1
              4 LOAD_CONST               1 (<code object inner at 0x55ccef204fc0, file "example.py", line 2>)
              6 LOAD_CONST               2 ('func_generator.<locals>.inner')
              8 MAKE_FUNCTION            8
             10 STORE_FAST               1 (inner)

  5          12 LOAD_FAST                1 (inner)
             14 RETURN_VALUE

Disassembly of <code object inner at 0x55ccef204fc0, file "example.py", line 2>:
  3           0 LOAD_DEREF               0 (a)
              2 LOAD_FAST                0 (b)
              4 COMPARE_OP               0 (<)
              6 RETURN_VALUE

So, what happens in the bytecode above? First, the first line of the original Python source is executed, which pushes onto the stack the reference to the bytecode of the func_generator function and the name of the function. Then the MAKE_FUNCTIONinstruction is executed, which will create a function object within the Python VM. FInally, the newly created function object is saved into the func_generator variable.

The next line to be executed is the 7th. It pushes onto the stack the print and func_generator functions, the argiment to the latter, calls it, pushes the argument to the generated function, then calls the generated function and the print function and returns None.

When the func_generatoris called, the second line of code is executed. It loads the a variable (note that LOAD_CLOSURE and LOAD_FAST is the same thing), creates a tuple containig it, then loads a reference to the code of inner function and a name for the function. Then MAKE_FUNCTION is called — it appears that it is passed, in this case, a tuple of variables that the inner function will capture. This way, a Python function is not only a sequence of bytecode instructions, but also some auxiliary information, like variables captured from the outside. For confirmation, let`s read the description of the insruction in Python docs:

Finally, let`s see what the inner function does (line 3). It loads the parameter onto the stack, then the variable captured from the outside (it would be read from the function object) and compares them.

To conclude, we have seen that:

Python functions are not just sequences of bytecode instructions, but objects containing some data
Support for capturing values from the outside is provided by the Python VM
A Python program may, thus, generate an arbitrary number of functions

Java

Now let`s look what happens in Java. I will be using openjdk 19.

import java.util.ArrayList;
import java.util.concurrent.atomic.*;

class Example {
    public static void caller3(ArrayList<Integer> a, AtomicInteger b, AtomicBoolean boo) {
        a.forEach((i) -> {
            if (i > b.get()) {
                System.out.println(i);
                boo.set(true);
            } else {
                b.decrementAndGet();
                boo.set(false);
            }
        });
    }
}

The resulting bytecode will be:

class Example {
   public static void caller3(java.util.ArrayList<java.lang.Integer>, java.util.concurrent.atomic.AtomicInteger, java.util.concurrent.atomic.AtomicBoolean);
       0: aload_0
       1: aload_1
       2: aload_2
       3: invokedynamic #7,  0              // InvokeDynamic #0:accept:(Ljava/util/concurrent/atomic/AtomicInteger;Ljava/util/concurrent/atomic/AtomicBoolean;)Ljava/util/function/Consumer;
       8: invokevirtual #11                 // Method java/util/ArrayList.forEach:(Ljava/util/function/Consumer;)V
      11: return

  private static void lambda$caller3$0(java.util.concurrent.atomic.AtomicInteger, java.util.concurrent.atomic.AtomicBoolean, java.lang.Integer);
       0: aload_2
       1: invokevirtual #17                 // Method java/lang/Integer.intValue:()I
       4: aload_0
       5: invokevirtual #23                 // Method java/util/concurrent/atomic/AtomicInteger.get:()I
       8: if_icmple     26
      11: getstatic     #28                 // Field java/lang/System.out:Ljava/io/PrintStream;
      14: aload_2
      15: invokevirtual #34                 // Method java/io/PrintStream.println:(Ljava/lang/Object;)V
      18: aload_1
      19: iconst_1
      20: invokevirtual #40                 // Method java/util/concurrent/atomic/AtomicBoolean.set:(Z)V
      23: goto          36
      26: aload_0
      27: invokevirtual #46                 // Method java/util/concurrent/atomic/AtomicInteger.decrementAndGet:()I
      30: pop
      31: aload_1
      32: iconst_0
      33: invokevirtual #40                 // Method java/util/concurrent/atomic/AtomicBoolean.set:(Z)V
      36: return

}

We can see that, firstly, the lambda function resulted in a static function that accepts both the variables that the lambda captures from the outside and secondly, that before calling ArrayList::forEach, it calls something and as a result it gets an object of a class that implements Consumer<Integer>. It is not that important now, how exactly this call is working — you can read about this in more detail here. What is more intgeresting for us, is what exactly class will be implementing Consumer<Integer>. It turns out that the class is generated by the JVM at runtime and can be seen if we add the -jdk.internal.lambda.dumpProxyClasses option to the JVM. After decompiling the class generated at runtime will be:

import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// $FF: synthetic class
final class Lambder$$Lambda$1 implements Consumer {
    private final AtomicInteger arg$1;

    private Lambder$$Lambda$1(AtomicInteger var1) {
        this.arg$1 = var1;
    }

    public void accept(Object var1) {
        Lambder.lambda$caller2$0(this.arg$1, (Integer)var1);
    }
}

So, what happens in Java is the following:

For a lambda function, the java compiler generates another function, which accepts both the objects captured from the outside and the arguments of the lambda function.
At runtime, the JVM generates a class that can hold the objects that the lambda captures and call the lambda with those arguments and arguments that the lambda accepts, creates an object of this class with the captured obects.
The object of the autogenerated class is passed to functions that accept functional interfaces.

Note that the solution implemented for Java is almost same as the theoretical approach for C++ that doesn`t require such extensive use of templates. When discussing C++, we said that this approach is worse in terms of performance. However, that conclusion was made based on the fact that code is executed on an x86/amd64 processor and not a JVM —which means it doesn`t directly apply here. Besides that, Java doesn`t have templates.

Tags:

Hubs: