
What's faster: calling a function with a struct or a pointer to a struct?

Added 5 Apr 2021, 12:34 p.m. edited 18 Jun 2023, 1:23 p.m.

Well, this one should be easy, right? Obviously, as C is pass by value, passing a couple of pointers means copying far less data than passing 32 floats (two Matrix structs).

Initially, a quick trip to godbolt can easily convince you it's an open-and-shut case... here's how a function gets called with structs:

        lea     rax, [rbp-216]
        push    QWORD PTR [rbp-96]
        push    QWORD PTR [rbp-104]
        push    QWORD PTR [rbp-112]
        push    QWORD PTR [rbp-120]
        push    QWORD PTR [rbp-128]
        push    QWORD PTR [rbp-136]
        push    QWORD PTR [rbp-144]
        push    QWORD PTR [rbp-152]
        push    QWORD PTR [rbp-32]
        push    QWORD PTR [rbp-40]
        push    QWORD PTR [rbp-48]
        push    QWORD PTR [rbp-56]
        push    QWORD PTR [rbp-64]
        push    QWORD PTR [rbp-72]
        push    QWORD PTR [rbp-80]
        push    QWORD PTR [rbp-88]
        mov     rdi, rax
        call    MatrixMultiply
        sub     rsp, -128
        mov     rax, QWORD PTR [rbp-224]
        mov     rcx, QWORD PTR [rbp-216]
        mov     rbx, QWORD PTR [rbp-208]
        mov     QWORD PTR [rax], rcx
        mov     QWORD PTR [rax+8], rbx
        mov     rcx, QWORD PTR [rbp-200]
        mov     rbx, QWORD PTR [rbp-192]
        mov     QWORD PTR [rax+16], rcx
        mov     QWORD PTR [rax+24], rbx
        mov     rcx, QWORD PTR [rbp-184]
        mov     rbx, QWORD PTR [rbp-176]
        mov     QWORD PTR [rax+32], rcx
        mov     QWORD PTR [rax+40], rbx
        mov     rcx, QWORD PTR [rbp-168]
        mov     rbx, QWORD PTR [rbp-160]
        mov     QWORD PTR [rax+48], rcx
        mov     QWORD PTR [rax+56], rbx

Yes indeed, all the values from the two structs are being copied onto the stack, and on return, with some register shuffling, a Matrix is copied back (bear in mind it's copying blocks of memory here, not floats; it's a block of memory that happens to contain floats).

Now let's look at what happens with a pointer-only version of the same function:

        mov     rdx, QWORD PTR [rbp-24]
        mov     rax, QWORD PTR [rbp-16]
        mov     rsi, rdx
        mov     rdi, rax
        call    MatrixMultiply
        mov     QWORD PTR [rbp-32], rax

Well, that's an obvious slam dunk then! Pointers must be faster... but wait, it's not quite that simple. If we look inside the function, take this line:

result.m0 = left.m0*right.m0 + left.m1*right.m4 + left.m2*right.m8 + left.m3*right.m12;

and its pointer equivalent:

result->m0 = left->m0*right->m0 + left->m1*right->m4 + left->m2*right->m8 + left->m3*right->m12;

In the struct version this takes 16 instructions; the pointer version, however, takes 25. There is a penalty to be paid for the pointer indirection (dereferencing).

But wait, there's more! All this is without any kind of optimisation (-O0). Look again with -Ofast and yes, it has optimised the function, but it goes further and inlines the call (the function is never actually called), so our struct function call, as near as I can tell, comes to around 56 instructions (impressive!). The pointer version is interesting, as it uses the ret instruction in MatrixMultiply to return from the foo function, with the whole thing taking some 78 instructions.

So what's needed is some kind of simple test where values can't be optimised out (it must use a result based on all elements of the matrix) and where the two versions are as analogous as possible.

I strove to make the two test examples as fair as I could, with an eye to real-world use. This complicates the pointer version a little, as the end user of the API is responsible for allocating space for the result. The alternative would be for the function to allocate and the user to free, which would not be a fair test and would almost certainly explode in the hands of a novice programmer!

So let's see some typical timings for our examples. I'm looking at multiple runs and noting only typical user-mode times. The function foo is called 1,000,000 times; each call fills two matrices with random values and calls a function to multiply them together. It is a highly unscientific "test", but here are some results:

                    struct -O0   struct -Ofast   ptrs -O0   ptrs -Ofast
typical user time     0.219s        0.211s        0.243s      0.191s

Interestingly, unoptimised struct passing is quite respectable, even beating the unoptimised pointer version. Looking at the optimised times, passing structs really isn't that much of an overhead. "But it's nearly 10% faster!" I hear people cry; however, there really are practical considerations to weigh up.

How easy is it to use pointers? Okay, once you understand them they are nothing to be scared of, but how do they impact the readability of your code, and how easy will the code be to maintain and debug years down the line?

On top of that, consider how often you might be calling this function, or even every function call in a library like Raylib. Compared with the time spent elsewhere, this overhead is likely trivial; bear in mind that I had to call foo 1,000,000 times just to see a difference of 20ms overall, which works out to just 20ns per call.

So what's the take-home here? "Don't optimise early" comes to mind straight away. Sometimes there actually are more important things than raw speed; before even attempting to optimise code, you should wait until there is a definite problem and profile the application. Even then, you may be better off using a subtly different algorithm in the rest of your code before worrying about passing by value.

In conclusion, the answer to the question I posed at the start of the post: passing pointers is trivially faster, but do reread the previous paragraph...

For reference, here's the code I used for my highly unscientific test... You might notice that for convenience I'm using a union with an anonymous struct instead of the "official" Matrix struct; as far as I can see this has no impact on the timing or the code under test (the union's float array is used identically in both versions of the code). Also note that I'm printing a result (the f variable); this is an attempt to ensure that optimisation doesn't do something hinky!

#include <stdlib.h>
#include <stdio.h>

typedef union Matrix 
{
    float m[16];
    struct {
        float m0, m4, m8, m12;
        float m1, m5, m9, m13;
        float m2, m6, m10, m14;
        float m3, m7, m11, m15;
    };
} Matrix;

Matrix MatrixMultiply(Matrix left, Matrix right)
{
    Matrix result = { 0 };

    result.m0 = left.m0*right.m0 + left.m1*right.m4 + left.m2*right.m8 + left.m3*right.m12;
    result.m1 = left.m0*right.m1 + left.m1*right.m5 + left.m2*right.m9 + left.m3*right.m13;
    result.m2 = left.m0*right.m2 + left.m1*right.m6 + left.m2*right.m10 + left.m3*right.m14;
    result.m3 = left.m0*right.m3 + left.m1*right.m7 + left.m2*right.m11 + left.m3*right.m15;
    result.m4 = left.m4*right.m0 + left.m5*right.m4 + left.m6*right.m8 + left.m7*right.m12;
    result.m5 = left.m4*right.m1 + left.m5*right.m5 + left.m6*right.m9 + left.m7*right.m13;
    result.m6 = left.m4*right.m2 + left.m5*right.m6 + left.m6*right.m10 + left.m7*right.m14;
    result.m7 = left.m4*right.m3 + left.m5*right.m7 + left.m6*right.m11 + left.m7*right.m15;
    result.m8 = left.m8*right.m0 + left.m9*right.m4 + left.m10*right.m8 + left.m11*right.m12;
    result.m9 = left.m8*right.m1 + left.m9*right.m5 + left.m10*right.m9 + left.m11*right.m13;
    result.m10 = left.m8*right.m2 + left.m9*right.m6 + left.m10*right.m10 + left.m11*right.m14;
    result.m11 = left.m8*right.m3 + left.m9*right.m7 + left.m10*right.m11 + left.m11*right.m15;
    result.m12 = left.m12*right.m0 + left.m13*right.m4 + left.m14*right.m8 + left.m15*right.m12;
    result.m13 = left.m12*right.m1 + left.m13*right.m5 + left.m14*right.m9 + left.m15*right.m13;
    result.m14 = left.m12*right.m2 + left.m13*right.m6 + left.m14*right.m10 + left.m15*right.m14;
    result.m15 = left.m12*right.m3 + left.m13*right.m7 + left.m14*right.m11 + left.m15*right.m15;

    return result;
}

Matrix foo() 
{

    Matrix m1, m2, m3;

    for (int i=0; i<16; i++)
    {
        m1.m[i] = (float)rand()/(float)(RAND_MAX);
        m2.m[i] = (float)rand()/(float)(RAND_MAX);
    }

    m3 = MatrixMultiply(m1, m2);

    return m3;
}

int main() 
{
    float f = 0;
    
    for (int i= 0; i< 1000000; i++)
    {
        Matrix m = foo();
        f += m.m[0] + m.m[1] + m.m[2] + m.m[3];
        f += m.m[4] + m.m[5] + m.m[6] + m.m[7];
        f += m.m[8] + m.m[9] + m.m[10] + m.m[11];
        f += m.m[12] + m.m[13] + m.m[14] + m.m[15];
    }
    
    printf("f = %f\n", f);
}

And here's the pointer version:

#include <stdlib.h>
#include <stdio.h>

typedef union Matrix 
{
    float m[16];
    struct {
        float m0, m4, m8, m12;
        float m1, m5, m9, m13;
        float m2, m6, m10, m14;
        float m3, m7, m11, m15;
    };
} Matrix;

void MatrixMultiply(Matrix* result, Matrix* left, Matrix* right)
{
    result->m0 = left->m0*right->m0 + left->m1*right->m4 + left->m2*right->m8 + left->m3*right->m12;
    result->m1 = left->m0*right->m1 + left->m1*right->m5 + left->m2*right->m9 + left->m3*right->m13;
    result->m2 = left->m0*right->m2 + left->m1*right->m6 + left->m2*right->m10 + left->m3*right->m14;
    result->m3 = left->m0*right->m3 + left->m1*right->m7 + left->m2*right->m11 + left->m3*right->m15;
    result->m4 = left->m4*right->m0 + left->m5*right->m4 + left->m6*right->m8 + left->m7*right->m12;
    result->m5 = left->m4*right->m1 + left->m5*right->m5 + left->m6*right->m9 + left->m7*right->m13;
    result->m6 = left->m4*right->m2 + left->m5*right->m6 + left->m6*right->m10 + left->m7*right->m14;
    result->m7 = left->m4*right->m3 + left->m5*right->m7 + left->m6*right->m11 + left->m7*right->m15;
    result->m8 = left->m8*right->m0 + left->m9*right->m4 + left->m10*right->m8 + left->m11*right->m12;
    result->m9 = left->m8*right->m1 + left->m9*right->m5 + left->m10*right->m9 + left->m11*right->m13;
    result->m10 = left->m8*right->m2 + left->m9*right->m6 + left->m10*right->m10 + left->m11*right->m14;
    result->m11 = left->m8*right->m3 + left->m9*right->m7 + left->m10*right->m11 + left->m11*right->m15;
    result->m12 = left->m12*right->m0 + left->m13*right->m4 + left->m14*right->m8 + left->m15*right->m12;
    result->m13 = left->m12*right->m1 + left->m13*right->m5 + left->m14*right->m9 + left->m15*right->m13;
    result->m14 = left->m12*right->m2 + left->m13*right->m6 + left->m14*right->m10 + left->m15*right->m14;
    result->m15 = left->m12*right->m3 + left->m13*right->m7 + left->m14*right->m11 + left->m15*right->m15;
}


Matrix *m1, *m2;
    
void foo(Matrix* m3) 
{

    for (int i=0; i<16; i++)
    {
        m1->m[i] = (float)rand()/(float)(RAND_MAX);
        m2->m[i] = (float)rand()/(float)(RAND_MAX);
    }

    MatrixMultiply(m3, m1, m2);

}

int main() 
{
    Matrix* m = malloc(sizeof(Matrix));
    m1 = malloc(sizeof(Matrix));
    m2 = malloc(sizeof(Matrix));  
    
    float f = 0;
    
    for (int i= 0; i< 1000000; i++)
    {
        foo(m);
        f += m->m[0] + m->m[1] + m->m[2] + m->m[3];
        f += m->m[4] + m->m[5] + m->m[6] + m->m[7];
        f += m->m[8] + m->m[9] + m->m[10] + m->m[11];
        f += m->m[12] + m->m[13] + m->m[14] + m->m[15];

    }
    printf("f = %f\n", f);
    free(m1);
    free(m2);
    free(m);
}

Enjoy!