Optimizing loops
written on April 9, 2020
categories: engineering, embedded
Up until quite recently, most of the C code I wrote didn't need to be that optimized: it was either for personal or university projects that ran on proper x86 processors and thus didn't need much low-level optimization. For all of them you could easily get away with good algorithmic design; the goal wasn't to gain a few milliseconds of execution time, but rather to not have your program run for 3-4 hours in order to spit out a result. I wrote my fair share of embedded C code in uni — after all, this is what I majored in — but as with most uni projects there weren't really any heavy constraints, because the goal is to learn, not necessarily to write the most optimized code possible.
However, in the industry we often do go after the most optimized code. So, inspired by a problem I came across, I'd like to talk about a very well known — and a tad bit controversial (?) — topic in firmware: optimizing loops. So grab a cup of your favorite beverage and come along for the ride!
Getting into context
I could talk about this in an abstract, generic context, however I think problems like these are way more interesting if they actually relate to some real-world problem; after all, this is what engineering is all about.
So, I came across a very simple task: convert images stored in YUV format to simple grayscale. For the people who know what those are, you can jump ahead, but if you don't, here's a simple explanation: YUV is a very well known color encoding system that's used mainly because it presents the same information as RGB, but in a more compressible form. Basically, instead of splitting the colors of an image into three RGB channels (Red, Green and Blue), we split them into one luminance channel (Y) and two chrominance channels (U and V).
I'm not going to go into detail on how that's done or what it means, but the gist of it is that our human eyes are way more attentive to the luminance of an image rather than its chrominance; that means that we can compress the U and V channels quite a lot before it becomes noticeable. If you try to do the same with any of the three RGB channels, it immediately, visibly deteriorates the image. I guess the way I have intuitively "justified" to myself why that works is because the Y channel is essentially the black & white version of the image, and in some way it contains the "structure" of the image. Another way to put it is that, if I show you a picture of a dog, you can tell that it's a dog whether the picture is colored or not; the important information lies in the Y channel, not in the chrominance of the image.
Now that you get what YUV is, let's talk about how it's expressed in code. There are
multiple different YUV formats for different types of file sizes, compression etc. In
this problem, we're working with YUV422, which simply means that for every U sample and
every V sample there are two Y samples. I really don't wanna say
pixels here, because that would be misleading as the 422 actually refers to sampling,
but anyway it just means that we have twice as much Y as we have U and V. In code, the
YUV422 images are stored in uint8_t
arrays, and they're laid out as follows:
uint8_t image[] = { Y, U, Y, V, Y, U, Y, V, Y, U, Y, V, ... }
Of course, each of these elements takes up 8 bits, so we end up with 32 bits or 4 bytes for 2 pixels of image. In order to convert this image to grayscale, all we need to do is cherry-pick the Y elements (basically the even indices of the array) and put them in a new array to create a grayscale image. Let's create a basic setup to test our conversion code!
A basic profiling setup
So, I'm going to create a small setup in order to easily profile my solutions. Here's the "project structure":
└── opti-loop/
  ├── main.c
  └── Makefile
The Makefile is pretty simple, I just want it to automatically generate the assembly code as well so that I can check what my code is doing under the hood:
Makefile
CC = gcc
CFLAGS = -Wall -O0

all: main

main: main.s
	$(CC) $(CFLAGS) main.s -o main

main.s: main.c
	$(CC) $(CFLAGS) -S main.c

clean:
	rm -rf main.s

.PHONY: all clean
I also omit the -g
option (because that fills up the assembly file with a bunch of
unnecessary things, which will make it harder for me to understand the assembly code),
and I force it to not optimize the code (-O0
) just to see what I can do to make things
faster by myself.
Now let's see what the main.c
file is all about. First, we need to define some
constants:
main.c
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <string.h>
#define WIDTH 1920
#define HEIGHT 1080
#define TOTAL_PIXELS (WIDTH * HEIGHT)
#define GRAYSCALE_PIXELS (TOTAL_PIXELS / 2)
#define ITERATIONS 500
typedef void (* gc_t)(uint8_t *, uint8_t *, uint32_t);
I'm going to use stdint.h
because I eventually want to test on an MCU and thus I want
to be conservative with the memory. Apart from that, stdio.h
allows us to get access
to printf
, time.h
allows us to profile our functions and string.h
contains
memset
which is handy when you want to set entire blocks of memory to a certain value.
We obviously need to define the width and height of our image, and from that calculate the total pixels it will consist of (the grayscale version is half the size because, as explained before, we're only keeping the Y channel).
Lastly, ITERATIONS
refer to the number of times each function will run in order to
average out multiple execution times and get more stable results. Especially in a
desktop x86 processor, we really don't know what else could be running and thus affect,
albeit not much, our profiling.
The function pointer type I have created on that last line represents the form of our
grayscale converter functions: each one takes two uint8_t
pointers (the original and the grayscale image arrays) and
a uint32_t
integer (the original image size in bytes) as input, writes the converted pixels into
the grayscale array and returns nothing.
NOTE: I am avoiding dynamic memory allocation here because I want to keep the code bare-metal friendly. You could absolutely do this using non-static data structures.
Then, I need a function to set up my original "image":
main.c
void setup_image (uint8_t * image, uint32_t size)
{
uint32_t i = 0;
// Set elements at even indices to 1 and the rest to 0
for (i = 0; i < size; i++)
image[i] = !(i % 2);
}
I'm just representing all the Y bytes by a 1 and all the U and V bytes by a 0. I'd also like to test if the conversion was successful (you never know, we might screw something up while trying to get it to work faster):
main.c
uint8_t test_conversion (uint8_t * image, uint32_t size)
{
uint32_t i = 0;
// Check that all the elements are equal to 1
for (i = 0; i < size; i++)
if (!image[i]) return 0;
return 1;
}
And finally, I want a way to time my conversion functions, and display their average runtime:
main.c
uint8_t time_conversion (const char * name, gc_t gc_fun,
uint8_t * original_img, uint8_t * grayscale_img,
uint32_t original_size, uint32_t iterations)
{
int i = 0;
uint8_t ret = 0;
clock_t start;
clock_t stop;
double average_time = 0.0;
// First, check that the conversion function works
gc_fun(original_img, grayscale_img, original_size);
ret = test_conversion(grayscale_img, original_size / 2);
if (!ret) {
printf("ERROR: Conversion failed for %s.\n", name);
return 0;
}
// Time the function for a number of iterations
start = clock();
for (i = 0; i < iterations; i++)
gc_fun(original_img, grayscale_img, original_size);
stop = clock();
// Calculate the average time (in microseconds)
average_time = 1e6 * ((double) stop - start) / CLOCKS_PER_SEC / iterations;
printf("%-40s: %f us\n", name, average_time);
// Wipe the grayscale image clean
memset(grayscale_img, 0, original_size / 2);
return 1;
}
This function does a few things:
- It checks that the conversion function passed to it indeed works
- It times it for a given number of iterations
- It displays its name and its average execution time
- It "resets" the grayscale image by setting all its elements to 0
Granted, all that setup is maybe a bit annoying, but now we can easily test all of our
converters! Suppose we had a bunch of converter functions called my_gc_1, my_gc_2,
my_gc_3 etc; then this is how our main() would look:
main.c
int main (int argc, char * argv[])
{
// Image data
uint8_t original_image[TOTAL_PIXELS];
uint8_t grayscale_image[GRAYSCALE_PIXELS];
uint8_t ret = 0;
// Conversion functions
gc_t gc_list[] = {
my_gc_1,
my_gc_2,
my_gc_3
};
char * gc_names[] = {
"my_gc_1",
"my_gc_2",
"my_gc_3"
};
// Counter
int i = 0;
setup_image((uint8_t *) original_image, TOTAL_PIXELS);
for (i = 0; i < sizeof(gc_list) / sizeof(gc_t); i++) {
ret = time_conversion(gc_names[i], gc_list[i],
(uint8_t *) original_image,
(uint8_t *) grayscale_image,
TOTAL_PIXELS, ITERATIONS);
if (!ret) return 1;
}
return 0;
}
We basically have an array of pointers to grayscale conversion functions and an array of
the names of those functions, and by passing all that to our time_conversion
function
we get instant metrics on how our functions perform.
That was a lot of setup for such a simple problem! Let's get to actually trying out ideas.
Implementing on x86
We could start by the first thing that comes to mind, which — for me at least — is just a for-loop that picks all the even indices and puts them in the grayscale array:
main.c
void naive_gc (uint8_t * original, uint8_t * grayscale, uint32_t original_size)
{
uint32_t i = 0;
for (i = 0; i < original_size / 2; i++)
grayscale[i] = original[2 * i];
}
Now all I need to do is add it to the array of function pointers and to the array of
function names in main()
:
main.c
// Conversion functions
gc_t gc_list[] = {
naive_gc
};
char * gc_names[] = {
"naive_gc"
};
And now we can finally run the program for the first time:
$ make
gcc -Wall -O0 -S main.c
gcc -Wall -O0 main.s -o main
$ ./main
naive_gc : 1731.496000 us
So this naive solution is, as expected, not that great. Even on my 3.7 GHz desktop processor it takes 1.7 milliseconds to run, and I would expect it to be at least 10 to 50 times slower on an MCU, which generally runs at 10-500 MHz. If you don't get why 1.7 ms is slow, consider that at 30 frames per second you only get about 33 ms per frame, and embedded systems often deal with hard real-time constraints. Because the processor is slower, everything is slower, so gaining a few milliseconds here and there actually makes a lot of difference.
What is slowing the function down? Well, writing the for-loop limit condition as
original_size / 2
is certainly convenient for us, the programmers, but I'm pretty sure
the compiler doesn't optimize that out in -O0
mode. Let's look at the assembly code
for naive_gc
:
main.s
naive_gc:
.LFB3:
; Function setup
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movq %rdi, -24(%rbp)
movq %rsi, -32(%rbp)
movl %edx, -36(%rbp)
; ------------------------------------------------------------------------------
movl $0, -4(%rbp) ; int i = 0
movl $0, -4(%rbp) ; i = 0 (for-loop initialization)
jmp .L17 ; begin the for-loop
.L18:
movl -4(%rbp), %eax ; EAX = i
addl %eax, %eax ; EAX += EAX
movl %eax, %edx ; EDX = EAX
movq -24(%rbp), %rax ; RAX = &original
addq %rdx, %rax ; RAX += RDX (=EDX)
movl -4(%rbp), %ecx ; ECX = i
movq -32(%rbp), %rdx ; RDX = &grayscale
addq %rcx, %rdx ; RDX += RCX (=ECX)
movzbl (%rax), %eax ; EAX = load byte from RAX (extend with 0s)
movb %al, (%rdx) ; store AL (low byte of EAX) at address RDX
addl $1, -4(%rbp) ; i++
.L17:
movl -36(%rbp), %eax ; EAX = original_size
shrl %eax ; EAX >>= 1 (= original_size / 2)
cmpl %eax, -4(%rbp) ; compare(EAX, i) = EAX - i
jb .L18 ; jump if below, aka jump if carry flag = 1
; ------------------------------------------------------------------------------
; End of function
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
The interesting part lies between the lines I've put in the .s
file; the rest is just
boilerplate function prologue/epilogue. Now, as you can probably see, .LFB3
is the
"initialization" phase of the function, .L18
is the body of the for-loop and .L17
is the for-loop increment/condition test. The latter is what slows our code down
considerably: at the end of each iteration, we need to recalculate the end condition
(remember, original_size / 2
) and compare it to i
, which typically is done by moving
the end condition number to a register, subtracting i
from it, and checking the flags
register for the result. We will, of course, need to check the value of i
at each
iteration, but there's no need to recalculate original_size / 2
because its value will
not change.
As a first improvement, let's move the end condition to its own variable so that we won't have to recalculate it on each iteration:
main.c
void naive_precalculate_gc (uint8_t * original,
uint8_t * grayscale,
uint32_t original_size)
{
uint32_t i = 0;
uint32_t size = original_size / 2;
for (i = 0; i < size; i++)
grayscale[i] = original[2 * i];
}
Now let's add this new function to our two lists in main()
, make
and run the
program:
$ ./main
naive_gc : 1738.254000 us
naive_precalculate_gc : 1482.010000 us
That certainly is faster! Indeed, by checking the assembly code we can now see that the third function "block" contains fewer instructions, as it just has to look in the stack for the end condition rather than recalculate it.
This is still pretty slow though. Can we get it to go faster? Let's borrow an old firmware trick, descending loops:
main.c
void naive_descending_gc (uint8_t * original,
uint8_t * grayscale,
uint32_t original_size)
{
int32_t i = original_size / 2;
while (i--)
grayscale[i] = original[2 * i];
}
As you can see, this function also calculates the end condition just once; furthermore,
it exploits the fact that 0 is a very "special" value. Think about it: when comparing
i
to an arbitrary value X
, we need to subtract the one from the other and look at
the result. If we instead compare i
to 0, there are faster ways to check if they are
equal. Let's take a look at the assembly code of the new function:
main.s
naive_descending_gc:
.LFB4:
; Function setup
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movq %rdi, -24(%rbp)
movq %rsi, -32(%rbp)
movl %edx, -36(%rbp)
; ------------------------------------------------------------------------------
movl -36(%rbp), %eax ; EAX = original_size
shrl %eax ; EAX >>= 1 (= original_size / 2)
movl %eax, -4(%rbp) ; i = EAX (for-loop initialization)
jmp .L20 ; begin the for-loop
.L21:
movl -4(%rbp), %eax ; EAX = i
addl %eax, %eax ; EAX += EAX
movslq %eax, %rdx ; RDX = EAX (sign-extended)
movq -24(%rbp), %rax ; RAX = &original
addq %rdx, %rax ; RAX += RDX
movl -4(%rbp), %edx ; EDX = i
movslq %edx, %rcx ; RCX = EDX (sign-extended)
movq -32(%rbp), %rdx ; RDX = &grayscale
addq %rcx, %rdx ; RDX += RCX
movzbl (%rax), %eax ; EAX = load byte from RAX (extend with 0s)
movb %al, (%rdx) ; store AL (low byte of EAX) at address RDX
.L20:
movl -4(%rbp), %eax ; EAX = i
leal -1(%rax), %edx ; EDX = EAX - 1
movl %edx, -4(%rbp) ; i = EDX
testl %eax, %eax ; if EAX (result of i--) != 0, ZF = 0,
; otherwise ZF = 1
jne .L21 ; jump if ZF = 0
; ------------------------------------------------------------------------------
; End of function
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
As before, the interesting part lies between the lines. Take a look at .L20
: we're now
just subtracting one from i
, then running the test
instruction, which performs an AND
operation between %eax
and itself and sets the flags according to the result. AND
ing a value with itself just yields that same value, so as long as the old value of
i
(which is what i--
evaluates to) is nonzero, the result is nonzero, the zero flag
stays clear and jne
loops again. On the contrary, when we get to i = 0
, this AND
operation returns 0, the zero flag is set, and the loop is over.
I'm not certain about what I'm going to say, but I'm pretty sure this is faster mostly
because comparing against zero is cheap: test
only needs the register it already holds, whereas the cmp
in the ascending versions has to fetch the end condition from the stack and subtract it
at every iteration. By the way, if you're wondering what the "l" is at the end
of testl
, that's just the size of the data being handled. In this case, "l" means
"doubleword", basically a 32-bit integer.
Let's add that function to the lists and run our program again:
$ ./main
naive_gc : 1770.938000 us
naive_precalculate_gc : 1483.474000 us
naive_descending_gc : 1377.706000 us
And that's indeed faster! We can go faster though.
Another thing you might have noticed is that our loop has quite a few iterations, especially as the resolution of the image increases. A simple way to address this is to lower the amount of iterations by manipulating data with bigger pointers. Let me show you what I mean:
---------------64----------------
-------32-------|-------32-------
8 8 8 8 8 8 8 8
original: | Y | U | Y | V | Y | U | Y | V |
grayscale: | Y | Y | Y | Y | Y | Y | Y | Y |
As you can see, there are as many Ys in 64 bits of the original image as there are Ys in 32 bits of the grayscale image. That means we can do the following:
main.c
void large_pointers_gc (uint8_t * original,
uint8_t * grayscale,
uint32_t original_size)
{
uint32_t i = 0;
uint64_t * p_original = (uint64_t *) original;
uint32_t * p_grayscale = (uint32_t *) grayscale;
uint32_t size = original_size / 8;
for (i = 0; i < size; i++)
p_grayscale[i] = (p_original[i] & 0x00000000000000ffU) +
((p_original[i] & 0x0000000000ff0000U) >> 8) +
((p_original[i] & 0x000000ff00000000U) >> 16) +
((p_original[i] & 0x00ff000000000000U) >> 24);
}
All we need to do is grab the 4 Y bytes found in each 64-bit chunk of the original image, shift them into their appropriate locations in the 32-bit chunk of the grayscale image, then add them up to form the final 32-bit "vector". We need to divide the end condition by 8 this time because, well, each uint64_t read now consumes 8 bytes of the original image instead of 1!
This is, indeed, quite a bit faster:
$ ./main
naive_gc : 1753.022000 us
naive_precalculate_gc : 1481.812000 us
naive_descending_gc : 1382.170000 us
large_pointers_gc : 723.274000 us
We've almost cut the execution time in half! Let's also make a descending version of that last function:
main.c
void large_pointers_descending_gc (uint8_t * original,
uint8_t * grayscale,
uint32_t original_size)
{
uint32_t i = original_size / 8;
uint64_t * p_original = (uint64_t *) original;
uint32_t * p_grayscale = (uint32_t *) grayscale;
while (i--)
p_grayscale[i] = (p_original[i] & 0x00000000000000ffU) +
((p_original[i] & 0x0000000000ff0000U) >> 8) +
((p_original[i] & 0x000000ff00000000U) >> 16) +
((p_original[i] & 0x00ff000000000000U) >> 24);
}
But is it faster?
$ ./main
naive_gc : 1752.600000 us
naive_precalculate_gc : 1488.222000 us
naive_descending_gc : 1380.016000 us
large_pointers_gc : 726.352000 us
large_pointers_descending_gc : 712.570000 us
Well... yes and no. On my system, it sometimes is slower and sometimes faster, only by a few microseconds, although sometimes (like this one) the difference is greater.
Let's also try partially unrolling the loop: this also cuts down on iterations and, at least in theory, should make our code faster.
main.c
void large_pointers_unrolled_gc (uint8_t * original,
uint8_t * grayscale,
uint32_t original_size)
{
uint32_t i = 0;
uint64_t * p_original = (uint64_t *) original;
uint32_t * p_grayscale = (uint32_t *) grayscale;
uint32_t size = original_size / 8;
for (i = 0; i < size; i += 4) {
p_grayscale[i] = (p_original[i] & 0x00000000000000ffU) +
((p_original[i] & 0x0000000000ff0000U) >> 8) +
((p_original[i] & 0x000000ff00000000U) >> 16) +
((p_original[i] & 0x00ff000000000000U) >> 24);
p_grayscale[i + 1] = (p_original[i + 1] & 0x00000000000000ffU) +
((p_original[i + 1] & 0x0000000000ff0000U) >> 8) +
((p_original[i + 1] & 0x000000ff00000000U) >> 16) +
((p_original[i + 1] & 0x00ff000000000000U) >> 24);
p_grayscale[i + 2] = (p_original[i + 2] & 0x00000000000000ffU) +
((p_original[i + 2] & 0x0000000000ff0000U) >> 8) +
((p_original[i + 2] & 0x000000ff00000000U) >> 16) +
((p_original[i + 2] & 0x00ff000000000000U) >> 24);
p_grayscale[i + 3] = (p_original[i + 3] & 0x00000000000000ffU) +
((p_original[i + 3] & 0x0000000000ff0000U) >> 8) +
((p_original[i + 3] & 0x000000ff00000000U) >> 16) +
((p_original[i + 3] & 0x00ff000000000000U) >> 24);
}
}
Well that sure is uglier... how about speed?
$ ./main
naive_gc : 1759.348000 us
naive_precalculate_gc : 1485.786000 us
naive_descending_gc : 1389.732000 us
large_pointers_gc : 719.982000 us
large_pointers_descending_gc : 713.292000 us
large_pointers_unrolled_gc : 780.634000 us
So that messy code isn't worth it. It actually makes sense: unrolling does make our code iterate less, but it also adds load to the body of the for-loop. There's a tradeoff between the two, and here we have clearly crossed the line.
Well that's pretty interesting, but this whole time we have been just using our ideas
to make our algorithm faster: let's ask the compiler for some help! I'm going to add
-O3
in my compiler flags and make
everything again. Let's see what that does:
$ make clean all
rm -rf main.s
gcc -Wall -O3 -S main.c
gcc -Wall -O3 main.s -o main
$ ./main
naive_gc : 69.988000 us
naive_precalculate_gc : 69.516000 us
naive_descending_gc : 326.612000 us
large_pointers_gc : 162.852000 us
large_pointers_descending_gc : 331.648000 us
large_pointers_unrolled_gc : 138.106000 us
Oh how the tables have turned! It seems that almost all of our "optimized" functions are
actually way slower than the "naive" function we started with. That's actually pretty
unsurprising as well; gcc
(as most compilers) does an excellent job of optimizing
code when you tell it to. We can see, however, that calculating the end condition once
does improve execution time a bit!
These are some very interesting results, but I want to see what happens on an actual MCU; the compiler is not exactly the same, and there's no underlying OS. Let's see if we can come up with better algorithms on an MCU!
Implementing on ARM Cortex-M3
I happen to own a NUCLEO-F103RB board by ST, which will do nicely for this test. If you want to follow along, you can check out Ayoub Sabri's STM32 primer1, in which he explains how to set up a basic blinky project on STM32, and you can move from there. As for profiling, I'm using Sergey Bashlayev's STM32 Profiler, because it's both lightweight and microsecond-accurate.
Keep in mind that I heavily reduced the image dimensions for this test so as to not
trigger any stack overflows; the board I'm using only has 20 KB of SRAM so I'm reducing
the dimensions to 80 by 60. If you want to compare the execution times, keep in mind
that the system clock's frequency is 64 MHz. I've compiled with the -O3
flag, and it
should be noted that, at least in this case, I see no difference between -O3
and
-Ofast
. Other than that, the same algorithms are used. Let's see the results:
naive_gc : 450.658 us
naive_precalculate_gc : 375.646 us
naive_descending_gc : 375.616 us
large_pointers_gc : 197.45 us
large_pointers_descending_gc : 216.188 us
large_pointers_unrolled_gc : 195.386 us
Okay, so we can see that different compilers (here I'm using arm-none-eabi-gcc
btw)
behave very differently. Now the first naive function is the slowest, and the large
pointers seem to make a lot of difference. The best result so far is the unrolled
version of the large pointers, so what would happen if we were to unroll... more?
main.c
void large_pointers_unrolled_more_gc (uint8_t * original,
uint8_t * grayscale,
uint32_t original_size)
{
uint32_t i = 0;
uint64_t * p_original = (uint64_t *) original;
uint32_t * p_grayscale = (uint32_t *) grayscale;
uint32_t size = original_size / 8;
for (i = 0; i < size; i += 8) {
p_grayscale[i] = (p_original[i] & 0x00000000000000ffU) +
((p_original[i] & 0x0000000000ff0000U) >> 8) +
((p_original[i] & 0x000000ff00000000U) >> 16) +
((p_original[i] & 0x00ff000000000000U) >> 24);
p_grayscale[i + 1] = (p_original[i + 1] & 0x00000000000000ffU) +
((p_original[i + 1] & 0x0000000000ff0000U) >> 8) +
((p_original[i + 1] & 0x000000ff00000000U) >> 16) +
((p_original[i + 1] & 0x00ff000000000000U) >> 24);
p_grayscale[i + 2] = (p_original[i + 2] & 0x00000000000000ffU) +
((p_original[i + 2] & 0x0000000000ff0000U) >> 8) +
((p_original[i + 2] & 0x000000ff00000000U) >> 16) +
((p_original[i + 2] & 0x00ff000000000000U) >> 24);
p_grayscale[i + 3] = (p_original[i + 3] & 0x00000000000000ffU) +
((p_original[i + 3] & 0x0000000000ff0000U) >> 8) +
((p_original[i + 3] & 0x000000ff00000000U) >> 16) +
((p_original[i + 3] & 0x00ff000000000000U) >> 24);
p_grayscale[i + 4] = (p_original[i + 4] & 0x00000000000000ffU) +
((p_original[i + 4] & 0x0000000000ff0000U) >> 8) +
((p_original[i + 4] & 0x000000ff00000000U) >> 16) +
((p_original[i + 4] & 0x00ff000000000000U) >> 24);
p_grayscale[i + 5] = (p_original[i + 5] & 0x00000000000000ffU) +
((p_original[i + 5] & 0x0000000000ff0000U) >> 8) +
((p_original[i + 5] & 0x000000ff00000000U) >> 16) +
((p_original[i + 5] & 0x00ff000000000000U) >> 24);
p_grayscale[i + 6] = (p_original[i + 6] & 0x00000000000000ffU) +
((p_original[i + 6] & 0x0000000000ff0000U) >> 8) +
((p_original[i + 6] & 0x000000ff00000000U) >> 16) +
((p_original[i + 6] & 0x00ff000000000000U) >> 24);
p_grayscale[i + 7] = (p_original[i + 7] & 0x00000000000000ffU) +
((p_original[i + 7] & 0x0000000000ff0000U) >> 8) +
((p_original[i + 7] & 0x000000ff00000000U) >> 16) +
((p_original[i + 7] & 0x00ff000000000000U) >> 24);
}
}
Okay, so you should definitely never ever write code like this, but for the sake of being thorough let's run it and see if it's faster:
naive_gc : 450.658 us
naive_precalculate_gc : 375.646 us
naive_descending_gc : 375.616 us
large_pointers_gc : 197.45 us
large_pointers_descending_gc : 216.188 us
large_pointers_unrolled_gc : 195.386 us
large_pointers_unrolled_more_gc : 200.178 us
Not really. Yet again, we can see that more unrolling doesn't always mean more optimization.
Conclusions
I think this little experiment has made it very clear that optimization depends heavily on the platform we are working on, meaning the hardware as well as the toolchain we use to generate our binaries; as we saw, two different compilers may optimize things very differently. That said, there are a few optimization techniques we can keep in mind (at least when working on embedded platforms):
- Use -O3 or -Ofast. Compilers are, most of the time, very clever about optimizing code. It takes a few keystrokes to add a flag to your Makefile, but it can end up cutting down on your execution time by a lot. Use that to your advantage.
- Avoid repeating calculations, memory accesses etc. Most loops will run for quite a few iterations, so all these small overheads add up. Try to have the minimum number of memory reads/writes and (especially costly) operations in your loops.
- Favor descending loops. We saw that in desktop processors this doesn't seem to help all that much, however in MCUs it makes a world of difference (just compare the execution times of naive_gc and naive_descending_gc when run on the STM32 MCU).
- If you really need to cut down on time, give unrolling a try. It might not always work or your compiler might do it automatically when optimizing for speed, but it's worth a try. However, keep in mind that there's a tradeoff between loop iterations and instructions in the loop, so don't overdo it; try and find the sweet spot where it actually makes your code run faster.
Also, a final point on MCUs: be very careful about how you profile your code. Doing
things that seem trivial in a desktop processor (such as calls to printf
to print some
debugging data) can noticeably slow down your code. Also try and use the lowest level of
timing possible in order to get maximum precision; the profiler I linked to in the
previous section performs extremely well in ARM Cortex-M CPUs, because it uses the
debug interface of the CPU which was precisely made for profiling.
If you have any questions or suggestions for additions to this post, don't hesitate to send me an email!
Acknowledgments
I'd like to thank Ayoub for his input on my optimization methodology, as well as his advice on how to efficiently profile code on STM32. :)
Code
In case you want to quickly test this for yourself, the full code for the desktop version is simply the snippets above assembled into a single main.c: the includes and defines first, then the conversion functions, followed by setup_image, test_conversion, time_conversion and main().
1. Unfortunately, the link is now dead :( ↩