Optimizing loops
written on April 9, 2020
Up until quite recently, most of the C code I wrote didn't need to be that optimized: it was either for personal or university projects that ran on proper x86 processors and thus didn't need much low-level optimization. For all of them you could easily get away with good algorithmic design; the goal wasn't to gain a few milliseconds of execution time, but rather to not have your program run for 3-4 hours in order to spit out a result. I wrote my fair share of embedded C code in uni—after all, this is what I majored in—however as with most uni projects there aren't really any heavy constraints, because the goal is to learn, not necessarily to write the most optimized code possible.
In the industry we often do go after the most optimized code. So, inspired by a problem I came across, I'd like to talk about a very well known—and a tad bit controversial (?)—topic in firmware: optimizing loops. So grab a cup of your favorite beverage and come along for the ride!
Getting into context
I could talk about this in an abstract, generic context, however I think problems like these are way more interesting if they actually relate to some real-world problem; after all, this is what engineering is all about.
So, I came across a very simple task: convert images stored in YUV format to simple grayscale. For the people who know what those are, you can jump ahead, but if you don't, here's a simple explanation: YUV is a very well known color encoding system that's used mainly because it presents the same information as RGB, but in a more compressible form. Basically, instead of splitting the colors of an image into three RGB channels (Red, Green and Blue), we split them into one luminance channel (Y) and two chrominance channels (U and V).

I'm not going to go into detail on how that's done or what it means, but the gist of it is that
our human eyes are way more attentive to the luminance of an image rather than its chrominance;
that means that we can compress the U and V channels quite a lot before it becomes
noticeable. If you try to do the same with any of the three RGB channels, it immediately,
visibly deteriorates the image. I guess the way I have intuitively justified
to myself
why that works is because the Y channel is essentially the black and white version of the
image, and in some way it contains the structure
of the image. Another way to put it is
that, if I show you a picture of a dog, you can tell that it's a dog whether the picture is
colored or not; the important information lies in the Y channel, not in the
chrominance of the image.
Now that you get what YUV is, let's talk about how it's expressed in code. There are multiple different YUV formats for different file sizes, compression schemes etc. In this problem, we're working with YUV422, which simply means that for every one U and one V sample there are two Y samples. I really don't wanna say pixels here, because that would be misleading as the 422 actually refers to sampling, but anyway it just means that we have twice as much Y as we have U, and twice as much as we have V. In code, the YUV422 images are stored in uint8_t arrays, and they're laid out as follows:
uint8_t image[] = { Y, U, Y, V, Y, U, Y, V, Y, U, Y, V, ... }
Of course, each of these elements takes up 8 bits, so we end up with 32 bits or 4 bytes for 2 pixels of image. In order to convert this image to grayscale, all we need to do is cherry-pick the Y elements (basically the even indices of the array) and put them in a new array to create a grayscale image. Let's create a basic setup to test our conversion code!
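To make the layout concrete, here's a tiny worked example with arbitrary byte values (not part of the benchmark code, just an illustration of what we want to extract):
uint8_t original[8]  = { 200, 64, 190, 80, 180, 64, 170, 80 }; /* Y U Y V Y U Y V */
uint8_t grayscale[4] = { 200, 190, 180, 170 };                 /* just the Y bytes */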
A basic profiling setup
So, I'm going to create a small setup in order to easily profile my solutions. Here's the
project structure
:
opti-loop/
├── main.c
└── Makefile
The Makefile is pretty simple, I just want it to automatically generate the assembly code as well so that I can check what my code is doing under the hood:
Makefile
CC = gcc
CFLAGS = -Wall -O0
all: main
main: main.s
$(CC) $(CFLAGS) main.s -o main
main.s: main.c
$(CC) $(CFLAGS) -S main.c
clean:
rm -rf main.s
I also omit the -g
option (because that fills up the
assembly file with a bunch of unnecessary things, which will make it harder for me to
understand it), and I force it to not optimize the code (-O0
) just to see what I can do to make things faster by
myself.
Now let's see what the main.c
file is all about. First,
we need to define some constants:
main.c
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <string.h>
#define WIDTH 1920
#define HEIGHT 1080
#define TOTAL_PIXELS (WIDTH * HEIGHT)
#define GRAYSCALE_PIXELS (TOTAL_PIXELS / 2)
#define ITERATIONS 500
typedef void (* gc_t)(uint8_t *, uint8_t *, uint32_t);
I'm going to use stdint.h
because I eventually want to
test on a microcontroller (MCU) and thus I want to be conservative with the memory. Apart from
that, stdio.h
allows us to get access to printf
, time.h
allows
us to profile our functions and string.h
contains memset
which is handy when you want to set entire blocks of
memory to a certain value.
We obviously need to define the width and height of our image, and from that calculate the total pixels it will consist of (the grayscale version is half the size because, as explained before, we're only keeping the Y channel).
Lastly, ITERATIONS is the number of times each function will run, so that we can average out multiple execution times and get more stable results. Especially on a desktop x86 processor, we really don't know what else could be running and thus affecting, albeit not by much, our profiling.
The function pointer I have created on that last line represents the form of our grayscale converter functions: they take two uint8_t pointers (the image arrays) and a uint32_t integer (the original image size) as input, and write the result straight into the grayscale buffer instead of returning anything.
NOTE: I am avoiding dynamic memory allocation here because I want to keep the code bare-metal friendly. You could absolutely do this using non-static data structures.
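For what it's worth, if you did want the dynamic version, it would just be a couple of malloc calls inside main(); here's a purely illustrative sketch (it needs stdlib.h and is not used anywhere else in this post):
/* Hypothetical heap-allocated alternative to the static arrays used below */
uint8_t * original_image  = malloc(TOTAL_PIXELS);
uint8_t * grayscale_image = malloc(GRAYSCALE_PIXELS);
if (original_image == NULL || grayscale_image == NULL)
    return 1; /* allocation failed */
/* ... run the conversions exactly as before ... */
free(original_image);
free(grayscale_image);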
Then, I need a function to set up my original image
:
main.c
void setup_image (uint8_t * image, uint32_t size)
{
uint32_t i = 0;
// Set elements at even indices to 1 and the rest to 0
for (i = 0; i < size; i++)
image[i] = !(i % 2);
}
I'm just representing all the Y bytes by a 1 and all the U and V bytes by a 0. I'd also like to test if the conversion was successful (you never know, we might screw something up while trying to get it to work faster):
main.c
uint8_t test_conversion (uint8_t * image, uint32_t size)
{
uint32_t i = 0;
// Check that all the elements are equal to 1
for (i = 0; i < size; i++)
if (!image[i]) return 0;
return 1;
}
And finally, I want a way to time my conversion functions, and display their average runtime:
main.c
uint8_t time_conversion (const char * name, gc_t gc_fun,
uint8_t * original_img, uint8_t * grayscale_img,
uint32_t original_size, uint32_t iterations)
{
int i = 0;
uint8_t ret = 0;
clock_t start;
clock_t stop;
double average_time = 0.0;
// First, check that the conversion function works
gc_fun(original_img, grayscale_img, original_size);
ret = test_conversion(grayscale_img, original_size / 2);
if (!ret) {
printf("ERROR: Conversion failed for %s.\n", name);
return 0;
}
// Time the function for a number of iterations
start = clock();
for (i = 0; i < iterations; i++)
gc_fun(original_img, grayscale_img, original_size);
stop = clock();
// Calculate the average time (in microseconds)
average_time = 1e6 * ((double) stop - start) / CLOCKS_PER_SEC / iterations;
printf("%-40s: %f us\n", name, average_time);
// Wipe the grayscale image clean
memset(grayscale_img, 0, original_size / 2);
return 1;
}
This function does a few things:
- It checks that the conversion function passed to it indeed works;
- It times it for a given number of iterations;
- It displays its name and its average execution time;
- It resets the grayscale image by setting all its elements to 0.
Granted, all that setup is maybe a bit annoying, but now we can easily test all of our converters! Suppose we had a bunch of converter functions called my_gc_1, my_gc_2, my_gc_3 etc; then this is what our main() would look like:
main.c
int main (int argc, char * argv[])
{
// Image data
uint8_t original_image[TOTAL_PIXELS];
uint8_t grayscale_image[GRAYSCALE_PIXELS];
uint8_t ret = 0;
// Conversion functions
gc_t gc_list[] = {
my_gc_1,
my_gc_2,
my_gc_3
};
char * gc_names[] = {
"my_gc_1",
"my_gc_2",
"my_gc_3"
};
// Counter
int i = 0;
setup_image((uint8_t *) original_image, TOTAL_PIXELS);
for (i = 0; i < sizeof(gc_list) / sizeof(gc_t); i++) {
ret = time_conversion(gc_names[i], gc_list[i],
(uint8_t *) original_image,
(uint8_t *) grayscale_image,
TOTAL_PIXELS, ITERATIONS);
if (!ret) return 1;
}
return 0;
}
We basically have an array of pointers to grayscale conversion functions and an array of the
names of those functions, and by passing all that to our time_conversion
function we get instant metrics on how our
functions perform.
That was a lot of setup for such a simple problem! Let's get to actually trying out ideas.
Implementing on x86
We could start with the first thing that comes to mind, which (for me at least) is just a for-loop that picks all the even indices and puts them in the grayscale array:
main.c
void naive_gc (uint8_t * original, uint8_t * grayscale, uint32_t original_size)
{
uint32_t i = 0;
for (i = 0; i < original_size / 2; i++)
grayscale[i] = original[2 * i];
}
Now all I need to do is add it to the array of function pointers and to the array of function
names in main()
:
main.c
// Conversion functions
gc_t gc_list[] = {
naive_gc
};
char * gc_names[] = {
"naive_gc"
};
And now we can finally run the program for the first time:
$ shell
$ make
gcc -Wall -O0 -S main.c
gcc -Wall -O0 main.s -o main
$ ./main
naive_gc : 1731.496000 us
So this naive solution is, as expected, not that great. Even on my 3.7 GHz processor it takes 1.7 milliseconds to run, so I would expect it to be much slower on an MCU. If you don't get why 1.7 ms is slow: sure, human response time is below 250 ms, but I would expect this code to run at least 10 to 50 times slower on an MCU processor (which generally runs at 10-500 MHz), and embedded systems often deal with hard real-time constraints. Because the processor is slower, everything is slower, so gaining a few milliseconds here and there actually makes a lot of difference.
What is slowing the function down? Well, writing the for-loop limit condition as original_size / 2
is certainly convenient for us, the
programmers, but I'm pretty sure the compiler doesn't optimize that out in -O0
mode. Let's look at the assembly code for naive_gc
:
main.s
naive_gc:
.LFB3:
; Function setup
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movq %rdi, -24(%rbp)
movq %rsi, -32(%rbp)
movl %edx, -36(%rbp)
; ------------------------------------------------------------------------------
movl $0, -4(%rbp) ; int i = 0
movl $0, -4(%rbp) ; i = 0 (for-loop initialization)
jmp .L17 ; begin the for-loop
.L18:
movl -4(%rbp), %eax ; EAX = i
addl %eax, %eax ; EAX += EAX
movl %eax, %edx ; EDX = EAX
movq -24(%rbp), %rax ; RAX = &original
addq %rdx, %rax ; RAX += RDX (=EDX)
movl -4(%rbp), %ecx ; ECX = i
movq -32(%rbp), %rdx ; RDX = &grayscale
addq %rcx, %rdx ; RDX += RCX (=ECX)
movzbl (%rax), %eax ; EAX = load byte from RAX (extend with 0s)
movb %al, (%rdx) ; byte at address RDX = AL
addl $1, -4(%rbp) ; i++
.L17:
movl -36(%rbp), %eax ; EAX = original_size
shrl %eax ; EAX >>= 1 (= original_size / 2)
cmpl %eax, -4(%rbp) ; compare i with EAX (computes i - EAX)
jb .L18 ; jump if below, aka jump if carry flag = 1
; ------------------------------------------------------------------------------
; End of function
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
The interesting part lies between the lines I've put in the .s
file; the rest is just boilerplate function
prologue/epilogue. Now, as you can probably see, .LFB3
is the initialization
phase of the function, .L18
is the body of the for-loop and .L17
is the for-loop increment/condition test. The latter is
what slows our code down considerably: at the end of each iteration, we need to recalculate the
end condition (remember, original_size / 2
) and compare
it to i
, which typically is done by moving the end
condition number to a register, subtracting i
from it,
and checking the flags register for the result. We will, of course, need to check the value of
i
at each iteration, but there's no need to recalculate
original_size / 2
because its value will not change.
As a first improvement, let's move the end condition to its own variable so that we won't have to recalculate it on each iteration:
main.c
void naive_precalculate_gc (uint8_t * original,
uint8_t * grayscale,
uint32_t original_size)
{
uint32_t i = 0;
uint32_t size = original_size / 2;
for (i = 0; i < size; i++)
grayscale[i] = original[2 * i];
}
Now let's add this new function to our two lists in main()
, make
and run
the program:
$ shell
$ ./main
naive_gc : 1738.254000 us
naive_precalculate_gc : 1482.010000 us
That certainly is faster! Indeed, by checking the assembly code we can now see that the loop-condition block contains fewer instructions, as it just has to fetch the end condition from the stack rather than recalculate it.
This is still pretty slow though. Can we get it to go faster? Let's borrow an old firmware trick, descending loops:
main.c
void naive_descending_gc (uint8_t * original,
uint8_t * grayscale,
uint32_t original_size)
{
int32_t i = original_size / 2;
while (i--)
grayscale[i] = original[2 * i];
}
As you can see, this function also calculates the end condition just once; furthermore, it
exploits the fact that 0 is a very special
value. Think about it: when comparing i
to an arbitrary value X
, we need to subtract the one from the other and look at the
result. If we instead compare i
to 0, there are faster
ways to check if they are equal. Let's take a look at the assembly code of the new function:
main.s
naive_descending_gc:
.LFB4:
; Function setup
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movq %rdi, -24(%rbp)
movq %rsi, -32(%rbp)
movl %edx, -36(%rbp)
; ------------------------------------------------------------------------------
movl -36(%rbp), %eax ; EAX = original_size
shrl %eax ; EAX >>= 1 (= original_size / 2)
movl %eax, -4(%rbp) ; i = EAX (for-loop initialization)
jmp .L20 ; begin the for-loop
.L21:
movl -4(%rbp), %eax ; EAX = i
addl %eax, %eax ; EAX += EAX
movslq %eax, %rdx ; RDX = EAX
movq -24(%rbp), %rax ; RAX = &original
addq %rdx, %rax ; RAX += RDX
movl -4(%rbp), %edx ; EDX = i
movslq %edx, %rcx ; RCX = EDX
movq -32(%rbp), %rdx ; RDX = &grayscale
addq %rcx, %rdx ; RDX += RCX
movzbl (%rax), %eax ; EAX = load byte from RAX (extend with 0s)
movb %al, (%rdx) ; byte at address RDX = AL
.L20:
movl -4(%rbp), %eax ; EAX = i
leal -1(%rax), %edx ; EDX = EAX - 1
movl %edx, -4(%rbp) ; i = EDX
testl %eax, %eax ; if EAX (result of i--) != 0, ZF = 0,
; otherwise ZF = 1
jne .L21 ; jump if ZF = 0
; ------------------------------------------------------------------------------
; End of function
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
As before, the interesting part lies between the lines. Take a look at .L20: we're now just subtracting one from i, then running the test instruction, which performs an AND operation between %eax and itself. Since %eax holds the value of i before the decrement, the result of that AND is nonzero as long as i was nonzero, so the zero flag stays clear and the loop keeps going. On the contrary, once we get to i = 0, the AND operation returns 0, the zero flag is set, and the loop is over.
I'm not certain about what I'm going to say, but I'm pretty sure this is faster because a logical AND operation (which is what test uses) is cheaper than a subtraction (which is what cmp uses). By the way, if you're wondering what the l at the end of testl is, that's just the size of the data being handled; in this case, l means doubleword, basically a 32-bit integer.
Let's add that function to the lists and run our program again:
$ shell
$ ./main
naive_gc : 1770.938000 us
naive_precalculate_gc : 1483.474000 us
naive_descending_gc : 1377.706000 us
And that's indeed faster! We can go faster though.
Another thing you might have noticed is that our loop has quite a few iterations, especially as the resolution of the image increases. A simple way to address this is to lower the number of iterations by moving the data around through wider pointer types. Let me show you what I mean:
---------------64----------------
-------32-------|-------32-------
8 8 8 8 8 8 8 8
original: | Y | U | Y | V | Y | U | Y | V |
grayscale: | Y | Y | Y | Y | Y | Y | Y | Y |
As you can see, there are as many Ys in 64 bits of the original image as there are Ys in 32 bits of the grayscale image. That means we can do the following:
main.c
void large_pointers_gc (uint8_t * original,
uint8_t * grayscale,
uint32_t original_size)
{
uint32_t i = 0;
uint64_t * p_original = (uint64_t *) original;
uint32_t * p_grayscale = (uint32_t *) grayscale;
uint32_t size = original_size / 8;
for (i = 0; i < size; i++)
p_grayscale[i] = (p_original[i] & 0x00000000000000ffU) +
((p_original[i] & 0x0000000000ff0000U) >> 8) +
((p_original[i] & 0x000000ff00000000U) >> 16) +
((p_original[i] & 0x00ff000000000000U) >> 24);
}
All we need to do is grab the 4 Y bytes found in each 64-bit chunk of the original image, shift them into their appropriate locations in the 32-bit chunk of the grayscale image, then add them up to form the final 32-bit value. We need to divide the end condition by 8 this time because, well, we're now consuming the original image 8 bytes (one uint64_t) at a time instead of 1.
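One caveat worth flagging before we move on: casting a uint8_t pointer to uint64_t * assumes the buffer is suitably aligned and that the compiler is forgiving about aliasing, and the masks above assume a little-endian byte order (true for x86 and for the Cortex-M3 used later). If you wanted to be more defensive, a sketch of the same idea using memcpy for the loads and stores (my addition, not one of the benchmarked functions) could look like this:
/* Same mask-and-shift idea as large_pointers_gc, but each 64-bit chunk is
 * loaded/stored with memcpy to sidestep alignment and strict-aliasing issues.
 * The masks still assume little-endian byte order. */
void large_pointers_memcpy_gc (uint8_t * original,
                               uint8_t * grayscale,
                               uint32_t original_size)
{
    uint32_t i = 0;
    uint32_t size = original_size / 8;
    for (i = 0; i < size; i++) {
        uint64_t chunk;
        uint32_t out;
        memcpy(&chunk, original + 8 * i, sizeof chunk);
        out = (uint32_t)(( chunk        & 0xffU)       |
                        ((chunk >>  8) & 0xff00U)      |
                        ((chunk >> 16) & 0xff0000U)    |
                        ((chunk >> 24) & 0xff000000U));
        memcpy(grayscale + 4 * i, &out, sizeof out);
    }
}
I'll stick with the cast version for all the timings that follow.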
The large-pointer version is, indeed, quite a bit faster:
$ shell
$ ./main
naive_gc : 1753.022000 us
naive_precalculate_gc : 1481.812000 us
naive_descending_gc : 1382.170000 us
large_pointers_gc : 723.274000 us
We've almost cut the execution time in half! Let's also make a descending version of that last function:
main.c
void large_pointers_descending_gc (uint8_t * original,
uint8_t * grayscale,
uint32_t original_size)
{
uint32_t i = original_size / 8;
uint64_t * p_original = (uint64_t *) original;
uint32_t * p_grayscale = (uint32_t *) grayscale;
while (i--)
p_grayscale[i] = (p_original[i] & 0x00000000000000ffU) +
((p_original[i] & 0x0000000000ff0000U) >> 8) +
((p_original[i] & 0x000000ff00000000U) >> 16) +
((p_original[i] & 0x00ff000000000000U) >> 24);
}
But is it faster?
$ shell
$ ./main
naive_gc : 1752.600000 us
naive_precalculate_gc : 1488.222000 us
naive_descending_gc : 1380.016000 us
large_pointers_gc : 726.352000 us
large_pointers_descending_gc : 712.570000 us
Well... yes and no. On my system, it sometimes is slower and sometimes faster, only by a few microseconds, although sometimes (like this one) the difference is greater.
Let's also try partially unrolling the loop: this also cuts down on iterations and, at least in theory, should make our code faster.
main.c
void large_pointers_unrolled_gc (uint8_t * original,
uint8_t * grayscale,
uint32_t original_size)
{
uint32_t i = 0;
uint64_t * p_original = (uint64_t *) original;
uint32_t * p_grayscale = (uint32_t *) grayscale;
uint32_t size = original_size / 8;
for (i = 0; i < size; i += 4) {
p_grayscale[i] = (p_original[i] & 0x00000000000000ffU) +
((p_original[i] & 0x0000000000ff0000U) >> 8) +
((p_original[i] & 0x000000ff00000000U) >> 16) +
((p_original[i] & 0x00ff000000000000U) >> 24);
p_grayscale[i + 1] = (p_original[i + 1] & 0x00000000000000ffU) +
((p_original[i + 1] & 0x0000000000ff0000U) >> 8) +
((p_original[i + 1] & 0x000000ff00000000U) >> 16) +
((p_original[i + 1] & 0x00ff000000000000U) >> 24);
p_grayscale[i + 2] = (p_original[i + 2] & 0x00000000000000ffU) +
((p_original[i + 2] & 0x0000000000ff0000U) >> 8) +
((p_original[i + 2] & 0x000000ff00000000U) >> 16) +
((p_original[i + 2] & 0x00ff000000000000U) >> 24);
p_grayscale[i + 3] = (p_original[i + 3] & 0x00000000000000ffU) +
((p_original[i + 3] & 0x0000000000ff0000U) >> 8) +
((p_original[i + 3] & 0x000000ff00000000U) >> 16) +
((p_original[i + 3] & 0x00ff000000000000U) >> 24);
}
}
Well that sure is uglier... how about speed?
$ shell
$ ./main
naive_gc : 1759.348000 us
naive_precalculate_gc : 1485.786000 us
naive_descending_gc : 1389.732000 us
large_pointers_gc : 719.982000 us
large_pointers_descending_gc : 713.292000 us
large_pointers_unrolled_gc : 780.634000 us
So that messy code isn't worth it. It actually makes sense: unrolling does make our code iterate less, but it also adds load to the body of the for-loop. There's a tradeoff between the two, and here we have clearly crossed the line.
Well, that's pretty interesting, but this whole time we have just been using our own ideas to make our algorithm faster: let's ask the compiler for some help! I'm going to add -O3 to my compiler flags; the only change is the optimization level in the Makefile's CFLAGS:
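Makefile
CFLAGS = -Wall -O3
Now let's make everything again and see what that does: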
$ shell
$ make clean all
rm -rf main.s
gcc -Wall -O3 -S main.c
gcc -Wall -O3 main.s -o main
$ ./main
naive_gc : 69.988000 us
naive_precalculate_gc : 69.516000 us
naive_descending_gc : 326.612000 us
large_pointers_gc : 162.852000 us
large_pointers_descending_gc : 331.648000 us
large_pointers_unrolled_gc : 138.106000 us
Oh how the tables have turned! It seems that almost all of our optimized functions are actually way slower than the naive function we started with. That's actually pretty unsurprising as well; gcc (like most compilers) does an excellent job of optimizing code when you tell it to. We can see, however, that calculating the end condition once still improves execution time a tiny bit!
These are some very interesting results, but I want to see what happens on an actual MCU; the compiler is not exactly the same, and there's no underlying OS. Let's see if we can come up with better algorithms on an MCU!
Implementing on ARM Cortex-M3
I happen to own a NUCLEO-F103RB board by ST, which will do nicely for this test. If you want to follow along, you can check out Ayoub Sabri's STM32 primer,1 in which he explains how to set up a basic blinky project on STM32, and you can move from there. As for profiling, I'm using Sergey Bashlayev's STM32 Profiler, because it's both lightweight and microsecond-accurate.
Keep in mind that I heavily reduced the image dimensions for this test so as to not trigger any stack overflows; the board I'm using only has 20 KiB of SRAM, so I'm reducing the dimensions to 80 by 60. If you want to compare the execution times, keep in mind that the system clock's frequency is 64 MHz. I've compiled with the -O3 flag, and it should be noted that, at least in this case, I see no difference between -O3 and -Ofast. Other than that, the same algorithms are used; the only change to the code is the image size:
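main.c
#define WIDTH 80
#define HEIGHT 60
Let's see the results: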
naive_gc : 450.658 us
naive_precalculate_gc : 375.646 us
naive_descending_gc : 375.616 us
large_pointers_gc : 197.45 us
large_pointers_descending_gc : 216.188 us
large_pointers_unrolled_gc : 195.386 us
Okay, so we can see that different compilers (here I'm using arm-none-eabi-gcc
btw) behave very differently. Now
the first naive function is the slowest, and the large pointers seem to make a lot of
difference. The best result so far is the unrolled version of the large pointers, so what would
happen if we were to unroll... more?
main.c
void large_pointers_unrolled_more_gc (uint8_t * original,
uint8_t * grayscale,
uint32_t original_size)
{
uint32_t i = 0;
uint64_t * p_original = (uint64_t *) original;
uint32_t * p_grayscale = (uint32_t *) grayscale;
uint32_t size = original_size / 8;
for (i = 0; i < size; i += 8) {
p_grayscale[i] = (p_original[i] & 0x00000000000000ffU) +
((p_original[i] & 0x0000000000ff0000U) >> 8) +
((p_original[i] & 0x000000ff00000000U) >> 16) +
((p_original[i] & 0x00ff000000000000U) >> 24);
p_grayscale[i + 1] = (p_original[i + 1] & 0x00000000000000ffU) +
((p_original[i + 1] & 0x0000000000ff0000U) >> 8) +
((p_original[i + 1] & 0x000000ff00000000U) >> 16) +
((p_original[i + 1] & 0x00ff000000000000U) >> 24);
p_grayscale[i + 2] = (p_original[i + 2] & 0x00000000000000ffU) +
((p_original[i + 2] & 0x0000000000ff0000U) >> 8) +
((p_original[i + 2] & 0x000000ff00000000U) >> 16) +
((p_original[i + 2] & 0x00ff000000000000U) >> 24);
p_grayscale[i + 3] = (p_original[i + 3] & 0x00000000000000ffU) +
((p_original[i + 3] & 0x0000000000ff0000U) >> 8) +
((p_original[i + 3] & 0x000000ff00000000U) >> 16) +
((p_original[i + 3] & 0x00ff000000000000U) >> 24);
p_grayscale[i + 4] = (p_original[i + 4] & 0x00000000000000ffU) +
((p_original[i + 4] & 0x0000000000ff0000U) >> 8) +
((p_original[i + 4] & 0x000000ff00000000U) >> 16) +
((p_original[i + 4] & 0x00ff000000000000U) >> 24);
p_grayscale[i + 5] = (p_original[i + 5] & 0x00000000000000ffU) +
((p_original[i + 5] & 0x0000000000ff0000U) >> 8) +
((p_original[i + 5] & 0x000000ff00000000U) >> 16) +
((p_original[i + 5] & 0x00ff000000000000U) >> 24);
p_grayscale[i + 6] = (p_original[i + 6] & 0x00000000000000ffU) +
((p_original[i + 6] & 0x0000000000ff0000U) >> 8) +
((p_original[i + 6] & 0x000000ff00000000U) >> 16) +
((p_original[i + 6] & 0x00ff000000000000U) >> 24);
p_grayscale[i + 7] = (p_original[i + 7] & 0x00000000000000ffU) +
((p_original[i + 7] & 0x0000000000ff0000U) >> 8) +
((p_original[i + 7] & 0x000000ff00000000U) >> 16) +
((p_original[i + 7] & 0x00ff000000000000U) >> 24);
}
}
Okay, so you should definitely never ever write code like this, but for the sake of being thorough let's run it and see if it's faster:
naive_gc : 450.658 us
naive_precalculate_gc : 375.646 us
naive_descending_gc : 375.616 us
large_pointers_gc : 197.45 us
large_pointers_descending_gc : 216.188 us
large_pointers_unrolled_gc : 195.386 us
large_pointers_unrolled_more_gc : 200.178 us
Not really. Yet again, we can see that more unrolling doesn't always mean more optimization.
Conclusions
I think this little experiment has made it very clear that optimization depends heavily on the platform we are working on, meaning the hardware as well as the toolchain we use to generate our binaries; as we saw, two different compilers may optimize things very differently. That said, there are a few optimization techniques we can keep in mind (at least when working on embedded platforms):
- Use -O3 or -Ofast. Compilers are, most of the time, very clever about optimizing code. It takes a few keystrokes to add a flag to your Makefile, but it can end up cutting your execution time down by a lot. Use that to your advantage.
- Avoid repeated calculations, memory accesses etc. Most loops run for quite a few iterations, so all these small overheads add up. Try to keep the number of memory reads/writes and (especially costly) operations in your loops to a minimum.
- Favor descending loops. We saw that on desktop processors this doesn't seem to help all that much, however on MCUs it makes a world of difference (just compare the execution times of naive_gc and naive_descending_gc when run on the STM32 MCU).
- If you really need to cut down on time, give unrolling a try. It might not always work, or your compiler might do it automatically when optimizing for speed (see the sketch after this list), but it's worth a try. However, keep in mind that there's a tradeoff between loop iterations and instructions in the loop, so don't overdo it; try to find the sweet spot where it actually makes your code run faster.
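On that last point about the compiler doing the unrolling for you: with GCC you can ask for it explicitly instead of copy-pasting loop bodies, either globally with the -funroll-loops flag or, in recent GCC versions, with a per-loop pragma. Here's a quick sketch (my addition, not one of the functions benchmarked above; the unroll factor of 4 is an arbitrary starting point):
/* Same loop as large_pointers_gc, but the unrolling is left to GCC.
 * Build with -funroll-loops, or rely on the per-loop hint below
 * (supported by recent GCC versions). */
void large_pointers_hinted_gc (uint8_t * original,
                               uint8_t * grayscale,
                               uint32_t original_size)
{
    uint32_t i = 0;
    uint64_t * p_original = (uint64_t *) original;
    uint32_t * p_grayscale = (uint32_t *) grayscale;
    uint32_t size = original_size / 8;
#pragma GCC unroll 4 /* arbitrary factor, tune for your target */
    for (i = 0; i < size; i++)
        p_grayscale[i] = (p_original[i] & 0x00000000000000ffU) +
                        ((p_original[i] & 0x0000000000ff0000U) >> 8) +
                        ((p_original[i] & 0x000000ff00000000U) >> 16) +
                        ((p_original[i] & 0x00ff000000000000U) >> 24);
}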
Also, a final point on MCUs: be very careful about how you profile your code. Things that seem trivial on a desktop processor (such as calls to printf to print some debugging data) can noticeably slow down your code. Also, try to use the lowest level of timing possible in order to get maximum precision; the profiler I linked to in the previous section performs extremely well on ARM Cortex-M CPUs because it uses the CPU's debug interface, which was made precisely for this kind of measurement.
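For the curious, on Cortex-M3/M4/M7 parts that low-level timing source is typically the DWT cycle counter. This is not the profiler I used above, just a minimal sketch of the idea, assuming the CMSIS device header for this board (stm32f1xx.h here):
#include <stdint.h>
#include "stm32f1xx.h" /* CMSIS device header; adjust for your MCU */
/* Enable and reset the DWT cycle counter (counts core clock cycles) */
static void cycle_counter_init (void)
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk; /* enable the trace/debug blocks */
    DWT->CYCCNT = 0;                                /* reset the counter             */
    DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;            /* start counting                */
}
/* Time a single call of a conversion function, in core clock cycles;
 * at 64 MHz, microseconds = cycles / 64. */
static uint32_t cycles_for (gc_t gc_fun, uint8_t * original,
                            uint8_t * grayscale, uint32_t size)
{
    uint32_t start = DWT->CYCCNT;
    gc_fun(original, grayscale, size);
    return DWT->CYCCNT - start;
}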
If you have any questions or suggestions for additions to this post, don't hesitate to send me an email!
Acknowledgments
I'd like to thank Ayoub Sabri for his input on my optimization methodology, as well as his advice on how to efficiently profile code on STM32. :)
Code
In case you want to quickly test this for yourself, here's my full code for the desktop version:
main.c
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <string.h>
#define WIDTH 1920
#define HEIGHT 1080
#define TOTAL_PIXELS (WIDTH * HEIGHT)
#define GRAYSCALE_PIXELS (TOTAL_PIXELS / 2)
#define ITERATIONS 500
typedef void (* gc_t)(uint8_t *, uint8_t *, uint32_t);
/*
* ================
* HELPER FUNCTIONS
* ================
*/
void setup_image (uint8_t * image, uint32_t size)
{
uint32_t i = 0;
// Set elements at even indices to 1 and the rest to 0
for (i = 0; i < size; i++)
image[i] = !(i % 2);
}
uint8_t test_conversion (uint8_t * image, uint32_t size)
{
uint32_t i = 0;
// Check that all the elements are equal to 1
for (i = 0; i < size; i++)
if (!image[i]) return 0;
return 1;
}
uint8_t time_conversion (const char * name, gc_t gc_fun,
uint8_t * original_img, uint8_t * grayscale_img,
uint32_t original_size, uint32_t iterations)
{
int i = 0;
uint8_t ret = 0;
clock_t start;
clock_t stop;
double average_time = 0.0;
// First, check that the conversion function works
gc_fun(original_img, grayscale_img, original_size);
ret = test_conversion(grayscale_img, original_size / 2);
if (!ret) {
printf("ERROR: Conversion failed for %s.\n", name);
return 0;
}
// Time the function for a number of iterations
start = clock();
for (i = 0; i < iterations; i++)
gc_fun(original_img, grayscale_img, original_size);
stop = clock();
// Calculate the average time (in microseconds)
average_time = 1e6 * ((double) stop - start) / CLOCKS_PER_SEC / iterations;
printf("%-40s: %f us\n", name, average_time);
// Wipe the grayscale image clean
memset(grayscale_img, 0, original_size / 2);
return 1;
}
/**
* ====================
* CONVERSION FUNCTIONS
* ====================
*/
void naive_gc (uint8_t * original, uint8_t * grayscale, uint32_t original_size)
{
uint32_t i = 0;
for (i = 0; i < original_size / 2; i++)
grayscale[i] = original[2 * i];
}
void naive_precalculate_gc (uint8_t * original,
uint8_t * grayscale,
uint32_t original_size)
{
uint32_t i = 0;
uint32_t size = original_size / 2;
for (i = 0; i < size; i++)
grayscale[i] = original[2 * i];
}
void naive_descending_gc (uint8_t * original,
uint8_t * grayscale,
uint32_t original_size)
{
int32_t i = original_size / 2;
while (i--)
grayscale[i] = original[2 * i];
}
void large_pointers_gc (uint8_t * original,
uint8_t * grayscale,
uint32_t original_size)
{
uint32_t i = 0;
uint64_t * p_original = (uint64_t *) original;
uint32_t * p_grayscale = (uint32_t *) grayscale;
uint32_t size = original_size / 8;
for (i = 0; i < size; i++)
p_grayscale[i] = (p_original[i] & 0x00000000000000ffU) +
((p_original[i] & 0x0000000000ff0000U) >> 8) +
((p_original[i] & 0x000000ff00000000U) >> 16) +
((p_original[i] & 0x00ff000000000000U) >> 24);
}
void large_pointers_descending_gc (uint8_t * original,
uint8_t * grayscale,
uint32_t original_size)
{
uint32_t i = original_size / 8;
uint64_t * p_original = (uint64_t *) original;
uint32_t * p_grayscale = (uint32_t *) grayscale;
while (i--)
p_grayscale[i] = (p_original[i] & 0x00000000000000ffU) +
((p_original[i] & 0x0000000000ff0000U) >> 8) +
((p_original[i] & 0x000000ff00000000U) >> 16) +
((p_original[i] & 0x00ff000000000000U) >> 24);
}
void large_pointers_unrolled_gc (uint8_t * original,
uint8_t * grayscale,
uint32_t original_size)
{
uint32_t i = 0;
uint64_t * p_original = (uint64_t *) original;
uint32_t * p_grayscale = (uint32_t *) grayscale;
uint32_t size = original_size / 8;
for (i = 0; i < size; i += 4) {
p_grayscale[i] = (p_original[i] & 0x00000000000000ffU) +
((p_original[i] & 0x0000000000ff0000U) >> 8) +
((p_original[i] & 0x000000ff00000000U) >> 16) +
((p_original[i] & 0x00ff000000000000U) >> 24);
p_grayscale[i + 1] = (p_original[i + 1] & 0x00000000000000ffU) +
((p_original[i + 1] & 0x0000000000ff0000U) >> 8) +
((p_original[i + 1] & 0x000000ff00000000U) >> 16) +
((p_original[i + 1] & 0x00ff000000000000U) >> 24);
p_grayscale[i + 2] = (p_original[i + 2] & 0x00000000000000ffU) +
((p_original[i + 2] & 0x0000000000ff0000U) >> 8) +
((p_original[i + 2] & 0x000000ff00000000U) >> 16) +
((p_original[i + 2] & 0x00ff000000000000U) >> 24);
p_grayscale[i + 3] = (p_original[i + 3] & 0x00000000000000ffU) +
((p_original[i + 3] & 0x0000000000ff0000U) >> 8) +
((p_original[i + 3] & 0x000000ff00000000U) >> 16) +
((p_original[i + 3] & 0x00ff000000000000U) >> 24);
}
}
int main (int argc, char * argv[])
{
// Image data
uint8_t original_image[TOTAL_PIXELS];
uint8_t grayscale_image[GRAYSCALE_PIXELS];
uint8_t ret = 0;
// Conversion functions
gc_t gc_list[] = {
naive_gc,
naive_precalculate_gc,
naive_descending_gc,
large_pointers_gc,
large_pointers_descending_gc,
large_pointers_unrolled_gc,
};
char * gc_names[] = {
"naive_gc",
"naive_precalculate_gc",
"naive_descending_gc",
"large_pointers_gc",
"large_pointers_descending_gc",
"large_pointers_unrolled_gc",
};
// Counter
int i = 0;
setup_image((uint8_t *) original_image, TOTAL_PIXELS);
for (i = 0; i < sizeof(gc_list) / sizeof(gc_t); i++) {
ret = time_conversion(gc_names[i], gc_list[i],
(uint8_t *) original_image,
(uint8_t *) grayscale_image,
TOTAL_PIXELS, ITERATIONS);
if (!ret) return 1;
}
return 0;
}
1. Unfortunately, the link is now dead :( ↩