Optimizing loops
written on April 9, 2020
categories: engineering, embedded
Up until quite recently, most of the C code I wrote didn't need to be that optimized: it was either for personal or university projects that ran on proper x86 processors and thus didn't need much low-level optimization. For all of them you could easily get away with good algorithmic design; the goal wasn't to gain a few milliseconds of execution time, but rather to not have your program run for 3-4 hours in order to spit out a result. I wrote my fair share of embedded C code in uni — after all, this is what I majored in — but as with most uni projects there weren't really any heavy constraints, because the goal is to learn, not necessarily to write the most optimized code possible.
However, in the industry we often do go after the most optimized code. So, inspired by a problem I came across, I'd like to talk about a very well known — and a tad bit controversial (?) — topic in firmware: optimizing loops. So grab a cup of your favorite beverage and come along for the ride!
Getting into context
I could talk about this in an abstract, generic context, however I think problems like these are way more interesting if they actually relate to some real-world problem; after all, this is what engineering is all about.
So, I came across a very simple task: convert images stored in YUV format to simple grayscale. For the people who know what those are, you can jump ahead, but if you don't, here's a simple explanation: YUV is a very well known color encoding system that's used mainly because it presents the same information as RGB, but in a more compressible form. Basically, instead of splitting the colors of an image into three RGB channels (Red, Green and Blue), we split them into one luminance channel (Y) and two chrominance channels (U and V).
I'm not going to go into detail on how that's done or what it means, but the gist of it is that our human eyes are way more attentive to the luminance of an image rather than its chrominance; that means that we can compress the U and V channels quite a lot before it becomes noticeable. If you try to do the same with any of the three RGB channels, it immediately, visibly deteriorates the image. I guess the way I have intuitively "justified" to myself why that works is because the Y channel is essentially the black & white version of the image, and in some way it contains the "structure" of the image. Another way to put it is that, if I show you a picture of a dog, you can tell that it's a dog whether the picture is colored or not; the important information lies in the Y channel, not in the chrominance of the image.
Now that you get what YUV is, let's talk about how it's expressed in code. There are
multiple different YUV formats for different types of file sizes, compression etc. In
this problem, we're working with YUV422, which simply means that for every U sample and
every V sample there are two Y samples. I really don't wanna say
pixels here, because that would be misleading as the 422 actually refers to sampling,
but anyway it just means that we have twice as much Y as we have U and V. In code, the
YUV422 images are stored in uint8_t
arrays, and they're laid out as follows:
uint8_t image[] = { Y, U, Y, V, Y, U, Y, V, Y, U, Y, V, ... }
Of course, each of these elements takes up 8 bits, so we end up with 32 bits or 4 bytes for 2 pixels of image. In order to convert this image to grayscale, all we need to do is cherry-pick the Y elements (basically the even indices of the array) and put them in a new array to create a grayscale image. Let's create a basic setup to test our conversion code!
A basic profiling setup
So, I'm going to create a small setup in order to easily profile my solutions. Here's the "project structure":
└── opti-loop/
  ├── main.c
  └── Makefile
The Makefile is pretty simple, I just want it to automatically generate the assembly code as well so that I can check what my code is doing under the hood:
Makefile
CC = gcc
CFLAGS = -Wall -O0

all: main

main: main.s
	$(CC) $(CFLAGS) main.s -o main

main.s: main.c
	$(CC) $(CFLAGS) -S main.c

clean:
	rm -rf main.s

.PHONY: all clean
I also omit the -g
option (because that fills up the assembly file with a bunch of
unnecessary things, which will make it harder for me to understand the assembly code),
and I force it to not optimize the code (-O0
) just to see what I can do to make things
faster by myself.
Now let's see what the main.c
file is all about. First, we need to define some
constants:
main.c
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <string.h>
#define WIDTH 1920
#define HEIGHT 1080
#define TOTAL_PIXELS (WIDTH * HEIGHT)
#define GRAYSCALE_PIXELS (TOTAL_PIXELS / 2)
#define ITERATIONS 500
typedef void (* gc_t)(uint8_t *, uint8_t *, uint32_t);
I'm going to use stdint.h
because I eventually want to test on an MCU and thus I want
to be conservative with the memory. Apart from that, stdio.h
allows us to get access
to printf
, time.h
allows us to profile our functions and string.h
contains
memset
which is handy when you want to set entire blocks of memory to a certain value.
We obviously need to define the width and height of our image, and from that calculate the total pixels it will consist of (the grayscale version is half the size because, as explained before, we're only keeping the Y channel).
Lastly, ITERATIONS
refer to the number of times each function will run in order to
average out multiple execution times and get more stable results. Especially in a
desktop x86 processor, we really don't know what else could be running and thus affect,
albeit not much, our profiling.
The function pointer type I have created on that last line represents the form of our
grayscale converter functions: each one takes two uint8_t
pointers (the original and the grayscale image arrays) and
a uint32_t
integer (the original image size in bytes) as input, writes the converted pixels into
the grayscale array and returns nothing.
NOTE: I am avoiding dynamic memory allocation here because I want to keep the code bare-metal friendly. You could absolutely do this using non-static data structures.
Then, I need a function to set up my original "image":
main.c
void setup_image (uint8_t * image, uint32_t size)
{
uint32_t i = 0;
// Set elements at even indices to 1 and the rest to 0
for (i = 0; i < size; i++)
image[i] = !(i % 2);
}
I'm just representing all the Y bytes by a 1 and all the U and V bytes by a 0. I'd also like to test if the conversion was successful (you never know, we might screw something up while trying to get it to work faster):
main.c
uint8_t test_conversion (uint8_t * image, uint32_t size)
{
uint32_t i = 0;
// Check that all the elements are equal to 1
for (i = 0; i < size; i++)
if (!image[i]) return 0;
return 1;
}
And finally, I want a way to time my conversion functions, and display their average runtime:
main.c
uint8_t time_conversion (const char * name, gc_t gc_fun,
uint8_t * original_img, uint8_t * grayscale_img,
uint32_t original_size, uint32_t iterations)
{
int i = 0;
uint8_t ret = 0;
clock_t start;
clock_t stop;
double average_time = 0.0;
// First, check that the conversion function works
gc_fun(original_img, grayscale_img, original_size);
ret = test_conversion(grayscale_img, original_size / 2);
if (!ret) {
printf("ERROR: Conversion failed for %s.\n", name);
return 0;
}
// Time the function for a number of iterations
start = clock();
for (i = 0; i < iterations; i++)
gc_fun(original_img, grayscale_img, original_size);
stop = clock();
// Calculate the average time (in microseconds)
average_time = 1e6 * ((double) stop - start) / CLOCKS_PER_SEC / iterations;
printf("%-40s: %f us\n", name, average_time);
// Wipe the grayscale image clean
memset(grayscale_img, 0, original_size / 2);
return 1;
}
This function does a few things:
- It checks that the conversion function passed to it indeed works
- It times it for a given number of iterations
- It displays its name and its average execution time
- It "resets" the grayscale image by setting all its elements to 0
Granted, all that setup is maybe a bit annoying, but now we can easily test all of our
converters! Suppose we had a bunch of converter functions called my_gc_1, my_gc_2,
my_gc_3 etc; then this is how our main() would look:
main.c
int main (int argc, char * argv[])
{
// Image data
uint8_t original_image[TOTAL_PIXELS];
uint8_t grayscale_image[GRAYSCALE_PIXELS];
uint8_t ret = 0;
// Conversion functions
gc_t gc_list[] = {
my_gc_1,
my_gc_2,
my_gc_3
};
char * gc_names[] = {
"my_gc_1",
"my_gc_2",
"my_gc_3"
};
// Counter
int i = 0;
setup_image((uint8_t *) original_image, TOTAL_PIXELS);
for (i = 0; i < sizeof(gc_list) / sizeof(gc_t); i++) {
ret = time_conversion(gc_names[i], gc_list[i],
(uint8_t *) original_image,
(uint8_t *) grayscale_image,
TOTAL_PIXELS, ITERATIONS);
if (!ret) return 1;
}
return 0;
}
We basically have an array of pointers to grayscale conversion functions and an array of
the names of those functions, and by passing all that to our time_conversion
function
we get instant metrics on how our functions perform.
That was a lot of setup for such a simple problem! Let's get to actually trying out ideas.
Implementing on x86
We could start by the first thing that comes to mind, which — for me at least — is just a for-loop that picks all the even indices and puts them in the grayscale array:
main.c
void naive_gc (uint8_t * original, uint8_t * grayscale, uint32_t original_size)
{
uint32_t i = 0;
for (i = 0; i < original_size / 2; i++)
grayscale[i] = original[2 * i];
}
Now all I need to do is add it to the array of function pointers and to the array of
function names in main()
:
main.c
// Conversion functions
gc_t gc_list[] = {
naive_gc
};
char * gc_names[] = {
"naive_gc"
};
And now we can finally run the program for the first time:
$ make
gcc -Wall -O0 -S main.c
gcc -Wall -O0 main.s -o main
$ ./main
naive_gc : 1731.496000 us
So this naive solution is, as expected, not that great. Even on my 3.7 GHz desktop processor it takes 1.7 milliseconds to run, and I would expect it to be at least 10 to 50 times slower on an MCU, which generally runs at 10-500 MHz. If you don't get why 1.7 ms is slow, consider that at 30 frames per second you only get about 33 ms per frame, and embedded systems often deal with hard real-time constraints. Because the processor is slower, everything is slower, so gaining a few milliseconds here and there actually makes a lot of difference.
What is slowing the function down? Well, writing the for-loop limit condition as
original_size / 2
is certainly convenient for us, the programmers, but I'm pretty sure
the compiler doesn't optimize that out in -O0
mode. Let's look at the assembly code
for naive_gc
:
main.s
naive_gc:
.LFB3:
; Function setup
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movq %rdi, -24(%rbp)
movq %rsi, -32(%rbp)
movl %edx, -36(%rbp)
; ------------------------------------------------------------------------------
movl $0, -4(%rbp) ; int i = 0
movl $0, -4(%rbp) ; i = 0 (for-loop initialization)
jmp .L17 ; begin the for-loop
.L18:
movl -4(%rbp), %eax ; EAX = i
addl %eax, %eax ; EAX += EAX
movl %eax, %edx ; EDX = EAX
movq -24(%rbp), %rax ; RAX = &original
addq %rdx, %rax ; RAX += RDX (=EDX)
movl -4(%rbp), %ecx ; ECX = i
movq -32(%rbp), %rdx ; RDX = &grayscale
addq %rcx, %rdx ; RDX += RCX (=ECX)
movzbl (%rax), %eax ; EAX = load byte from RAX (extend with 0s)
movb %al, (%rdx) ; store AL (low byte of EAX) at address RDX
addl $1, -4(%rbp) ; i++
.L17:
movl -36(%rbp), %eax ; EAX = original_size
shrl %eax ; EAX >>= 1 (= original_size / 2)
cmpl %eax, -4(%rbp) ; compare(EAX, i) = EAX - i
jb .L18 ; jump if below, aka jump if carry flag = 1
; ------------------------------------------------------------------------------
; End of function
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
The interesting part lies between the lines I've put in the .s
file; the rest is just
boilerplate function prologue/epilogue. Now, as you can probably see, .LFB3
is the
"initialization" phase of the function, .L18
is the body of the for-loop and .L17
is the for-loop increment/condition test. The latter is what slows our code down
considerably: at the end of each iteration, we need to recalculate the end condition
(remember, original_size / 2
) and compare it to i
, which typically is done by moving
the end condition number to a register, subtracting i
from it, and checking the flags
register for the result. We will, of course, need to check the value of i
at each
iteration, but there's no need to recalculate original_size / 2
because its value will
not change.
As a first improvement, let's move the end condition to its own variable so that we won't have to recalculate it on each iteration:
main.c
void naive_precalculate_gc (uint8_t * original,
uint8_t * grayscale,
uint32_t original_size)
{
uint32_t i = 0;
uint32_t size = original_size / 2;
for (i = 0; i < size; i++)
grayscale[i] = original[2 * i];
}
Now let's add this new function to our two lists in main()
, make
and run the
program:
$ ./main
naive_gc : 1738.254000 us
naive_precalculate_gc : 1482.010000 us
That certainly is faster! Indeed, by checking the assembly code we can now see that the third function "block" contains fewer instructions, as it just has to look in the stack for the end condition rather than recalculate it.
This is still pretty slow though. Can we get it to go faster? Let's borrow an old firmware trick, descending loops:
main.c
void naive_descending_gc (uint8_t * original,
uint8_t * grayscale,
uint32_t original_size)
{
int32_t i = original_size / 2;
while (i--)
grayscale[i] = original[2 * i];
}
As you can see, this function also calculates the end condition just once; furthermore,
it exploits the fact that 0 is a very "special" value. Think about it: when comparing
i
to an arbitrary value X
, we need to subtract the one from the other and look at
the result. If we instead compare i
to 0, there are faster ways to check if they are
equal. Let's take a look at the assembly code of the new function:
main.s
naive_descending_gc:
.LFB4:
; Function setup
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movq %rdi, -24(%rbp)
movq %rsi, -32(%rbp)
movl %edx, -36(%rbp)
; ------------------------------------------------------------------------------
movl -36(%rbp), %eax ; EAX = original_size
shrl %eax ; EAX >>= 1 (= original_size / 2)
movl %eax, -4(%rbp) ; i = EAX (for-loop initialization)
jmp .L20 ; begin the for-loop
.L21:
movl -4(%rbp), %eax ; EAX = i
addl %eax, %eax ; EAX += EAX
movslq %eax, %rdx ; RDX = EAX (sign-extended)
movq -24(%rbp), %rax ; RAX = &original
addq %rdx, %rax ; RAX += RDX
movl -4(%rbp), %edx ; EDX = i
movslq %edx, %rcx ; RCX = EDX (sign-extended)
movq -32(%rbp), %rdx ; RDX = &grayscale
addq %rcx, %rdx ; RDX += RCX
movzbl (%rax), %eax ; EAX = load byte from RAX (extend with 0s)
movb %al, (%rdx) ; store AL (low byte of EAX) at address RDX
.L20:
movl -4(%rbp), %eax ; EAX = i
leal -1(%rax), %edx ; EDX = EAX - 1
movl %edx, -4(%rbp) ; i = EDX
testl %eax, %eax ; if EAX (result of i--) != 0, ZF = 0,
; otherwise ZF = 1
jne .L21 ; jump if ZF = 0
; ------------------------------------------------------------------------------
; End of function
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
As before, the interesting part lies between the lines. Take a look at .L20
: we're now
just subtracting one from i
, then running the test
instruction, which performs an AND
operation between %eax
and itself and sets the flags according to the result. AND
ing a value with itself just yields that same value, so as long as the old value of
i
(which is what i--
evaluates to) is nonzero, the result is nonzero, the zero flag
stays clear and jne
loops again. On the contrary, when we get to i = 0
, this AND
operation returns 0, the zero flag is set, and the loop is over.
I'm not certain about what I'm going to say, but I'm pretty sure this is faster mostly
because comparing against zero is cheap: test
only needs the register it already holds, whereas the cmp
in the ascending versions has to fetch the end condition from the stack and subtract it
at every iteration. By the way, if you're wondering what the "l" is at the end
of testl
, that's just the size of the data being handled. In this case, "l" means
"doubleword", basically a 32-bit integer.
Let's add that function to the lists and run our program again:
$ ./main
naive_gc : 1770.938000 us
naive_precalculate_gc : 1483.474000 us
naive_descending_gc : 1377.706000 us
And that's indeed faster! We can go faster though.
Another thing you might have noticed is that our loop has quite a few iterations, especially as the resolution of the image increases. A simple way to address this is to lower the amount of iterations by manipulating data with bigger pointers. Let me show you what I mean:
---------------64----------------
-------32-------|-------32-------
8 8 8 8 8 8 8 8
original: | Y | U | Y | V | Y | U | Y | V |
grayscale: | Y | Y | Y | Y | Y | Y | Y | Y |
As you can see, there are as many Ys in 64 bits of the original image as there are Ys in 32 bits of the grayscale image. That means we can do the following:
main.c
void large_pointers_gc (uint8_t * original,
uint8_t * grayscale,
uint32_t original_size)
{
uint32_t i = 0;
uint64_t * p_original = (uint64_t *) original;
uint32_t * p_grayscale = (uint32_t *) grayscale;
uint32_t size = original_size / 8;
for (i = 0; i < size; i++)
p_grayscale[i] = (p_original[i] & 0x00000000000000ffU) +
((p_original[i] & 0x0000000000ff0000U) >> 8) +
((p_original[i] & 0x000000ff00000000U) >> 16) +
((p_original[i] & 0x00ff000000000000U) >> 24);
}
All we need to do is grab the 4 Y bytes found in each 64-bit chunk of the original image, shift them into their appropriate locations in the 32-bit chunk of the grayscale image, then add them up to form the final 32-bit "vector". We need to divide the end condition by 8 this time because, well, each uint64_t read now consumes 8 bytes of the original image instead of 1!
This is, indeed, quite a bit faster:
$ ./main
naive_gc : 1753.022000 us
naive_precalculate_gc : 1481.812000 us
naive_descending_gc : 1382.170000 us
large_pointers_gc : 723.274000 us
We've almost cut the execution time in half! Let's also make a descending version of that last function:
main.c
void large_pointers_descending_gc (uint8_t * original,
uint8_t * grayscale,
uint32_t original_size)
{
uint32_t i = original_size / 8;
uint64_t * p_original = (uint64_t *) original;
uint32_t * p_grayscale = (uint32_t *) grayscale;
while (i--)
p_grayscale[i] = (p_original[i] & 0x00000000000000ffU) +
((p_original[i] & 0x0000000000ff0000U) >> 8) +
((p_original[i] & 0x000000ff00000000U) >> 16) +
((p_original[i] & 0x00ff000000000000U) >> 24);
}
But is it faster?
$ ./main
naive_gc : 1752.600000 us
naive_precalculate_gc : 1488.222000 us
naive_descending_gc : 1380.016000 us
large_pointers_gc : 726.352000 us
large_pointers_descending_gc : 712.570000 us
Well... yes and no. On my system, it sometimes is slower and sometimes faster, only by a few microseconds, although sometimes (like this one) the difference is greater.
Let's also try partially unrolling the loop: this also cuts down on iterations and, at least in theory, should make our code faster.
main.c
void large_pointers_unrolled_gc (uint8_t * original,
uint8_t * grayscale,
uint32_t original_size)
{
uint32_t i = 0;
uint64_t * p_original = (uint64_t *) original;
uint32_t * p_grayscale = (uint32_t *) grayscale;
uint32_t size = original_size / 8;
for (i = 0; i < size; i += 4) {
p_grayscale[i] = (p_original[i] & 0x00000000000000ffU) +
((p_original[i] & 0x0000000000ff0000U) >> 8) +
((p_original[i] & 0x000000ff00000000U) >> 16) +
((p_original[i] & 0x00ff000000000000U) >> 24);
p_grayscale[i + 1] = (p_original[i + 1] & 0x00000000000000ffU) +
((p_original[i + 1] & 0x0000000000ff0000U) >> 8) +
((p_original[i + 1] & 0x000000ff00000000U) >> 16) +
((p_original[i + 1] & 0x00ff000000000000U) >> 24);
p_grayscale[i + 2] = (p_original[i + 2] & 0x00000000000000ffU) +
((p_original[i + 2] & 0x0000000000ff0000U) >> 8) +
((p_original[i + 2] & 0x000000ff00000000U) >> 16) +
((p_original[i + 2] & 0x00ff000000000000U) >> 24);
p_grayscale[i + 3] = (p_original[i + 3] & 0x00000000000000ffU) +
((p_original[i + 3] & 0x0000000000ff0000U) >> 8) +
((p_original[i + 3] & 0x000000ff00000000U) >> 16) +
((p_original[i + 3] & 0x00ff000000000000U) >> 24);
}
}
Well that sure is uglier... how about speed?
$ ./main
naive_gc : 1759.348000 us
naive_precalculate_gc : 1485.786000 us
naive_descending_gc : 1389.732000 us
large_pointers_gc : 719.982000 us
large_pointers_descending_gc : 713.292000 us
large_pointers_unrolled_gc : 780.634000 us
So that messy code isn't worth it. It actually makes sense: unrolling does make our code iterate less, but it also adds load to the body of the for-loop. There's a tradeoff between the two, and here we have clearly crossed the line.
Well that's pretty interesting, but this whole time we have been just using our ideas
to make our algorithm faster: let's ask the compiler for some help! I'm going to add
-O3
in my compiler flags and make
everything again. Let's see what that does:
$ make clean all
rm -rf main.s
gcc -Wall -O3 -S main.c
gcc -Wall -O3 main.s -o main
$ ./main
naive_gc : 69.988000 us
naive_precalculate_gc : 69.516000 us
naive_descending_gc : 326.612000 us
large_pointers_gc : 162.852000 us
large_pointers_descending_gc : 331.648000 us
large_pointers_unrolled_gc : 138.106000 us
Oh how the tables have turned! It seems that almost all of our "optimized" functions are
actually way slower than the "naive" function we started with. That's actually pretty
unsurprising as well; gcc
(as most compilers) does an excellent job of optimizing
code when you tell it to. We can see, however, that calculating the end condition once
does improve execution time a bit!
These are some very interesting results, but I want to see what happens on an actual MCU; the compiler is not exactly the same, and there's no underlying OS. Let's see if we can come up with better algorithms on an MCU!
Implementing on ARM Cortex-M3
I happen to own a NUCLEO-F103RB board by ST, which will do nicely for this test. If you want to follow along, you can check out Ayoub Sabri's STM32 primer1, in which he explains how to set up a basic blinky project on STM32, and you can move from there. As for profiling, I'm using Sergey Bashlayev's STM32 Profiler, because it's both lightweight and microsecond-accurate.
Keep in mind that I heavily reduced the image dimensions for this test so as to not
trigger any stack overflows; the board I'm using only has 20 KB of SRAM so I'm reducing
the dimensions to 80 by 60. If you want to compare the execution times, keep in mind
that the system clock's frequency is 64 MHz. I've compiled with the -O3
flag, and it
should be noted that, at least in this case, I see no difference between -O3
and
-Ofast
. Other than that, the same algorithms are used. Let's see the results:
naive_gc : 450.658 us
naive_precalculate_gc : 375.646 us
naive_descending_gc : 375.616 us
large_pointers_gc : 197.45 us
large_pointers_descending_gc : 216.188 us
large_pointers_unrolled_gc : 195.386 us
Okay, so we can see that different compilers (here I'm using arm-none-eabi-gcc
btw)
behave very differently. Now the first naive function is the slowest, and the large
pointers seem to make a lot of difference. The best result so far is the unrolled
version of the large pointers, so what would happen if we were to unroll... more?
main.c
void large_pointers_unrolled_more_gc (uint8_t * original,
uint8_t * grayscale,
uint32_t original_size)
{
uint32_t i = 0;
uint64_t * p_original = (uint64_t *) original;
uint32_t * p_grayscale = (uint32_t *) grayscale;
uint32_t size = original_size / 8;
for (i = 0; i < size; i += 8) {
p_grayscale[i] = (p_original[i] & 0x00000000000000ffU) +
((p_original[i] & 0x0000000000ff0000U) >> 8) +
((p_original[i] & 0x000000ff00000000U) >> 16) +
((p_original[i] & 0x00ff000000000000U) >> 24);
p_grayscale[i + 1] = (p_original[i + 1] & 0x00000000000000ffU) +
((p_original[i + 1] & 0x0000000000ff0000U) >> 8) +
((p_original[i + 1] & 0x000000ff00000000U) >> 16) +
((p_original[i + 1] & 0x00ff000000000000U) >> 24);
p_grayscale[i + 2] = (p_original[i + 2] & 0x00000000000000ffU) +
((p_original[i + 2] & 0x0000000000ff0000U) >> 8) +
((p_original[i + 2] & 0x000000ff00000000U) >> 16) +
((p_original[i + 2] & 0x00ff000000000000U) >> 24);
p_grayscale[i + 3] = (p_original[i + 3] & 0x00000000000000ffU) +
((p_original[i + 3] & 0x0000000000ff0000U) >> 8) +
((p_original[i + 3] & 0x000000ff00000000U) >> 16) +
((p_original[i + 3] & 0x00ff000000000000U) >> 24);
p_grayscale[i + 4] = (p_original[i + 4] & 0x00000000000000ffU) +
((p_original[i + 4] & 0x0000000000ff0000U) >> 8) +
((p_original[i + 4] & 0x000000ff00000000U) >> 16) +
((p_original[i + 4] & 0x00ff000000000000U) >> 24);
p_grayscale[i + 5] = (p_original[i + 5] & 0x00000000000000ffU) +
((p_original[i + 5] & 0x0000000000ff0000U) >> 8) +
((p_original[i + 5] & 0x000000ff00000000U) >> 16) +
((p_original[i + 5] & 0x00ff000000000000U) >> 24);
p_grayscale[i + 6] = (p_original[i + 6] & 0x00000000000000ffU) +
((p_original[i + 6] & 0x0000000000ff0000U) >> 8) +
((p_original[i + 6] & 0x000000ff00000000U) >> 16) +
((p_original[i + 6] & 0x00ff000000000000U) >> 24);
p_grayscale[i + 7] = (p_original[i + 7] & 0x00000000000000ffU) +
((p_original[i + 7] & 0x0000000000ff0000U) >> 8) +
((p_original[i + 7] & 0x000000ff00000000U) >> 16) +
((p_original[i + 7] & 0x00ff000000000000U) >> 24);
}
}
Okay, so you should definitely never ever write code like this, but for the sake of being thorough let's run it and see if it's faster:
naive_gc : 450.658 us
naive_precalculate_gc : 375.646 us
naive_descending_gc : 375.616 us
large_pointers_gc : 197.45 us
large_pointers_descending_gc : 216.188 us
large_pointers_unrolled_gc : 195.386 us
large_pointers_unrolled_more_gc : 200.178 us
Not really. Yet again, we can see that more unrolling doesn't always mean more optimization.
Conclusions
I think this little experiment has made it very clear that optimization depends heavily on the platform we are working on, meaning the hardware as well as the toolchain we use to generate our binaries; as we saw, two different compilers may optimize things very differently. That said, there are a few optimization techniques we can keep in mind (at least when working on embedded platforms):
- Use -O3 or -Ofast. Compilers are, most of the time, very clever about optimizing code. It takes a few keystrokes to add a flag to your Makefile, but it can end up cutting down on your execution time by a lot. Use that to your advantage.
- Avoid repeating calculations, memory accesses etc. Most loops will run for quite a few iterations, so all these small overheads add up. Try to have the minimum number of memory reads/writes and (especially costly) operations in your loops.
- Favor descending loops. We saw that in desktop processors this doesn't seem to help all that much, however in MCUs it makes a world of difference (just compare the execution times of naive_gc and naive_descending_gc when run on the STM32 MCU).
- If you really need to cut down on time, give unrolling a try. It might not always work or your compiler might do it automatically when optimizing for speed, but it's worth a try. However, keep in mind that there's a tradeoff between loop iterations and instructions in the loop, so don't overdo it; try and find the sweet spot where it actually makes your code run faster.
Also, a final point on MCUs: be very careful about how you profile your code. Doing
things that seem trivial in a desktop processor (such as calls to printf
to print some
debugging data) can noticeably slow down your code. Also try and use the lowest level of
timing possible in order to get maximum precision; the profiler I linked to in the
previous section performs extremely well in ARM Cortex-M CPUs, because it uses the
debug interface of the CPU which was precisely made for profiling.
If you have any questions or suggestions for additions to this post, don't hesitate to send me an email!
Acknowledgments
I'd like to thank Ayoub for his input on my optimization methodology, as well as his advice on how to efficiently profile code on STM32. :)
Code
In case you want to quickly test this for yourself, the full code for the desktop version is simply the snippets above assembled into a single main.c: the includes and defines first, then the conversion functions, followed by setup_image, test_conversion, time_conversion and main().
1. Unfortunately, the link is now dead :( ↩