The examples are fun, but probably more useful than yet another article saying how amazing optimizing compilers are (they are, I already know), would be an article with examples where obvious optimizations were missed.
eg 1:
int bar(int num) { return num / 2; }
Doesn't get optimized to a single shift right, because the that won't work if num is negative. In this case we can change the ints to unsigneds to tell the compiler we know the number isn't negative. But it isn't always easy to express to the compiler everything you know about your data and use case.
int foo(char const *s) {
if (strlen(s) < 3) return 0;
if (strcmp(s, "hello") == 0)
return 1;
return 0;
}
This function returns 1 if s is "hello". 0 otherwise. I've added a pointless strlen(). It seems like no compiler is clever enough to notice. https://godbolt.org/z/Koj65eo5K
For people who enjoy these blogs, you would definitely like the Julia REPL as well. I used to play with this a lot to discover compiler things.
For example:
$ julia
julia> function f(n)
total = 0
for x in 1:n
total += x
end
return total
end
julia> @code_native f(10)
...
sub x9, x0, #2
mul x10, x8, x9
umulh x8, x8, x9
extr x8, x8, x10, #1
add x8, x8, x0, lsl #1
sub x0, x8, #1
ret
...
it shows this with nice colors right in the REPL.
In the example above, you see that LLVM figured out the arithmetic series and replaced the loop with a simple multiplication.
This and add_v3 in the OP fall into the general class of Scalar Evolution optimizations (SCEV). LLVM for example is able to handle almost all Brainfuck loops in practice---add_v3 indeed corresponds to a Brainfuck loop `[->+<]`---, and its SCEV implementation is truly massive: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Anal...
I'm wondering how the compiler optimised add_v3() and add_v4() though.
Was it through "idiom detection", i.e. by recognising those specific patterns, or did the compiler deduce the answers them through some more involved analysis?
I always code with the mindset “the compiler is smarter than me.” No need to twist my logic around attempting to squeeze performance out of the processor - write something understandable to humans, let the computer do what computers do.
This is decent advice in general, but it pays off to try and express your logic in a way that is machine friendly. That mostly means thinking carefully about how you organize the data you work with. Optimizers generally don't change data structures or memory layout but that can make orders of magnitude difference in the performance of your program. It is also often difficult to refactor later.
I find the same too. I find gcc and clang can inline functions, but can't decide to break apart a struct used only among those inlined functions and make every struct member a local variable, and then decide that one or more of those local variables should be allocated as a register for the full lifetime of the function, rather than spill onto the local stack.
So if you use a messy solution where something that should be a struct and operated on with functions, is actually just a pile of local variables within a single function, and you use macros operating on local variables instead of inlineable functions operating on structs, you get massively better performance.
The ability of turning stack allocated variables into locals(which can be then put in registers) is one of the most important passes of modern compilers.
Since compilers use SSA, where locals are immutable while lots of languages, like C have mutable variables, some compiler frontends put locals onto the stack, and let the compiler figure out what can be put into locals and how.
That's really good; clearly I haven't looked at more recent versions. The magic seems to happen in your link at SROAPass, "Scalar Replacement Of Aggregates". Very cool!
> This pass takes allocations which can be completely analyzed (that is, they don't escape) and tries to turn them into scalar SSA values.
That's actually a useful hint to me. When I was trying to replace locals and macros with a struct and functions, I also used the struct directly in another struct (which was the wider source of persistence across functions), so perhaps this pass thought the struct _did_ escape. I should revisit my code and see if I can tweak it to get this optimisation applied.
I guess the chances of the compiler doing something smart increases with link-time optimizations and when keeping as much as possible inside the same "compilation unit". (In practice in the same source file.)
To make a more specific example, if you malloc()/free() within a loop, it's unlikely that the compiler will fix that for you. However, moving those calls outside of the loop (plus maybe add some realloc()s within, only if needed) is probably going to perform better.
That is something that can be easily found and usually fixed with trivial profiling. I'm more talking about data locality instead of pointer chasing. Once you set up a pointer-chasing data infrastructure changing that means rewriting most of your application.
I would take it one step further, often trying to eke out performance gains with clever tricks can hurt performance by causing you to "miss the forest for the trees".
I work with Cuda kernels a lot for computer vision. I am able to consistently and significantly improve on the performance of research code without any fancy tricks, just with good software engineering practices.
By organising variables into structs, improving naming, using helper functions, etc... the previously impenetrable code becomes so much clearer and the obvious optimisations reveal themselves.
Not to say there aren't certain tricks / patterns / gotchas / low level hardware realities to keep in mind, of course.
> I always code with the mindset “the compiler is smarter than me.”
Like with people in general, it depends on what compiler/interpreter we're talking about, I'll freely grant that clang is smarter than me, but CPython for sure isn't. :)
More generally, canonicalization goes very far, but no farther than language semantics allows. Not even the notorious "sufficiently smart compiler" with infinite time can figure out what you don't tell it.
To add to this, the low-level constraints also make this assumption noisy, no matter how smart the compiler is. On the CPython case, if you do `dis.dis('DAY = 24 * 60 * 60)` you will see that constant folding nicely converts it to `LOAD_CONST 86400`. However, if you try `dis.dis('ATOMS_IN_THE_WORLD = 10*50')` you will get LOAD_CONST 10, LOAD_CONST 50, BINARY_OP **.
I go with "You are responsible for the algorithms, it is responsible for the code micro optimizations". The compiler can't optimize you out of an SQL N+1 situation, that is on me to avoid, but it is better than me at loop unrolling.
> I always code with the mindset “the compiler is smarter than me.”
...I don't know... for instance the MSVC compiler creates this output for the last two 'non-trivial' functions with '/Ox':
add w8,w1,w0
cmp w0,#0
cseleq w0,w1,w8
Even beginner assembly coders on their first day wouldn't write such bullshit :)
A better mindset is "don't trust the compiler for code that's actually performance sensitive".
You shouldn't validate each line of compiler output, but at least for the 'hot areas' in the code base that definitely pays off, because sometimes compilers do really weird shit for no good reason (often because of 'interference' between unrelated optimizer passes) - and often you don't need to dig deep to stumble over weird output like in the example above.
I see the msvc arm compiler has not improved much in 20 years. The msvc arm was pretty odd when we used it in ~2003. We did not trust it at all. Think we had to get 4 or so compiler fixes out of MS for that project plus 3 or 4 library fixes. The x86 one was pretty solid. We were targeting 4 different CPU platforms at the same time so we could find things like that decently quickly. Most of the the time it was something we did that was weird. But even then we would find them. That one looks like maybe the optimizer back filled a nop slot?
The fact that compilers are smart isn't an excuse to not think about performance at all. They can't change your program architecture, algorithms, memory access patterns, etc.
You can mostly not think about super low level integer manipulation stuff though.
I would modify this a bit. Someone with decent computer architecture knowledge, tools, and time can generally do better than the compiler. But you generally won't, because you have a lot of other things to think about. So I'd state this as, "the compiler is more diligent and consistent than me." It's not so much that it can spot a for loop that's equivalent to a single add, but that it will spot it just about every time, so you don't have to worry about it.
You say that, but I was able to reduce the code size of some avr8 stuff I was working on by removing a whole bunch of instructions that zero out registers and then shift a value around. I don't it to literally shift the top byte 24 bits to the right and zero out the upper 24 bits, I just need it to pass the value in the top 8 bits direct to the next operation.
I agree that most people are not writing hand-tuned avr8 assembly. Most people aren't attempting to do DSP on 8-bit AVRs either.
What I am curious about is, is the compiler smart enough to be lazy with computation and or variables? For example consider:
let a = expr
let b = expr2
if (a || b) { return true; }
is the compiler allowed to lazily compute this if it is indeed faster to do that way? Or declaring a bunch of variables that may or may not be used in all of the branches. Is the compiler smart enough to only compute them whenever it is necessary? AFAIK this is now allowed in C-like languages. Things have to materialize. Another one is, I like to do memcpy every single time eventhough it might not even be used or overwritten by other memcpys. Is the compiler smart enough to not perform those and reorder my program so that only the last relevant memcpy is performed?
A lot of times, my code becomes ugly because I don't trust that it does any of this. I would like t write code in consistent and simple ways but I need compilers to be much smarter than it is today.
A bad example recently is something like
const S * s =;
let a = constant;
let b = constant;
let c = constant;
let d = constant;
let e = constant;
let f = constant;
let g = constant;
let h = constant;
let i = constant;
let j = constant;
let k = constant;
let l = constant;
if (s->a == a && s->b == b /* etc */ ) { return true; }
It did not turn all of this into a SIMD mask or something like that.
> Is the compiler smart enough to only compute them whenever it is necessary?
This is known as "code sinking," and most optimizers are capable of doing this. Except keep in mind that a) the profitability of doing so is not always clear [1] and b) the compiler is a lot more fastidious about corner-case behavior than you are, so it might conclude that it's not in fact safe to sink the operation when you think it is safe to do so.
[1] If the operation to sink is x = y + z, you now may need to keep the values of y and z around longer to compute the addition, increasing register pressure and potentially hurting performance as a result.
I wonder if compilers do multiple passes on the intermediate code in order to optimize / simplify it. For example, during each pass the optimizer searches some known harcoded patterns and replaces them with something else and repeats until no possible improvement is found.
Also optimizers have a limit, they can't reason as abstractly as humans, for example:
bool is_divisible_by_6(int x) {
return x % 2 == 0 && x % 3 == 0;
}
bool is_divisible_by_6_optimal(int x) {
return x % 6 == 0;
}
I tried with both gcc and clang, the asm code for is_divisible_by_6 is still less optimal. So no, there are plenty of easy ways to fool the optimizer by obfuscation.
The morale is that you still have to optimize algorithms (O notation) and math operations / expressions.
So you claim that the compiler "knows about this but doesn't optimize because of some safety measures"? As far as I remember, compilers don't optimize math expressions / brackets, probably because the order of operations might affect the precision of ints/floats, also because of complexity.
But my example is trivial (x % 2 == 0 && x % 3 == 0 is exactly the same as x % 6 == 0 for all C/C++ int), yet the compiler produced different outputs (the outputs are different and most likely is_divisible_by_6 is slower). Also what null (you mean 0?) checks are you talking about? The denominator is not null/0. Regardless, my point about not over relying on compiler optimization (especially for macro algorithms (O notation) and math expressions) remains valid.
x % 3 == 0 is an expression without side effects (the only cases that trap on a % operator are x % 0 and INT_MIN % -1), and thus the compiler is free to speculate the expression, allowing the comparison to be converted to (x % 2 == 0) & (x % 3 == 0).
Yes, compilers will tend to convert && and || to non-short-circuiting operations when able, so as to avoid control flow.
They do, and the order of the passes matter. Sometimes, optimizations are missed because they require a certain order of passes that is different from the one your compiler uses.
On higher optimization levels, many passes occur multiple times. However, as far as I know, compilers don't repeatedly run passes until they've reached an optimum. Instead, they run a fixed series of passes. I don't know why, maybe someone can chime in.
I don't know enough about ASM. Are u saying the first one is more optimal because it is faster or because it uses less instructions? Would this reflect a real world use case? Do any other compilers (e.g. V8) optimize modulo's into something else?
The compiler didn't recognize that x % 2 == 0 && x % 3 == 0 is exactly the same as x % 6 == 0 for all C/C++ int values. In theory a compiler could detect that and generate identical code for both functions, but it isn't done because this case is "niche" despite being trivial. My point is not to over rely on optimizer for math expressions and algorithms.
This post assumes C/C++ style business logic code.
Anything HPC will benefit from thinking about how things map onto hardware (or, in case of SQL, onto data structures).
I think way too few people use profilers. If your code is slow, profiling is the first tool you should reach for. Unfortunately, the state of profiling tools outside of NSight and Visual Studio (non-Code) is pretty disappointing.
unsigned add(unsigned x, unsigned y) {
unsigned a, b;
do {
a = x & y; /* every position where addition will generate a carry */
b = x ^ y; /* the addition, with no carries */
x = a << 1; /* the carries */
y = b;
/* if there were any carries, repeat the loop */
} while (a);
return b;
}
It's easy to show that this algorithm is correct in the sense that, when b is returned, it must be equal to x+y. x+y summing to a constant is a loop invariant, and at termination x is 0 and y is b.
It's a little more difficult to see that the loop will necessarily terminate.
New a values come from a bitwise & of x and y. New x values come from a left shift of a. This means that, if x ends in some number of zeroes, the next value of a will also end in at least that many zeroes, and the next value of x will end in an additional zero (because of the left shift). Eventually a will end in as many zeroes as there are bits in a, and the loop will terminate.
Optimizing some critical code section and beating the compiler gives you a rush. But last time for me was around 8 years ago. They got really good.
Only higher level stuff like data reordering and latency hiding are left. At least for me. And even some of that can be automated with profile-guided optimization.
Obvious caveat: pushing this a bit further it can quickly fallback to the default case. The optimizer is a superpower but you still need to try to write efficient code.
unsigned add_v5(unsigned x, unsigned y) {
if (x == y) return 2 * x;
return x + y;
}
If compiler folks can chime in, I'm curious why incrementing in a loop can be unrolled and inspected to optimize to an addition, but doubling the number when both operands are equal can't?
> I'm curious why incrementing in a loop can be unrolled and inspected to optimize to an addition, but doubling the number when both operands are equal can’t?
I expect because the former helps more in optimising real-world code than the latter. It’s not worth the LLVM developer's time to make the compiler better for programs that it won’t see in practice.
It’s not as if the compiler did nothing with that code, though. It replaced the multiplication by a left shift and removed the branch.
> If compiler folks can chime in, I'm curious why incrementing in a loop can be unrolled and inspected to optimize to an addition, but doubling the number when both operands are equal can't?
Compilers are essentially massive towers of heuristics for which patterns to apply for optimization. We don't throw a general SMT solver at your code because that takes way too long to compile; instead, we look at examples of actual code and make reasonable efforts to improve code.
In the case of the incrementing in a loop, there is a general analysis called Scalar Evolution that recasts expressions as an affine expression of canonical loop iteration variables (i.e., f(x), where x is 0 on the first loop iteration, 1 on the second, etc.). In the loop `while (x--) y++;`, the x variable [at the end of each loop iteration] can be rewritten as x = x₀ + -1*i, while the y variable is y = y₀ + 1*i. The loop trip count can be solved to an exact count, so we can replace the use of y outside the loop with y = y₀ + 1*trip count = y₀ + x, and then the loop itself is dead and can be deleted. These are all optimizations that happen to be quite useful in other contexts, so it's able to easily recognize this form of loop.
In the example you give, the compiler has to recognize the equivalence of two values conditional on control flow. The problem is that this problem really starts to run into the "the time needed to optimize this isn't worth the gain you get in the end." Note that there are a lot of cases where you have conditional joins (these are "phis" in SSA optimizer parlance), most of which aren't meaningfully simplifiable, so you're cutting off the analysis for all but the simplest cases. At a guess, the simplification is looking for all of the input values to be of the same form, but 2 * x (which will actually be canonicalized to x << 1) is not the same form as x + y, so it's not going to see if the condition being used to choose between the same values would be sufficient to make some operation return the same value. There are representations that make this problem much easier (egraphs), but these are not the dominant form for optimizers at present.
This sort of pattern can't be found by incremental lowering (and isn't common enough to have more sophisticated analysis written for it) so it ends up in a local maximum.
Basically the idea for most compilers is to do a series of transforms which incrementally improve the program (or at least make it worse in understood and reversible ways). To do this transform you need the optimizer to do the (not always trivial) proof that the 2*x is equivalent to x+y, do the replacement, do the gvn to duplicate the adds and finally do the branch elimination. Each of these steps is however totally separate from one another and the first one doesn't trigger since as far as it's concerned a shift left is faster than an add so why should it do the replacement.
This is all even more complicated since what representation is faster can depend on the target.
I agree, but GCC manages the optimization, and not all optimizations need to take fewer cycles. The single instruction version is obviously better for -Os and it would probably be a win in general.
I’m not a compiler expert, an assembly expert or an ARM expert, so this may be wildly wrong, but this looks optimized to me.
The trick is that it’s doing both the add and the left shift in parallel then selecting which to use based on a compare of the two values with csel.
(To see this, rather than reading the code sequentially, think of every instruction as being issued at the same time until you hit an instruction that needs a destination register from an earlier instruction)
The add is stored in W9 but only read if the two arguments are unequal.
If the compare succeeds and the lsl retires before the add, the add is never read, so nothing stalls waiting for it and the answer can be returned while the add is still in flight. The result of the add would then be quietly discarded assuming it ever started (maybe there’s some magic where it doesn’t even happen at all?).
It’s not clear to me that this is power efficient, or that on many real cpus there’s a latency difference to exploit between add and lsl, so it may not be faster than just unconditionally doing the addition.
That said, it is definitely faster than the code as it was written which if translated to asm verbatim stalls on the compare before executing either the add or the left shift.
It's not. Why would lsl+csel or add+csel or cmp+csel ever be faster than a simple add? Or have higher throughput? Or require less energy? An integer addition is just about the lowest-latency operation you can do on mainstream CPUs, apart from register-renaming operations that never leave the front-end.
With this one I instead wondered: If there are 4 functions doing exactly the same thing, couldn't the compiler also only generate the code for one of them?
E.g. if in `main` you called two different add functions, couldn't it optimize one of them away completely?
It probably shouldn't do that if you create a dynamic library that needs a symbol table but for an ELF binary it could, no? Why doesn't it do that?
If your language has monomorphization† (as C++ and Rust do) then it's really common to have this commonality in the emitted code and I believe it is common for compilers to detect and condense the resulting identical machine code. If the foo<T> function for an integer checks if it's equal to four, it well be that on your target hardware that's the same exact machine code whether the integer types T are 1 byte, 2 bytes or 4 bytes and whether they're signed or unsigned, so we should only emit one such implementation of foo, not six for u8, i8, u16, i16, u32 and i32.
† Monomorphization takes Parametrically Polymorphic functions, ie functions which are strongly typed but those types are parameters at compile time, and it emits distinct machine code for each needed variation of the function, so e.g. add(a, b) maybe gets compiled to produce add_integer(a, b) and add_float(a, b) and add_matrix(a, b) even though we only wrote one function, and then code which calls add(a, b) with matrices, is at compile time emitted as calling add_matrix(a, b), because the compiler knew it needs that version. In C++ the number of parameters is also potentially allowed to vary between callers so add_matrix(a, b, c, d) might exist too, this feature is not yet available in Rust.
I'm sure it could also perform definition merging like you suggest but I can't think of a way of triggering it at the moment without also triggering their complete elision.
This is not quite what you asked, I think, but GCC is able to remove duplicate functions and variables after code generation via the -fipa-icf options:
> Perform Identical Code Folding for functions (-fipa-icf-functions), read-only variables (-fipa-icf-variables), or both (-fipa-icf). The optimization reduces code size and may disturb unwind stacks by replacing a function by an equivalent one with a different name. The optimization works more effectively with link-time optimization enabled.
In addition, the Gold linker supports a similar feature via `--icf={safe,all}`:
> Identical Code Folding. '--icf=safe' Folds ctors, dtors and functions whose pointers are definitely not taken
> It probably shouldn't do that if you create a dynamic library that needs a symbol table but for an ELF binary it could, no?
It can't do that because the program might load a dynamic library that depends on the function (it's perfectly OK for a `.so` to depend on a function from the main executable, for example).
That's one of the reasons why a very cheap optimization is to always use `static` for functions when you can. You're telling the compiler that the function doesn't need to be visible outside the current compilation unit, so the compiler is free to even inline it completely and never produce an actual callable function, if appropriate.
Sadly most C++ projects are organized in a way that hampers static functions. To achieve incremental builds, stuff is split into separate source files that are compiled and optimized separately, and only at the final step linked, which requires symbols of course.
I get it though, because carefully structuring your #includes to get a single translation unit is messy, and compile times get too long.
That’s where link-time optimization enters the picture. It’s expensive but tolerable for production builds of small projects and feasible for mid-sized ones.
> It can't do that because the program might load a dynamic library that depends on the function
That makes perfect sense, thank you!
And I just realized why I was mistaken. I am using fasm with `format ELF64 executable` to create a ELF file. Looking at it with a hex editor, it has no sections or symbol table because it creates a completely stripped binary.
The MSVC linker has a feature where it will merge byte-for-byte identical functions. It's most noticeable for default constructors, you might get hundreds of functions which all boil down to "zero the first 32 bytes of this type".
Nope. Function with external linkage are required to have different addresses. MSVC actually breaks this and this means that you can't reliably compare function pointers on MSVC because some different functions may happen to have same object code by chance:
But it could generate one then make the remaining three tail call to that one, or lay them out so that they are at 1byte-nop each to the next one and fallthrough the next until the last one implements the logic (This is a bit more compilcated on msvc as I believe the ABI requires a well defined prologue).
They can't be at 1byte-nop distance because pointer addresses as well as branch target addresses are expected to be aligned for performance reasons - often to 16 bytes. You need either a nop sequence or a jump/tailcall.
"The compiler" and "The optimizer" are doing a lot of the heavy lifting here in the argument. I definitely know compilers and optimizers which are not that great. Then again, they are not turning C++ code into ARM instructions.
You absolutely can fool a lot of compilers out there! And I am not only looking at you, NVCC.
But the point should be to follow the optimization cycle: develop, benchmark, evaluate, profile, analyze, optimize. Writing performant code is no joke and very often destroys readability and introduces subtle bugs, so before trying to oursmart the compiler, evaluate if what it produces is good enough already
Awesome blog post - thanks to this I found out that you can view what the LLVM optimizer pipeline does, and which pass is actually responsible for doing which instruction.
It's super cool to see this in practice, and for me it helps putting more trust in the compiler that it does the right thing, rather than me trying to micro-optimize my code and peppering inline qualifiers everywhere.
I liked the idea behind this post, but really the author fairly widely missed the mark in my opinion.
The extent to which you can "fool the optimizer" is highly dependent on the language and the code you're talking about. Python is a great example of a language that is devilishly hard to optimize for precisely because of the language semantics. C and C++ are entirely different examples with entirely different optimization issues, usually which have to do with pointers and references and what the compiler is allowed to infer.
The point? Don't just assume your compiler will magically make all your performance issues go away and produce optimal code. Maybe it will, maybe it won't.
As always, the main performance lessons should always be "1) Don't prematurely optimize", and "2) If you see perf issues, run profilers to try to definitively nail where the perf issue is".
Wait, why does GAS use Intel syntax for ARM instead of AT&T? Or something that looks very much like it: the destination is the first operand, not the last, and there is no "%" prefix for the register names?
I want an AI optimization helper that recognizes patterns that could-almost be optimized if I gave it a little help, e.g. hints about usage, type, etc.
The examples are fun, but probably more useful than yet another article saying how amazing optimizing compilers are (they are, I already know), would be an article with examples where obvious optimizations were missed.
eg 1:
Doesn't get optimized to a single shift right, because the that won't work if num is negative. In this case we can change the ints to unsigneds to tell the compiler we know the number isn't negative. But it isn't always easy to express to the compiler everything you know about your data and use case.eg 2:
We all know that strlen will return 5, but some compilers don't: https://godbolt.org/z/M7x5qraE6eg 3:
This function returns 1 if s is "hello". 0 otherwise. I've added a pointless strlen(). It seems like no compiler is clever enough to notice. https://godbolt.org/z/Koj65eo5KFor people who enjoy these blogs, you would definitely like the Julia REPL as well. I used to play with this a lot to discover compiler things.
For example:
it shows this with nice colors right in the REPL.In the example above, you see that LLVM figured out the arithmetic series and replaced the loop with a simple multiplication.
This and add_v3 in the OP fall into the general class of Scalar Evolution optimizations (SCEV). LLVM for example is able to handle almost all Brainfuck loops in practice---add_v3 indeed corresponds to a Brainfuck loop `[->+<]`---, and its SCEV implementation is truly massive: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Anal...
LLVM can do more complex sums, too. See https://kristerw.blogspot.com/2019/04/how-llvm-optimizes-geo...
I'm wondering how the compiler optimised add_v3() and add_v4() though.
Was it through "idiom detection", i.e. by recognising those specific patterns, or did the compiler deduce the answers them through some more involved analysis?
add_v3() is the result of induction variable simplification: https://llvm.org/doxygen/IndVarSimplify_8cpp_source.html
I always code with the mindset “the compiler is smarter than me.” No need to twist my logic around attempting to squeeze performance out of the processor - write something understandable to humans, let the computer do what computers do.
This is decent advice in general, but it pays off to try and express your logic in a way that is machine friendly. That mostly means thinking carefully about how you organize the data you work with. Optimizers generally don't change data structures or memory layout but that can make orders of magnitude difference in the performance of your program. It is also often difficult to refactor later.
I find the same too. I find gcc and clang can inline functions, but can't decide to break apart a struct used only among those inlined functions and make every struct member a local variable, and then decide that one or more of those local variables should be allocated as a register for the full lifetime of the function, rather than spill onto the local stack.
So if you use a messy solution where something that should be a struct and operated on with functions, is actually just a pile of local variables within a single function, and you use macros operating on local variables instead of inlineable functions operating on structs, you get massively better performance.
e.g.
The nice thing about godbolt is that it can show you that clang not only can but do it in theory but also does it in practice:
https://aoco.compiler-explorer.com/#g:!((g:!((g:!((h:codeEdi...
The ability of turning stack allocated variables into locals(which can be then put in registers) is one of the most important passes of modern compilers.
Since compilers use SSA, where locals are immutable while lots of languages, like C have mutable variables, some compiler frontends put locals onto the stack, and let the compiler figure out what can be put into locals and how.
That's really good; clearly I haven't looked at more recent versions. The magic seems to happen in your link at SROAPass, "Scalar Replacement Of Aggregates". Very cool!
According to https://docs.hdoc.io/hdoc/llvm-project/r2E8025E445BE9CEE.htm...
> This pass takes allocations which can be completely analyzed (that is, they don't escape) and tries to turn them into scalar SSA values.
That's actually a useful hint to me. When I was trying to replace locals and macros with a struct and functions, I also used the struct directly in another struct (which was the wider source of persistence across functions), so perhaps this pass thought the struct _did_ escape. I should revisit my code and see if I can tweak it to get this optimisation applied.
I guess the chances of the compiler doing something smart increases with link-time optimizations and when keeping as much as possible inside the same "compilation unit". (In practice in the same source file.)
To make a more specific example, if you malloc()/free() within a loop, it's unlikely that the compiler will fix that for you. However, moving those calls outside of the loop (plus maybe add some realloc()s within, only if needed) is probably going to perform better.
That is something that can be easily found and usually fixed with trivial profiling. I'm more talking about data locality instead of pointer chasing. Once you set up a pointer-chasing data infrastructure changing that means rewriting most of your application.
I would take it one step further, often trying to eke out performance gains with clever tricks can hurt performance by causing you to "miss the forest for the trees".
I work with Cuda kernels a lot for computer vision. I am able to consistently and significantly improve on the performance of research code without any fancy tricks, just with good software engineering practices.
By organising variables into structs, improving naming, using helper functions, etc... the previously impenetrable code becomes so much clearer and the obvious optimisations reveal themselves.
Not to say there aren't certain tricks / patterns / gotchas / low level hardware realities to keep in mind, of course.
> I always code with the mindset “the compiler is smarter than me.”
Like with people in general, it depends on what compiler/interpreter we're talking about, I'll freely grant that clang is smarter than me, but CPython for sure isn't. :)
More generally, canonicalization goes very far, but no farther than language semantics allows. Not even the notorious "sufficiently smart compiler" with infinite time can figure out what you don't tell it.
To add to this, the low-level constraints also make this assumption noisy, no matter how smart the compiler is. On the CPython case, if you do `dis.dis('DAY = 24 * 60 * 60)` you will see that constant folding nicely converts it to `LOAD_CONST 86400`. However, if you try `dis.dis('ATOMS_IN_THE_WORLD = 10*50')` you will get LOAD_CONST 10, LOAD_CONST 50, BINARY_OP **.
> “the compiler is smarter than me.”
This is true, but it also means "the compiler IS made for someone median smart, that now knows the machine".
It works great for basic, simple, common code, and for code that is made with care for data structures.
A total mess of code is another story.
P.D: is similar to the query optimizers, that neither can outrun a terrible made schema and queries
This is very often true when your data is sitting right there on the stack.
Though when your data is behind pointers, it's very easy to write code that the compiler can no longer figure out how to optimize.
I go with "You are responsible for the algorithms, it is responsible for the code micro optimizations". The compiler can't optimize you out of an SQL N+1 situation, that is on me to avoid, but it is better than me at loop unrolling.
> I always code with the mindset “the compiler is smarter than me.”
...I don't know... for instance the MSVC compiler creates this output for the last two 'non-trivial' functions with '/Ox':
Even beginner assembly coders on their first day wouldn't write such bullshit :)A better mindset is "don't trust the compiler for code that's actually performance sensitive".
You shouldn't validate each line of compiler output, but at least for the 'hot areas' in the code base that definitely pays off, because sometimes compilers do really weird shit for no good reason (often because of 'interference' between unrelated optimizer passes) - and often you don't need to dig deep to stumble over weird output like in the example above.
I see the msvc arm compiler has not improved much in 20 years. The msvc arm was pretty odd when we used it in ~2003. We did not trust it at all. Think we had to get 4 or so compiler fixes out of MS for that project plus 3 or 4 library fixes. The x86 one was pretty solid. We were targeting 4 different CPU platforms at the same time so we could find things like that decently quickly. Most of the the time it was something we did that was weird. But even then we would find them. That one looks like maybe the optimizer back filled a nop slot?
The fact that compilers are smart isn't an excuse to not think about performance at all. They can't change your program architecture, algorithms, memory access patterns, etc.
You can mostly not think about super low level integer manipulation stuff though.
I would modify this a bit. Someone with decent computer architecture knowledge, tools, and time can generally do better than the compiler. But you generally won't, because you have a lot of other things to think about. So I'd state this as, "the compiler is more diligent and consistent than me." It's not so much that it can spot a for loop that's equivalent to a single add, but that it will spot it just about every time, so you don't have to worry about it.
You say that, but I was able to reduce the code size of some avr8 stuff I was working on by removing a whole bunch of instructions that zero out registers and then shift a value around. I don't it to literally shift the top byte 24 bits to the right and zero out the upper 24 bits, I just need it to pass the value in the top 8 bits direct to the next operation.
I agree that most people are not writing hand-tuned avr8 assembly. Most people aren't attempting to do DSP on 8-bit AVRs either.
also not all software need optimization to the bone
pareto principle like always, dont need the best but good enough
not every company is google level anyway
What I am curious about is, is the compiler smart enough to be lazy with computation and or variables? For example consider:
let a = expr let b = expr2
if (a || b) { return true; }
is the compiler allowed to lazily compute this if it is indeed faster to do that way? Or declaring a bunch of variables that may or may not be used in all of the branches. Is the compiler smart enough to only compute them whenever it is necessary? AFAIK this is now allowed in C-like languages. Things have to materialize. Another one is, I like to do memcpy every single time eventhough it might not even be used or overwritten by other memcpys. Is the compiler smart enough to not perform those and reorder my program so that only the last relevant memcpy is performed?
A lot of times, my code becomes ugly because I don't trust that it does any of this. I would like t write code in consistent and simple ways but I need compilers to be much smarter than it is today.
A bad example recently is something like
const S * s =;
let a = constant; let b = constant; let c = constant; let d = constant; let e = constant; let f = constant; let g = constant; let h = constant; let i = constant; let j = constant; let k = constant; let l = constant;
if (s->a == a && s->b == b /* etc */ ) { return true; }
It did not turn all of this into a SIMD mask or something like that.
> Is the compiler smart enough to only compute them whenever it is necessary?
This is known as "code sinking," and most optimizers are capable of doing this. Except keep in mind that a) the profitability of doing so is not always clear [1] and b) the compiler is a lot more fastidious about corner-case behavior than you are, so it might conclude that it's not in fact safe to sink the operation when you think it is safe to do so.
[1] If the operation to sink is x = y + z, you now may need to keep the values of y and z around longer to compute the addition, increasing register pressure and potentially hurting performance as a result.
I wonder if compilers do multiple passes on the intermediate code in order to optimize / simplify it. For example, during each pass the optimizer searches some known harcoded patterns and replaces them with something else and repeats until no possible improvement is found.
Also optimizers have a limit, they can't reason as abstractly as humans, for example:
I tried with both gcc and clang, the asm code for is_divisible_by_6 is still less optimal. So no, there are plenty of easy ways to fool the optimizer by obfuscation.The morale is that you still have to optimize algorithms (O notation) and math operations / expressions.
Those aren't isomorphic. The C spec says `is_divisible_by_6` short-circuits. You don't want the compiler optimising away null checks.
https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf
6.5.13, semantics
So you claim that the compiler "knows about this but doesn't optimize because of some safety measures"? As far as I remember, compilers don't optimize math expressions / brackets, probably because the order of operations might affect the precision of ints/floats, also because of complexity.
But my example is trivial (x % 2 == 0 && x % 3 == 0 is exactly the same as x % 6 == 0 for all C/C++ int), yet the compiler produced different outputs (the outputs are different and most likely is_divisible_by_6 is slower). Also what null (you mean 0?) checks are you talking about? The denominator is not null/0. Regardless, my point about not over relying on compiler optimization (especially for macro algorithms (O notation) and math expressions) remains valid.
x % 3 == 0 is an expression without side effects (the only cases that trap on a % operator are x % 0 and INT_MIN % -1), and thus the compiler is free to speculate the expression, allowing the comparison to be converted to (x % 2 == 0) & (x % 3 == 0).
Yes, compilers will tend to convert && and || to non-short-circuiting operations when able, so as to avoid control flow.
They do, and the order of the passes matter. Sometimes, optimizations are missed because they require a certain order of passes that is different from the one your compiler uses.
On higher optimization levels, many passes occur multiple times. However, as far as I know, compilers don't repeatedly run passes until they've reached an optimum. Instead, they run a fixed series of passes. I don't know why, maybe someone can chime in.
I don't know enough about ASM. Are u saying the first one is more optimal because it is faster or because it uses less instructions? Would this reflect a real world use case? Do any other compilers (e.g. V8) optimize modulo's into something else?
The compiler didn't recognize that x % 2 == 0 && x % 3 == 0 is exactly the same as x % 6 == 0 for all C/C++ int values. In theory a compiler could detect that and generate identical code for both functions, but it isn't done because this case is "niche" despite being trivial. My point is not to over rely on optimizer for math expressions and algorithms.
This post assumes C/C++ style business logic code.
Anything HPC will benefit from thinking about how things map onto hardware (or, in case of SQL, onto data structures).
I think way too few people use profilers. If your code is slow, profiling is the first tool you should reach for. Unfortunately, the state of profiling tools outside of NSight and Visual Studio (non-Code) is pretty disappointing.
I don’t disagree, but profiling also won’t help you with death by a thousand indirections.
You can fool the optimizer, but you have to work harder to do so:
becomes (with armv8-a clang 21.1.0 -O3) :Since I had to think about it:
It's easy to show that this algorithm is correct in the sense that, when b is returned, it must be equal to x+y. x+y summing to a constant is a loop invariant, and at termination x is 0 and y is b.It's a little more difficult to see that the loop will necessarily terminate.
New a values come from a bitwise & of x and y. New x values come from a left shift of a. This means that, if x ends in some number of zeroes, the next value of a will also end in at least that many zeroes, and the next value of x will end in an additional zero (because of the left shift). Eventually a will end in as many zeroes as there are bits in a, and the loop will terminate.
Even better / potentially more surprising:
optimizes to: (and this produces the same result when substituting any of the `add_vN`s from TFA)Optimizing some critical code section and beating the compiler gives you a rush. But last time for me was around 8 years ago. They got really good.
Only higher level stuff like data reordering and latency hiding are left. At least for me. And even some of that can be automated with profile-guided optimization.
Sometimes you can fool the compiler :-)
See "Example 2: Tricking the compiler" in my blog post about O3 sometimes being slower than O2: https://barish.me/blog/cpp-o3-slower/
Obvious caveat: pushing this a bit further it can quickly fallback to the default case. The optimizer is a superpower but you still need to try to write efficient code.
Results in: (armv8-a clang 21.1.0 with O3)If compiler folks can chime in, I'm curious why incrementing in a loop can be unrolled and inspected to optimize to an addition, but doubling the number when both operands are equal can't?
> I'm curious why incrementing in a loop can be unrolled and inspected to optimize to an addition, but doubling the number when both operands are equal can’t?
I expect because the former helps more in optimising real-world code than the latter. It’s not worth the LLVM developer's time to make the compiler better for programs that it won’t see in practice.
It’s not as if the compiler did nothing with that code, though. It replaced the multiplication by a left shift and removed the branch.
> If compiler folks can chime in, I'm curious why incrementing in a loop can be unrolled and inspected to optimize to an addition, but doubling the number when both operands are equal can't?
Compilers are essentially massive towers of heuristics for which patterns to apply for optimization. We don't throw a general SMT solver at your code because that takes way too long to compile; instead, we look at examples of actual code and make reasonable efforts to improve code.
In the case of the incrementing in a loop, there is a general analysis called Scalar Evolution that recasts expressions as an affine expression of canonical loop iteration variables (i.e., f(x), where x is 0 on the first loop iteration, 1 on the second, etc.). In the loop `while (x--) y++;`, the x variable [at the end of each loop iteration] can be rewritten as x = x₀ + -1*i, while the y variable is y = y₀ + 1*i. The loop trip count can be solved to an exact count, so we can replace the use of y outside the loop with y = y₀ + 1*trip count = y₀ + x, and then the loop itself is dead and can be deleted. These are all optimizations that happen to be quite useful in other contexts, so it's able to easily recognize this form of loop.
In the example you give, the compiler has to recognize the equivalence of two values conditional on control flow. The problem is that this problem really starts to run into the "the time needed to optimize this isn't worth the gain you get in the end." Note that there are a lot of cases where you have conditional joins (these are "phis" in SSA optimizer parlance), most of which aren't meaningfully simplifiable, so you're cutting off the analysis for all but the simplest cases. At a guess, the simplification is looking for all of the input values to be of the same form, but 2 * x (which will actually be canonicalized to x << 1) is not the same form as x + y, so it's not going to see if the condition being used to choose between the same values would be sufficient to make some operation return the same value. There are representations that make this problem much easier (egraphs), but these are not the dominant form for optimizers at present.
This sort of pattern can't be found by incremental lowering (and isn't common enough to have more sophisticated analysis written for it) so it ends up in a local maximum.
Basically the idea for most compilers is to do a series of transforms which incrementally improve the program (or at least make it worse in understood and reversible ways). To do this transform you need the optimizer to do the (not always trivial) proof that the 2*x is equivalent to x+y, do the replacement, do the gvn to duplicate the adds and finally do the branch elimination. Each of these steps is however totally separate from one another and the first one doesn't trigger since as far as it's concerned a shift left is faster than an add so why should it do the replacement.
This is all even more complicated since what representation is faster can depend on the target.
I agree, but GCC manages the optimization, and not all optimizations need to take fewer cycles. The single instruction version is obviously better for -Os and it would probably be a win in general.
I’m not a compiler expert, an assembly expert or an ARM expert, so this may be wildly wrong, but this looks optimized to me.
The trick is that it’s doing both the add and the left shift in parallel then selecting which to use based on a compare of the two values with csel.
(To see this, rather than reading the code sequentially, think of every instruction as being issued at the same time until you hit an instruction that needs a destination register from an earlier instruction)
The add is stored in W9 but only read if the two arguments are unequal.
If the compare succeeds and the lsl retires before the add, the add is never read, so nothing stalls waiting for it and the answer can be returned while the add is still in flight. The result of the add would then be quietly discarded assuming it ever started (maybe there’s some magic where it doesn’t even happen at all?).
It’s not clear to me that this is power efficient, or that on many real cpus there’s a latency difference to exploit between add and lsl, so it may not be faster than just unconditionally doing the addition.
That said, it is definitely faster than the code as it was written which if translated to asm verbatim stalls on the compare before executing either the add or the left shift.
> this looks optimized to me.
It's not. Why would lsl+csel or add+csel or cmp+csel ever be faster than a simple add? Or have higher throughput? Or require less energy? An integer addition is just about the lowest-latency operation you can do on mainstream CPUs, apart from register-renaming operations that never leave the front-end.
ARM is a big target, there could be cpus where lsl is 1 cycle and add is 2+.
Without knowing about specific compiler targets/settings this looks reasonable.
Dumb in the majority case? Absolutely, but smart on the lowest common denominator.
Interesting, even this can't fool the optimizer (tried with a recent gcc and clang):
With this one I instead wondered: If there are 4 functions doing exactly the same thing, couldn't the compiler also only generate the code for one of them?
E.g. if in `main` you called two different add functions, couldn't it optimize one of them away completely?
It probably shouldn't do that if you create a dynamic library that needs a symbol table but for an ELF binary it could, no? Why doesn't it do that?
If your language has monomorphization† (as C++ and Rust do) then it's really common to have this commonality in the emitted code and I believe it is common for compilers to detect and condense the resulting identical machine code. If the foo<T> function for an integer checks if it's equal to four, it well be that on your target hardware that's the same exact machine code whether the integer types T are 1 byte, 2 bytes or 4 bytes and whether they're signed or unsigned, so we should only emit one such implementation of foo, not six for u8, i8, u16, i16, u32 and i32.
† Monomorphization takes Parametrically Polymorphic functions, ie functions which are strongly typed but those types are parameters at compile time, and it emits distinct machine code for each needed variation of the function, so e.g. add(a, b) maybe gets compiled to produce add_integer(a, b) and add_float(a, b) and add_matrix(a, b) even though we only wrote one function, and then code which calls add(a, b) with matrices, is at compile time emitted as calling add_matrix(a, b), because the compiler knew it needs that version. In C++ the number of parameters is also potentially allowed to vary between callers so add_matrix(a, b, c, d) might exist too, this feature is not yet available in Rust.
It would but it's harder to trigger. Here, it's not safe because they're public functions and the standard would require `add_v1 != add_v2` (I think).
If you declare them as static, it eliminates the functions and the calls completely: https://aoco.compiler-explorer.com/z/soPqe7eYx
I'm sure it could also perform definition merging like you suggest but I can't think of a way of triggering it at the moment without also triggering their complete elision.
This is not quite what you asked, I think, but GCC is able to remove duplicate functions and variables after code generation via the -fipa-icf options:
> Perform Identical Code Folding for functions (-fipa-icf-functions), read-only variables (-fipa-icf-variables), or both (-fipa-icf). The optimization reduces code size and may disturb unwind stacks by replacing a function by an equivalent one with a different name. The optimization works more effectively with link-time optimization enabled.
In addition, the Gold linker supports a similar feature via `--icf={safe,all}`:
> Identical Code Folding. '--icf=safe' Folds ctors, dtors and functions whose pointers are definitely not taken
> It probably shouldn't do that if you create a dynamic library that needs a symbol table but for an ELF binary it could, no?
It can't do that because the program might load a dynamic library that depends on the function (it's perfectly OK for a `.so` to depend on a function from the main executable, for example).
That's one of the reasons why a very cheap optimization is to always use `static` for functions when you can. You're telling the compiler that the function doesn't need to be visible outside the current compilation unit, so the compiler is free to even inline it completely and never produce an actual callable function, if appropriate.
Sadly most C++ projects are organized in a way that hampers static functions. To achieve incremental builds, stuff is split into separate source files that are compiled and optimized separately, and only at the final step linked, which requires symbols of course.
I get it though, because carefully structuring your #includes to get a single translation unit is messy, and compile times get too long.
[[gnu::visibility(hidden)]] (or the equivalent for your compiler), might help.
That’s where link-time optimization enters the picture. It’s expensive but tolerable for production builds of small projects and feasible for mid-sized ones.
> It can't do that because the program might load a dynamic library that depends on the function
That makes perfect sense, thank you!
And I just realized why I was mistaken. I am using fasm with `format ELF64 executable` to create a ELF file. Looking at it with a hex editor, it has no sections or symbol table because it creates a completely stripped binary.
Learned something :)
The MSVC linker has a feature where it will merge byte-for-byte identical functions. It's most noticeable for default constructors, you might get hundreds of functions which all boil down to "zero the first 32 bytes of this type".
A quick google suggests it's called "identical comdat folding" https://devblogs.microsoft.com/oldnewthing/20161024-00/?p=94...
Nope. Function with external linkage are required to have different addresses. MSVC actually breaks this and this means that you can't reliably compare function pointers on MSVC because some different functions may happen to have same object code by chance:
Since, the pointers to go_forward and go_left will be the same, the gc_info table is less useless that it could be otherwise.But it could generate one then make the remaining three tail call to that one, or lay them out so that they are at 1byte-nop each to the next one and fallthrough the next until the last one implements the logic (This is a bit more compilcated on msvc as I believe the ABI requires a well defined prologue).
They can't be at 1byte-nop distance because pointer addresses as well as branch target addresses are expected to be aligned for performance reasons - often to 16 bytes. You need either a nop sequence or a jump/tailcall.
Sure, there are also probably pointer integrity landing pads. Make it larger nops then.
"The compiler" and "The optimizer" are doing a lot of the heavy lifting here in the argument. I definitely know compilers and optimizers which are not that great. Then again, they are not turning C++ code into ARM instructions.
You absolutely can fool a lot of compilers out there! And I am not only looking at you, NVCC.
But the point should be to follow the optimization cycle: develop, benchmark, evaluate, profile, analyze, optimize. Writing performant code is no joke and very often destroys readability and introduces subtle bugs, so before trying to oursmart the compiler, evaluate if what it produces is good enough already
Awesome blog post - thanks to this I found out that you can view what the LLVM optimizer pipeline does, and which pass is actually responsible for doing which instruction.
It's super cool to see this in practice, and for me it helps putting more trust in the compiler that it does the right thing, rather than me trying to micro-optimize my code and peppering inline qualifiers everywhere.
I liked the idea behind this post, but really the author fairly widely missed the mark in my opinion.
The extent to which you can "fool the optimizer" is highly dependent on the language and the code you're talking about. Python is a great example of a language that is devilishly hard to optimize for precisely because of the language semantics. C and C++ are entirely different examples with entirely different optimization issues, usually which have to do with pointers and references and what the compiler is allowed to infer.
The point? Don't just assume your compiler will magically make all your performance issues go away and produce optimal code. Maybe it will, maybe it won't.
As always, the main performance lessons should always be "1) Don't prematurely optimize", and "2) If you see perf issues, run profilers to try to definitively nail where the perf issue is".
I think the author is strictly talking about C and C++. Python is famously pessimal in all possible ways.
Wait, why does GAS use Intel syntax for ARM instead of AT&T? Or something that looks very much like it: the destination is the first operand, not the last, and there is no "%" prefix for the register names?
One undesirable property of optimizers is that in theory one day they produce good code and the next day they don't.
I want an AI optimization helper that recognizes patterns that could-almost be optimized if I gave it a little help, e.g. hints about usage, type, etc.
Today I learned that Matt Godbolt is British!
Better tell me how to make the compiler not fool me!
I'm curious what is the theoreme-proving magic behind add_v4 and if this is prior LLVM ir
Is this an argument for compiled code?
It's not really an argument for anything, it's just showing off how cool compilers are!