This is perhaps my favorite Stack Overflow answer of all time. I don't remember when I last saw such an approachable explanation of something so critical yet complicated.
> One of the canonical approaches, graph coloring, was first proposed in 1981.
This is about as far as my professor took this topic in class ~13 years ago. Nevertheless the slides that he used to illustrate how the graph coloring problem applied to register allocation stick with me to this day as one of the most elegant applications of CS I've ever seen (which are admittedly few as I'm not nearly as studied as I ought to be).
> Code generation is a surprisingly challenging and underappreciated aspect of compiler implementation, and quite a lot can happen under the hood even after a compiler’s IR optimization pipeline has finished. Register allocation is one of those things, and like many compilers topics, entire books could be written on it alone.
Our class final project targeted a register-based VM [1] for this exact reason. I also found writing MIPS assembly simpler than Y86 [2] because of the larger number of registers at my disposal (among the other obvious differences).
[1] https://www.lua.org/doc/jucs05.pdf
[2] https://esolangs.org/wiki/Y86
> Stack Overflow
This is on 'Programming Language Design and Implementation Stack Exchange'[0] -- I'm not pointing this out to tell you that you're wrong -- I think the various 'Stacks Exchange' often have better, more thoughtful answers than the average Stack Overflow question.
[0] really rolls off the tongue
For more theoretical and pure CS stuff, yes they do, since some point around 2012–2015 when SO got overrun by web developers and people grinding coding challenges. And of course there's the mushrooming number of specialist SE network sites [https://stackexchange.com/sites] like PLD, Theoretical CS, CodeReview, Computational Science, DBA, SWE, CrossValidated, etc., plus SO in several non-English languages [https://stackoverflow.com/help/non-english-questions].
Even the standards of behavior between different tags on SO itself vary greatly (in terms of how well-written a question is, whether it has an MCVE, whether the OP checked it wasn't a dupe and searched existing Q&A, etc.).
If you want to chronicle the descent, look at Meta posts about "give me teh codez"-type questions [https://meta.stackoverflow.com/search?q=%22give+me+teh+codez...].
Stack Exchanges are some of my favourite things to idly read.
Short, snappy, nice LaTeX and so on.
And then sometimes you have just sublimely creative people answering, e.g. Ron Maimon's answers on physics SE are a goldmine.
Most StackExchange websites can also be read offline, since Kiwix regularly archives them:
https://library.kiwix.org/#lang=eng&category=stack_exchange
This is actually also the main thing I have ChatGPT write for me e.g. I can have it dial the level of mathematics to exactly where I can be bothered to tolerate (sometimes life is too short for symbols)
Amen to that. It’s amazing (to me) the difference in cultures of the different exchanges. I see some of the most awesome answers and explanations on aviation SE. Whereas I rarely try to post (answer or question) to SO anymore, because it feels like navigating the DMV.
Unfortunately, just like with parser generators and more powerful parsing algorithms (in particular bottom-up LR and such), the practice proved very different than the theory. Linear-scan and variants of it have become the norm for register allocation.
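For reference, the core of the classic Poletto–Sarkar linear-scan scheme is small enough to sketch. This is a minimal illustration with hypothetical types and a made-up NUM_REGS, assuming live intervals have already been computed; real allocators add register classes, lifetime holes, spill-cost weights, and so on.

    #include <stdlib.h>

    #define NUM_REGS 4                        /* hypothetical register count */

    typedef struct { int start, end, reg; } Interval;  /* reg == -1 => spilled */

    static int by_start(const void *a, const void *b) {
        return ((const Interval *)a)->start - ((const Interval *)b)->start;
    }

    void linear_scan(Interval *iv, int n) {
        qsort(iv, n, sizeof *iv, by_start);   /* walk intervals in start order */
        Interval *active[NUM_REGS];
        int nactive = 0, free_reg[NUM_REGS];
        for (int r = 0; r < NUM_REGS; r++) free_reg[r] = 1;

        for (int i = 0; i < n; i++) {
            /* expire intervals that ended before this one starts */
            for (int j = 0; j < nactive; ) {
                if (active[j]->end < iv[i].start) {
                    free_reg[active[j]->reg] = 1;
                    active[j] = active[--nactive];
                } else j++;
            }
            if (nactive == NUM_REGS) {
                /* all registers busy: spill whichever interval ends last */
                int v = 0;
                for (int j = 1; j < nactive; j++)
                    if (active[j]->end > active[v]->end) v = j;
                if (active[v]->end > iv[i].end) {
                    iv[i].reg = active[v]->reg;   /* steal its register */
                    active[v]->reg = -1;          /* spill the old interval */
                    active[v] = &iv[i];
                } else {
                    iv[i].reg = -1;               /* spill the new interval */
                }
            } else {
                for (int r = 0; r < NUM_REGS; r++)
                    if (free_reg[r]) { iv[i].reg = r; free_reg[r] = 0; break; }
                active[nactive++] = &iv[i];
            }
        }
    }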
GCC uses a region-based allocator with graph coloring, based to some extent on Callahan/Koblenz's work I think: https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/ira.cc
Note that in the GCC context, LRA means Local Register Allocator, not linear scan: https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/lra.cc
(There was much more talk recently of GCC's LRA than IRA because completing the reload-to-LRA transition in the compiler threatened the removal of some targets still without reload support.)
I've had a lot of success using chordal graph allocators. They provide plenty of extra dimensions of 'relaxation' to tune them, they're incremental (so they allow pinning), and they decay nicely when their constraints are violated. Because of their incremental nature & "niceness" of decay, they can be forced into a nice hierarchical form ("middle out" on the loops). The main driving algorithm (maximum cardinality search) is a little harebrained; but, if you just relax and write up the code, you'll find it is surprisingly short & robust, and highly amenable to unit testing.
Spilling/filling is a bit exciting, since chordal coloring doesn't provide a lot of direction, but I've found that pressure heuristics fill in the gap nicely. The whole thing relies on having a robust interference graph — which more than kind of sucks — but, we don't get into compilers unless we've weaponized our bit-set data-structures in the first place.
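For anyone curious what maximum cardinality search actually amounts to, here's a rough sketch over a hypothetical adjacency-matrix interference graph (N chosen arbitrarily). On a chordal graph, greedy coloring in the MCS order uses the minimum number of colors; colors numbered beyond the physical register count become spill/split candidates.

    #define N 6   /* hypothetical number of virtual registers */

    /* adj[u][v] != 0 iff u and v interfere; color[] is the output */
    void mcs_color(const int adj[N][N], int color[N]) {
        int weight[N] = {0}, order[N], picked[N] = {0};

        /* maximum cardinality search: repeatedly take the unpicked vertex
           with the most already-picked neighbors */
        for (int i = 0; i < N; i++) {
            int v = -1;
            for (int u = 0; u < N; u++)
                if (!picked[u] && (v < 0 || weight[u] > weight[v])) v = u;
            picked[v] = 1;
            order[i] = v;
            for (int u = 0; u < N; u++)
                if (!picked[u] && adj[v][u]) weight[u]++;
        }

        /* greedy coloring in MCS order */
        for (int i = 0; i < N; i++) color[i] = -1;
        for (int i = 0; i < N; i++) {
            int v = order[i], used[N] = {0}, c = 0;
            for (int u = 0; u < N; u++)
                if (adj[v][u] && color[u] >= 0) used[color[u]] = 1;
            while (used[c]) c++;
            color[v] = c;   /* c >= k physical registers => spill candidate */
        }
    }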
Is this true though? Last time I worked on a compiler (admittedly quite a few years ago) Briggs was the bare minimum; our compiler in particular used an improvement over Callahan's hierarchical register allocation (the basic idea of which is that you should prioritize allocation in innermost loops, over a better "global" graph coloring, since spilling once in the inner loop costs way more than spilling several registers in the linear/setup part of the code).
I would expect that only compilers for immature languages (that don't care about optimization) use naive RA.
Or JIT compilers, where compilation time is an important factor -- e.g. V8 uses variations on linear scan.
At least GCC appears to use a graph coloring algorithm. LLVM seems to have moved from linear scan to a custom algorithm in version 3; I have no idea what they're using nowadays.
LLVM now uses something they call the "Greedy Register Allocator". As far as I can tell, it's a variation on the linear allocator with some heuristics. Here's a presentation: https://www.youtube.com/watch?v=hf8kD-eAaxg
Oh, it's from Alexis King. No wonder it's written so well.
Compilers are of course one of the purest applications of theoretical computer science.
Just that none of the parsing methods I learned are actually used commonly in most real compilers.
Applying pure computer science to real-world problems is of course its own field of study ;)
It doesn't surprise me much that this was written by the same author as "Parse, don't validate", very well written
[0]: https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-va...
[1]: all previous threads https://news.ycombinator.com/from?site=lexi-lambda.github.io
I didn't even check the author before reading this comment. She is a great writer and super talented.
The Digital Mars D compiler register allocator:
1. intermediate code is basic blocks connected with edges. Each block has a bit vector the size of which is the number of local variables. If a variable is referenced in a basic block, the corresponding bit is set.
2. basic blocks are sorted in depth first order
3. variables are sorted by "weight", which is incremented for each use, incremented by 10 for each use in a loop, by 100 for each use in a nested loop, etc.
4. code is generated with nothing assigned to registers, but registers used are marked in a bit vector, one per basic block
5. Now the register allocator allocates registers unused in a basic block to variables that are used in the basic block, in the order of the weights
6. Assigning variables to registers often means fewer registers are used for code generation, so more registers become available, so the process is done again until no more registers can be assigned
There are more nuances, such as variables passed to a function via registers, which introduces complications - should it stay in a register, or be moved into memory? But dealing with that is why I get paid the Big Bucks.
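Roughly, steps 4–6 amount to a fixed-point loop over bit vectors. A very rough sketch of that loop with hypothetical names and uint32_t bitsets (so at most 32 variables/registers here); regenerating code after each pass, which is what actually frees up registers, is not shown, and this is not the actual dmd implementation.

    #include <stdint.h>

    typedef struct {
        uint32_t vars_used;   /* bit i set if variable i is referenced here */
        uint32_t regs_used;   /* bit r set if codegen used register r here  */
    } Block;

    /* vars[] is pre-sorted by descending weight (1 per use, 10 in a loop,
       100 in a nested loop, ...); var_reg[v] gets a one-bit register mask */
    void assign_registers(Block *blk, int nblocks,
                          const int *vars, int nvars, uint32_t *var_reg) {
        int changed = 1;
        while (changed) {                     /* step 6: repeat until stable */
            changed = 0;
            for (int i = 0; i < nvars; i++) {
                int v = vars[i];
                if (var_reg[v]) continue;     /* already in a register */
                uint32_t avail = ~0u;         /* regs free in every block using v */
                for (int b = 0; b < nblocks; b++)
                    if (blk[b].vars_used & (1u << v)) avail &= ~blk[b].regs_used;
                if (avail) {
                    uint32_t reg = avail & -avail;   /* lowest free register */
                    var_reg[v] = reg;
                    for (int b = 0; b < nblocks; b++)
                        if (blk[b].vars_used & (1u << v)) blk[b].regs_used |= reg;
                    changed = 1;   /* codegen may now need fewer scratch regs */
                }
            }
        }
    }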
I always enjoy your explanations of how you've implemented features in D
I enjoy writing them!
I've heard of people trying out AI for register allocation. I wonder how that is going.
Most people find my code generator impenetrable, but actually it is trivially obvious to the most casual observer.
It's been an adventure crowbaring it into working for AArch64.
It's surprising that such a high-quality answer lacks a mention of the polynomial-time optimal register allocation algorithms in common use in current compilers. It's true that graph coloring is NP-complete, so you can't solve graph coloring in polynomial time, but graph coloring is register allocation with an additional constraint added: that you can never move a variable from one register to another.
Removing that constraint turns out to allow guaranteed polynomial time solutions, and that result is now widely used in practice.
Would you have references?
I don't know what to recommend. Piddling around with Google Scholar I find lots of promising-looking survey articles about polynomial-time optimal register allocation and chordal interference graphs and the like, but I don't know which ones to recommend. Pereira's dissertation in 02008 https://homepages.dcc.ufmg.br/~fernando/publications/papers/... was what I remember as the big breakthrough I was describing above:
> Even though this is an old problem, compilers still use heuristics to solve register allocation, or use exact algorithms that have exponential complexity. We present an optimal, polynomial time, register assignment algorithm.
Pereira and Palsberg presented a shorter version at PLDI that year; the paper is at https://llvm.org/pubs/2008-06-PLDI-PuzzleSolving.html.
But that may not be entirely fair; for example, Hack and Goos presented a linear-time algorithm for SSA depending on special properties of SSA interference graphs in 02005 in http://web.cs.ucla.edu/~palsberg/course/cs232/papers/HackGoo..., and Pereira also presented such an algorithm in the same year.
But that was 17 years ago. What's happened in those 17 years? I've heard that polynomial-time optimal register allocation algorithms have gone into common use, but I don't know which ones!
If you find good references, please let me know!
I'd be curious to see a "high-level structured assembly language" that gives the programmer actual control over things like register allocation, in a more systematic way than C's attempt. You might say "optimizers will do a better job", and you're right, but what I want to see is a language that isn't designed to be fed into an optimizing backend at all, but turned into machine code via simple, local transformations that a programmer can reliably predict. In other words, as high-level systems languages like C lean more and more heavily on optimizing backends and move away from being "portable assembly", I think that opens up a conceptual space somewhere below C yet still above assembly.
Jasmin is something like this. It is essentially a high-level assembler, will handle register allocation (but not spills) for you, and has some basic control flow primitives that map 1-to-1 to assembly instructions. There is also an optional formal verification component to prove some function is equivalent to its reference, is side-channel free, etc.
[1] https://github.com/jasmin-lang/jasmin/wiki
Thanks, this looks quite interesting.
I think you want a macro assembler.
You may think that’s too close to the hardware, but if you want “actual control over things like register allocation”, you will be writing your code for a specific register set size, types of (vector) registers, etc, so you’ll soon be targeting a specific CPU.
Also, was C ever "portable assembly"? There may have been a brief period where that was true, but it started as a language where programmers could fairly reliably predict what assembly the compiler would generate but that wasn't portable, and once it had become portable, users wanted optimizations, so predicting what code it would generate was a lot harder.
> You might say "optimizers will do a better job", and you're right
That's probably why nothing good has been created to fill that space yet (that I know of). Any serious project is just going to opt for compiler optimizations.
The problem is that there are a handful of domains where optimizations need to be actively fought against, like low-level cryptography primitives. You can't write these in C reliably, so you need to drop into assembly to deliberately inhibit the optimizer, but that doesn't mean that assembly is the ideal choice, only that it's the only choice.
I second this.
It's nice to wish for the optimizer to do the [almost] perfect job, but sometimes that never arrives. Consider for example the case of AMD GPU ISA (GCN) generated by LLVM: it's been so far from optimal for so long that one can lose hope it'll ever happen, and wish instead for a simple solution that works in the meantime.
Is the amdgcn GCC backend any better? (Or maybe final register allocation is not performed before llvm-mc is called—GCC reuses llvm-mc due to lack of binutils support.)
Sounds like Forth to me.
Can anyone with good knowledge explain why we aren't also explicitly managing the cache (with the help of the compiler)? Register allocation is basically the lowest form of cache, the one you can operate on with the most efficiency. Why, at the next level, do we rely on our hardware to do the guesswork for us?
First, you can manage the cache explicitly. You can load from memory bypassing the cache (if you know you won't need something again), you can prefetch into the cache ahead of time, and you can expunge lines from the cache. Those instructions are useful in high performance code (and in Spectre-like exploits :) ).
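For example, on x86 with GCC/Clang the explicit cache-control primitives look roughly like this (illustrative only; the hint constants and the right prefetch distance are workload-dependent, and the function name is made up):

    #include <immintrin.h>

    void copy_without_polluting_cache(int *dst, const int *src, int n) {
        for (int i = 0; i < n; i++) {
            if (i + 16 < n)                       /* prefetch a bit ahead */
                _mm_prefetch((const char *)&src[i + 16], _MM_HINT_T0);
            _mm_stream_si32(&dst[i], src[i]);     /* non-temporal store: bypass cache */
        }
        _mm_sfence();                             /* order the streaming stores */
        _mm_clflush(&src[0]);                     /* explicitly evict a line we won't reuse */
    }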
Second, it's harder than it looks. Even a simpler problem, branch speculation, is notoriously difficult to get right without profiling information from program execution. For example the Linux kernel has the likely() and unlikely() macros which can inject branch predictor hints into the assembly. You use them like "if(likely(...))". Problem? Those hints, carefully inserted by serious systems engineers, were often terrible! So much so that removing all of them increased the performance of the kernel.
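Those macros are just thin wrappers over a compiler builtin; they mostly influence block layout and the compiler's static prediction rather than the hardware predictor. A sketch, assuming GCC/Clang:

    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    int handle(int err) {
        if (unlikely(err)) {
            /* cold path: the compiler moves this out of the hot fall-through */
            return -1;
        }
        return 0;   /* hot path */
    }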
I think that you pretty much have to provide profiling information to think about having the compiler manage the cache. It's doable, but it's a lot of work - your profiles better be highly representative.
This is part of why Java performance is far better than it should be, if we just consider the overhead of the JVM and the garbage collection. You get profiles for free, and that covers many sins. This is how we sometimes end up with companies that really care about latency using the JVM anyway.
>Those hints, carefully inserted by serious systems engineers, were often terrible! So much so that removing all of them increased the performance of the kernel.
So glad I'm not alone; every time I tried to optimize hot loops with these hints, it either barely improved or flat-out regressed by a larger margin.
Register allocation is a form of value analysis, while managing caches like you suggest is a form of pointer analysis, which is 100 times harder. One huge problem is aliasing. It is in general very difficult to tell whether X[i] and Y[j] are distinct memory locations. So the compiler has to err on the side of caution and assume that X[i] overwrites Y[j] and vice versa. The C99 "restrict" keyword and modern improvements in language semantics (like forbidding pointer arithmetic) help, but don't completely eliminate aliasing. You also have pointer indirection, e.g., X[Y[Z[j]]], making the analysis even harder.
In other words, state-of-the-art compilers can beat 99.9% of all developers on register allocation, but they can't beat a developer who manages cache memory explicitly.
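A tiny illustration of why aliasing hurts (hypothetical functions): without restrict the compiler must assume a store to x[i] might overwrite *y, so *y gets reloaded from memory every iteration; with restrict it can sit in a register for the whole loop.

    void scale(float *x, const float *y, int n) {
        for (int i = 0; i < n; i++)
            x[i] *= *y;        /* *y reloaded each time: x[i] may alias *y */
    }

    void scale_restrict(float *restrict x, const float *restrict y, int n) {
        for (int i = 0; i < n; i++)
            x[i] *= *y;        /* *y can now be kept in a register */
    }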
The Game Boy Advance had a CPU without cache, but with 32 KB of fast RAM on the chip. That was pretty close to a fully manual cache, and it turned out to be a complete waste in practice.
It was impractical to share/reuse it like a real cache, because that would require writing some sort of memory allocator or a hash table — in software. Managing that was a hassle, and an overhead in itself. Only a few games (mostly 3D) took advantage of it, and simply by copying their hottest loop in there.
Aren't registers fixed by x86_64, while cache is a CPU-hardware specific thing (e.g. newer CPUs have more cache than older ones, but the register count is fixed: 8 on x86 and 16 on x86_64)?
So I think the compiler can work with registers at compile time but can't work with an unknown structure of cache.
In addition to the other good answers, if the amount of state that's explicitly managed by software gets too large then it gets really expensive to save or restore that state. This happens, for example, when a syscall transfers control to the operating system. If the (many-MB) cache were software-managed, the OS would need to decide between flushing it all to main memory (expensive for quick syscalls) or leaving it in place and having OS code and data be uncached. Function calls between libraries have similar problems: how is a called function supposed to know which cache space is available for it to use? If you call the same function multiple times, who's responsible for keeping its working data cached?

For a 32MB L3 cache, flushing the entire cache to memory (as would be required when switching between processes) could take over a millisecond, let alone trying to manage caches shared by multiple cores.
Cache usage won't be known statically at compile-time. If you retrieve one item from a database, it will result in a different usage of the cache than if you retrieved a different item.
Cache was designed to be program-transparent so hardware makers could add it and get better benchmarks without recompiling. Since then, it's stayed that way because it's easier for the hardware maker to change and improve it without having to change the ISA
I$ pressure from bloating machine code with such high-volume information. Though cf. prefetch and clflush instructions, as well as the concept(s) of non-temporal and write-combining stores.
I find it a little amusing that the given example function does a computation which could've easily been simplified into a single instruction on x86, which needs only a single register (assuming the input x and the return value are both in eax):

    lea eax, [eax+eax*4+7]

...and it's likely that a compiler would do such simplifications even before attempting register allocation, since it can only make the latter easier.

This is particularly likely to be necessary if the desired instruction is already using indirect addressing, and both register allocation and instruction selection must take those constraints into account.
As a long-time Asm programmer, I believe that instruction selection and register allocation are inseparable and really need to be considered at the same time; attempting to separate them, like what most if not all compilers do, results in (sometimes very) suboptimal results which is easily noticeable in compiler-generated vs human-generated code.
You're not wrong - in that ALL optimizations are inter-dependent. But, for reasons that I think are rather obvious, it's really hard to do them optimally, all at once. So typically you'd do things like "register-pressure-aware" selection & scheduling, repeated optimization steps (like, do a register-pressure-aware first scheduling, RA, then schedule again) etc. It's all just heuristics, in the end.
> since it can only make the latter easier.
Not necessarily. For a simple case, yes; but in general, committing too early to reuse a virtual register may have very bad performance implications, since it limits code mobility, freedom to select a particular register, and may increase register pressure in critical sections of the code (say you have "A conflicts with B, B with C and C with D"; you can put all of these in 2 registers: A=R0, B=R1, C=R0, D=R1. But if you committed too early to unify the lifetimes of A and D - e.g. for instruction selection purposes - now you need 3 registers. Which is fine, and likely the best solution if you actually end up having 3 registers available - but very likely sub-optimal if you only had 2).
> attempting to separate them, like what most if not all compilers do, results in (sometimes very) suboptimal results
Not separating them would have a big disadvantage: all register allocation decisions need to be strictly local, because information about upcoming instructions and their register constraints is not available. Even simple graph coloring algorithms give much better code than algorithms with local decisions only.
In our baseline/unoptimized compiler, we do ISel+RegAlloc(+Encoding) combined in a single step and we get lots of easily avoidable moves and spills. (These typically don't hurt performance that much on modern out-of-order CPUs with store forwarding, but substantially increase the code size.)
Isn't that one of the "slow lea" cases? https://reviews.llvm.org/D32277 (I can't quite parse whether using EBP/RBP/R13 is necessary to hit the slow case, or if it's any of those bullet points that cause slowness.)
rbp/r13 are there because the funky way x86 encodes ModR/M means that they get a "+ 0" forced onto them, i.e. [rbp + rax*8] actually must be [rbp + rax*8 + 0] and as such behaves like a 3-operand lea. (same for [rbp] → [rbp+0] but that's irrelevant here as that's still two-operand)
Interestingly clang does not lower down to that (instead it multiplies by 7 in a separate step) while GCC does.
It actually depends on what flags you pass to clang, and for a good reason. 3 term lea uses "complex decoding" and thus has higher latency (and less possible execution ports) on intel arches before icelake. If you run clang -O2 -mtune=icelake-client or -mtune=znver3 (or later architectures) it will generate the single lea instruction.
As always in optimization choices it comes down to cost modelling and trade-offs.
Interesting. I guess GCC’s cost modeling is different, then? Or does it default to a newer machine?
A compiler would probably emit shift-and-add by constants. Which may very well be faster than a lea.
Yeah, but x86 is perverse like that :^D
Perhaps this is a layman's understanding, but afaik all CPU operations are only done on registers, so any variable has to be moved to registers when it's operated on (or are some operations possible on memory without going to registers?), and so the answer to "which variables" would be "all of them".
Or is the question about which variables are kept in registers for a bit longer time even while they're not actively being computed on right now?
> Perhaps this is layman's understanding, but afaik all CPU operations are only done on registers, so any variable has to be moved to registers when it's operated on
From the CPU point of view, yes, but many instruction set architectures (including the very common x86 family) have instructions in which one of the operands can be either a register or a memory location. When it's a memory location, the CPU will load it into an internal register, do the operation there, and write it back to memory if it's the destination operand; but from the point of view of the compiler, the CPU is operating directly on memory.
> (or are some operations possible on memory without going to registers?)
Some atomic memory operations (like atomic compare and exchange) can be thought of as being done directly on memory (or on the cache which sits above the memory). Some CPUs might even implement it that way (by having an "atomic memory operation" command between the CPU and the cache, instead of doing the operation on the CPU).
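In C11 terms, something like the following is about as close as the compiler gets to "operate on memory": an atomic read-modify-write on the memory operand, with no user-visible register involved (a sketch, hypothetical function name).

    #include <stdatomic.h>

    void bump(_Atomic int *counter) {
        /* typically compiles to a single `lock add dword ptr [rdi], 1` on
           x86-64 when the old value isn't needed */
        atomic_fetch_add_explicit(counter, 1, memory_order_relaxed);
    }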
It's basically a caching problem. Unloading variables, saving them, and loading another one because it is needed for a computation takes time. So you want to minimize that, just like with a cache.
Now a compiler has an advantage over a pure cache in that it actually gets to read the code, and so knows the exact number of variables and when they get used! A pure cache, by contrast, has to be an oracle :) So a compiler can try to optimize the cache placements (register assignments) ahead of time by analyzing the code. That's where these algorithms come into play. So, knowing nothing, we can safely guess that all compiler algorithms must be beating something simple like an LRU.
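That's essentially the register-allocation version of Belady's observation: since the compiler can see future uses, it can evict/spill the value whose next use is furthest away, which a hardware LRU cache can only guess at. A toy comparison of the two eviction rules (hypothetical arrays of next-use/last-use positions):

    #include <limits.h>

    #define K 2   /* pretend we have K registers */

    /* in_regs[i] = which value currently occupies slot i;
       next_use[v] = position of v's next use (INT_MAX if never used again) */
    int victim_compiler(const int *in_regs, const int *next_use) {
        int v = 0;
        for (int i = 1; i < K; i++)
            if (next_use[in_regs[i]] > next_use[in_regs[v]]) v = i;
        return v;    /* spill the value needed furthest in the future */
    }

    /* last_use[v] = position at which v was most recently touched */
    int victim_lru(const int *in_regs, const int *last_use) {
        int v = 0;
        for (int i = 1; i < K; i++)
            if (last_use[in_regs[i]] < last_use[in_regs[v]]) v = i;
        return v;    /* evict the least recently used value */
    }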
It extremely depends on the target CPU. Some CPUs do have instructions that operate on RAM directly, no register required.
However, the tradeoff is that RAM is almost always 2-4 times slower than writing to registers. Most of the time it's more efficient to copy from RAM to a register and back.
Most CPUs are designed to operate on registers, but it's not too uncommon to see simple things like incrementing a variable in RAM without an intermediate copy. There might be more, I don't know a lot about x86
That's not 100% correct but it's correct enough to give sufficient intuition to understand the problem.
> all CPU operations are only done on registers
Not correct for the X86_64, such as:
https://www.felixcloutier.com/x86/inc

Even the 6502 from 1975 has it (and a few more, like shift/rotate instructions).
Registers vs x87 stack with free swap a.k.a. rotating stack. Sub-optimal edge cases for both exist:
[1] http://cr.yp.to/qhasm/20050210-fxch.txt
[2] https://pvk.ca/Blog/2014/03/15/sbcl-the-ultimate-assembly-co...
Reading the top voted answer made me feel like I was reading one of Jon "Hannibal" Stokes' CPU praxis write-ups from the early years of Ars. Anyone else remember those?