alexjplant 3 days ago

This is perhaps my favorite Stack Overflow answer of all time. I don't remember when I last saw such an approachable explanation of something so critical yet complicated.

> One of the canonical approaches, graph coloring, was first proposed in 1981.

This is about as far as my professor took this topic in class ~13 years ago. Nevertheless the slides that he used to illustrate how the graph coloring problem applied to register allocation stick with me to this day as one of the most elegant applications of CS I've ever seen (which are admittedly few as I'm not nearly as studied as I ought to be).

> Code generation is a surprisingly challenging and underappreciated aspect of compiler implementation, and quite a lot can happen under the hood even after a compiler’s IR optimization pipeline has finished. Register allocation is one of those things, and like many compilers topics, entire books could be written on it alone.

Our class final project targeted a register-based VM [1] for this exact reason. I also found writing MIPS assembly simpler than Y86 [2] because of the larger number of registers at my disposal (among the other obvious differences).

[1] https://www.lua.org/doc/jucs05.pdf

[2] https://esolangs.org/wiki/Y86

  • qwery 3 days ago

    > Stack Overflow

    This is on 'Programming Language Design and Implementation Stack Exchange'[0] -- I'm not pointing this out to tell you that you're wrong -- I think the various 'Stacks Exchange' often have better, more thoughtful answers than the average Stack Overflow question.

    [0] really rolls off the tongue

    • smcin 3 days ago

      For more theoretical and pure CS stuff, yes they do, since some point between ~2012 and 2015 when SO got overrun by web developers and people grinding coding challenges. And of course there's the mushrooming number of specialist SE network sites [https://stackexchange.com/sites] like PLD, Theoretical CS, CodeReview, Computational Science, DBA, SWE, CrossValidated etc., plus SO in several non-English languages [https://stackoverflow.com/help/non-english-questions].

      Even the standards of behavior between different tags on SO itself vary greatly (in terms of how well-written a question is, whether it has an MCVE, whether the OP checked that it wasn't a dupe and searched existing Q&A, etc.).

      If you want to chronicle the descent, look at Meta posts about "give me teh codez"-type questions [https://meta.stackoverflow.com/search?q=%22give+me+teh+codez...].

    • mhh__ 3 days ago

      Stackexchanges are some of my favourite things to idly read.

      Short, snappy, nice LaTeX and so on.

      And then sometimes you have just sublimely creative people answering; e.g. Ron Maimon's answers on physics SE are a goldmine.

      • mhh__ 3 days ago

        This is actually also the main thing I have ChatGPT write for me, e.g. I can have it dial the level of mathematics to exactly what I can be bothered to tolerate (sometimes life is too short for symbols).

    • travisgriggs 3 days ago

      Amen to that. It’s amazing (to me) the difference in cultures of the different exchanges. I see some of the most awesome answers and explanations on aviation SE. Whereas I rarely try to post (answer or question) to SO anymore, because it feels like navigating the DMV.

  • userbinator 3 days ago

    Unfortunately, just like with parser generators and more powerful parsing algorithms (in particular bottom-up LR and such), the practice proved very different than the theory. Linear-scan and variants of it have become the norm for register allocation.

    • fweimer 3 days ago

      GCC uses a region-based allocator with graph coloring, based to some extent on Callahan/Koblenz's work I think: https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/ira.cc

      Note that in the GCC context, LRA means Local Register Allocator, not linear scan: https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/lra.cc

      (There was much more talk recently of GCC's LRA than IRA because completing the reload-to-LRA transition in the compiler threatened the removal of some targets still without reload support.)

      • thechao 3 days ago

        I've had a lot of success using chordal graph allocators. They provide plenty of extra dimensions of 'relaxation' to tune them, they're incremental (so they allow pinning), and they decay nicely when their constraints are violated. Because of their incremental nature & "niceness" of decay, they can be forced into a nice hierarchical form ("middle out" on the loops). The main driving algorithm (maximum cardinality search) is a little harebrained; but, if you just relax and write up the code, you'll find it is surprisingly short & robust, and highly amenable to unit testing.

        Spilling/filling is a bit exciting, since chordal coloring doesn't provide a lot of direction, but I've found that pressure heuristics fill in the gap nicely. The whole thing relies on having a robust interference graph — which more than kind of sucks — but, we don't get into compilers unless we've weaponized our bit-set data-structures in the first place.

    • virgilp 3 days ago

      Is this true though? Last time I worked on a compiler (admittedly quite a few years ago) Briggs was the bare minimum; our compiler in particular used an improvement over Callahan's hierarchical register allocation (the basic idea of which is that you should prioritize allocation in innermost loops over a better "global" graph coloring, since spilling once in the inner loop costs way more than spilling several registers in the linear/setup part of the code).

      I would expect that only compilers for immature languages (that don't care about optimization) use naive RA.

      • Leszek 3 days ago

        Or JIT compilers, where compilation time is an important factor -- e.g. V8 uses variations on linear scan.

    • Sharlin 3 days ago

      At least GCC appears to use a graph coloring algorithm. LLVM seems to have moved from linear scan to a custom algorithm in version 3; I have no idea what they're using nowadays.

  • chowells 3 days ago

    Oh, it's from Alexis King. No wonder it's written so well.

  • saagarjha 3 days ago

    Compilers are of course one of the purest applications of theoretical computer science.

    • stkdump 3 days ago

      Just that none of the parsing methods I learned are actually used commonly in most real compilers.

      • saagarjha 21 hours ago

        Applying pure computer science to real-world problems is of course its own field of study ;)

WalterBright 3 days ago

The Digital Mars D compiler register allocator:

1. intermediate code is basic blocks connected with edges. Each block has a bit vector the size of which is the number of local variables. If a variable is referenced in a basic block, the corresponding bit is set.

2. basic blocks are sorted in depth first order

3. variables are sorted by "weight", which is incremented for each use, incremented by 10 for each use in a loop, by 100 for each use in a nested loop, etc.

4. code is generated with nothing assigned to registers, but registers used are marked in a bit vector, one per basic block

5. Now the register allocator allocates registers unused in a basic block to variables that are used in the basic block, in the order of the weights

6. Assigning variables to registers often means fewer registers are used for code generation, so more registers become available, so the process is repeated until no more registers can be assigned

There are more nuances, such as variables passed to a function via registers, which introduces complications - should it stay in a register, or be moved into memory? But dealing with that is why I get paid the Big Bucks.

  • anitil 2 days ago

    I always enjoy your explanations of how you've implemented features in D

    • WalterBright 2 days ago

      I enjoy writing them!

      I've heard of people trying out AI for register allocation. I wonder how that is going.

      Most people find my code generator impenetrable, but actually it is trivially obvious to the most casual observer.

      It's been an adventure crowbaring it into working for AArch64.

kragen 2 days ago

It's surprising that such a high-quality answer lacks a mention of the polynomial-time optimal register allocation algorithms in common use in current compilers. It's true that graph coloring is NP-complete, so it can't be solved in polynomial time, but graph coloring is register allocation with an additional constraint added: that you can never move a variable from one register to another.

Removing that constraint turns out to allow guaranteed polynomial time solutions, and that result is now widely used in practice.

  • Agingcoder a day ago

    Would you have references?

    • kragen a day ago

      I don't know what to recommend. Piddling around with Google Scholar I find lots of promising-looking survey articles about polynomial-time optimal register allocation and chordal interference graphs and the like, but I don't know which ones to recommend. Pereira's dissertation in 02008 https://homepages.dcc.ufmg.br/~fernando/publications/papers/... was what I remember as the big breakthrough I was describing above:

      > Even though this is an old problem, compilers still use heuristics to solve register allocation, or use exact algorithms that have exponential complexity. We present an optimal, polynomial time, register assignment algorithm.

      Pereira and Palsberg presented a shorter version at PLDI that year; the paper is at https://llvm.org/pubs/2008-06-PLDI-PuzzleSolving.html.

      But that may not be entirely fair; for example, Hack and Goos presented a linear-time algorithm for SSA depending on special properties of SSA interference graphs in 02005 in http://web.cs.ucla.edu/~palsberg/course/cs232/papers/HackGoo..., and Pereira also presented such an algorithm in the same year.

      But that was 17 years ago. What's happened in those 17 years? I've heard that polynomial-time optimal register allocation algorithms have gone into common use, but I don't know which ones!

      If you find good references, please let me know!

kibwen 3 days ago

I'd be curious to see a "high-level structured assembly language" that gives the programmer actual control over things like register allocation, in a more systematic way than C's attempt. You might say "optimizers will do a better job", and you're right, but what I want to see is a language that isn't designed to be fed into an optimizing backend at all, but turned into machine code via simple, local transformations that a programmer can reliably predict. In other words, as high-level systems languages like C lean more and more heavily on optimizing backends and move away from being "portable assembly", I think that opens up a conceptual space somewhere below C yet still above assembly.

  • pbsd 3 days ago

    Jasmin is something like this. It is essentially a high-level assembler: it will handle register allocation (but not spills) for you, and it has some basic control flow primitives that map 1-to-1 to assembly instructions. There is also an optional formal verification component to prove some function is equivalent to its reference, is side-channel free, etc.

    [1] https://github.com/jasmin-lang/jasmin/wiki

    • kibwen 3 days ago

      Thanks, this looks quite interesting.

  • Someone 3 days ago

    I think you want a macro assembler.

    You may think that’s too close to the hardware, but if you want “actual control over things like register allocation”, you will be writing your code for a specific register set size, types of (vector) registers, etc, so you’ll soon be targeting a specific CPU.

    Also, was C ever "portable assembly"? There may have been a brief period where that was true, but it started as a language where programmers could fairly reliably predict what assembly the compiler would generate without being portable; once it had become portable, users wanted optimizations, so predicting what code it would generate became a lot harder.

  • seanw444 3 days ago

    > You might say "optimizers will do a better job", and you're right

    That's probably why nothing good has been created to fill that space yet (that I know of). Any serious project is just going to opt for compiler optimizations.

    • kibwen 3 days ago

      The problem is that there are a handful of domains where optimizations need to be actively fought against, like low-level cryptography primitives. You can't write these in C reliably, so you need to drop into assembly to deliberately inhibit the optimizer, but that doesn't mean assembly is the ideal choice, only that it's the only choice.

  • mpreda 3 days ago

    I second this.

    It's nice to wish for the optimizer to do the [almost] perfect job, but sometimes that never arrives. Consider for example the case of AMD GPU ISA (GCN) generated by LLVM: it's been so far from optimal for so long, that one can lose hope that'll ever happen; and wish for a simple solution that works in the meantime.

    • fweimer 3 days ago

      Is the amdgcn GCC backend any better? (Or maybe final register allocation is not performed before llvm-mc is called—GCC reuses llvm-mc due to lack of binutils support.)

  • thuanao 3 days ago

    Sounds like Forth to me.

artemonster 3 days ago

Can anyone with good knowledge explain why we aren't also explicitly (with the help of the compiler) managing the cache? Register allocation is basically the lowest level of cache, the one you can operate on most efficiently. Why, at the next level up, do we rely on our hardware to do the guesswork for us?

  • H8crilA 3 days ago

    First, you can manage the cache explicitly. You can load from memory bypassing the cache (if you know you won't need something again), you can prefetch into the cache ahead of time, and you can expunge lines from the cache. Those instructions are useful in high performance code (and in Spectre-like exploits :) ).

    Second, it's harder than it looks. Even a simpler problem, branch speculation, is notoriously difficult to get right without profiling information from program execution. For example the Linux kernel has the likely() and unlikely() macros which can inject branch predictor hints into the assembly. You use them like "if(likely(...))". Problem? Those hints, carefully inserted by serious systems engineers, were often terrible! So much so that removing all of them increased the performance of the kernel.

    I think that you pretty much have to provide profiling information to think about having the compiler manage the cache. It's doable, but it's a lot of work - your profiles better be highly representative.

    • hibikir 3 days ago

      This is part of why Java performance is far better than it should be, if we just consider the overhead of the JVM and the garbage collection. You get profiles for free, and that covers many sins. This is how we sometimes end up with companies that really care about latency using the JVM anyway.

    • fuhsnn 3 days ago

      >Those hints, carefully inserted by serious systems engineers, were often terrible! So much so that removing all of them increased the performance of the kernel.

      So glad I'm not alone, every time I tried to optimize hot loops with these hints, it's either barely improved or flat out regressed by a larger margin.

  • bjourne 3 days ago

    Register allocation is a form of value analysis, while managing caches like you suggest is a form of pointer analysis which is 100 times harder. One huge problem is aliasing. It is in general very difficult to tell whether X[i] and Y[j] are distinct memory locations. So the compiler has to err on the side of caution and assume that X[i] overwrites Y[j] and vice versa. The C99 "restrict" keyword and modern improvements in language semantics help (like forbidding pointer arithmetic), but doesn't completely eliminate aliasing. You also have pointer indirection, e.g., X[Y[Z[j]]], making the analysis even harder.

    In other words, state-of-the-art compilers can beat 99.9% of all developers on register allocation, but they can't beat a developer who manages cache memory explicitly.

  • pornel 2 days ago

    The Game Boy Advance had a CPU without cache, but with 32 KB of fast RAM on the chip. That was pretty close to a fully manual cache, and turned out to be a complete waste in practice.

    It was impractical to share/reuse it like a real cache, because that would require writing some sort of memory allocator or a hash table — in software. Managing that was a hassle, and an overhead in itself. Only a few games (mostly 3D) took advantage of it, and simply by copying their hottest loop in there.

  • zbendefy 3 days ago

    Aren't registers fixed by x86-64, while cache is CPU-specific (e.g. newer CPUs have more cache than older ones, but register count is fixed: 8 general-purpose registers on x86 and 16 on x86-64)?

    So I think the compiler can work with registers at compile time but can't work with an unknown structure of cache.

  • HALtheWise 2 days ago

    In addition to the other good answers, if the amount of state that's explicitly managed by software gets too large then it gets really expensive to save or restore that state. This happens, for example, when a syscall transfers control to the operating system. If the (many-MB) cache were software-managed, the OS would need to decide between flushing it all to main memory (expensive for quick syscalls) or leaving it in place and having OS code and data be uncached. Function calls between libraries have similar problems, how is a called function supposed to know which cache space is available for it to use? If you call the same function multiple times who's responsible for keeping its working data cached? For a 32MB L3 cache, flushing the entire cache to memory (as would be required when switching between processes) could take over a millisecond, let alone trying to manage caches shared by multiple cores.

  • mrkeen 3 days ago

    Cache usage won't be known statically at compile-time. If you retrieve one item from a database, it will result in a different usage of the cache than if you retrieved a different item.

  • ranger207 3 days ago

    Cache was designed to be program-transparent so hardware makers could add it and get better benchmarks without recompiling. Since then, it's stayed that way because it's easier for the hardware maker to change and improve it without having to change the ISA

  • namibj 3 days ago

    I$ pressure from bloating machine code with such high-volume information. Though cf. prefetch and clflush instructions, as well as the concept(s) of non-temporal and write-combining stores.

userbinator 3 days ago

I find it a little amusing that the given example function does a computation which could've easily been simplified into a single instruction on x86, which needs only a single register (assuming the input x and the return value are both in eax):

    lea eax, [eax+eax*4+7]
...and it's likely that a compiler would do such simplifications even before attempting register allocation, since it can only make the latter easier.

This is particularly likely to be necessary if the desired instruction is already using indirect addressing, and both register allocation and instruction selection must take those constraints into account.

As a long-time Asm programmer, I believe that instruction selection and register allocation are inseparable and really need to be considered at the same time; attempting to separate them, like what most if not all compilers do, results in (sometimes very) suboptimal results which is easily noticeable in compiler-generated vs human-generated code.

  • virgilp 3 days ago

    You're not wrong - in that ALL optimizations are inter-dependent. But, for reasons that I think are rather obvious, it's really hard to do them optimally, all at once. So typically you'd do things like "register-pressure-aware" selection & scheduling, repeated optimization steps (like, do a register-pressure-aware first scheduling, RA, then schedule again) etc. It's all just heuristics, in the end.

    > since it can only make the latter easier.

    Not necessarily. For a simple case, yes; but in general, committing too early to reuse a virtual register may have very bad performance implications, since it limits code mobility and the freedom to select a particular register, and may increase register pressure in critical sections of the code (say you have "A conflicts with B, B with C and C with D"; you can put all of these in 2 registers: A=R0, B=R1, C=R0, D=R1. But if you committed too early to unify the lifetimes of A and D - e.g. for instruction selection purposes - now you need 3 registers. Which is fine, and likely the best solution if you actually end up having 3 registers available - but very likely sub-optimal if you only had 2).

  • aengelke 3 days ago

    > attempting to separate them, like what most if not all compilers do, results in (sometimes very) suboptimal results

    Not separating them would have a big disadvantage: all register allocation decisions need to be strictly local, because information about upcoming instructions and their register constraints is not available. Even simple graph coloring algorithms give much better code than algorithms with local decisions only.

    In our baseline/unoptimized compiler, we do ISel+RegAlloc(+Encoding) combined in a single step and we get lots of easily avoidable moves and spills. (These typically don't hurt performance that much on modern out-of-order CPUs with store forwarding, but substantially increase the code size.)

  • rwmj 3 days ago

    Isn't that one of the "slow lea" cases? https://reviews.llvm.org/D32277 (I can't quite parse whether using EBP/RBP/R13 is necessary to hit the slow case, or if it's any of those bullet points that cause slowness.)

    • dzaima 3 days ago

      rbp/r13 are there because the funky way x86 encodes ModR/M means that they get a "+ 0" forced onto them, i.e. [rbp + rax*8] actually must be [rbp + rax*8 + 0] and as such behaves like a 3-operand lea. (same for [rbp] → [rbp+0] but that's irrelevant here as that's still two-operand)

  • saagarjha 3 days ago

    Interestingly clang does not lower down to that (instead it multiplies by 7 in a separate step) while GCC does.

    • thebolt00 3 days ago

      It actually depends on what flags you pass to clang, and for a good reason. 3 term lea uses "complex decoding" and thus has higher latency (and less possible execution ports) on intel arches before icelake. If you run clang -O2 -mtune=icelake-client or -mtune=znver3 (or later architectures) it will generate the single lea instruction.

      As always in optimization choices it comes down to cost modelling and trade-offs.

      • saagarjha 21 hours ago

        Interesting. I guess GCC’s cost modeling is different, then? Or does it default to a newer machine?

  • bjourne 3 days ago

    A compiler would probably emit shift-and-add by constants. Which may very well be faster than a lea.

  • Sharlin 3 days ago

    Yeah, but x86 is perverse like that :^D

Aardwolf 3 days ago

Perhaps this is layman's understanding, but afaik all CPU operations are only done on registers, so any variable has to be moved to registers when it's operated on (or are some operations possible on memory without going to registers?) and so the answer to "which variables" would be "all of them".

Or is the question about which variables are kept in registers for a bit longer time even while they're not actively being computed on right now?

  • cesarb 3 days ago

    > Perhaps this is layman's understanding, but afaik all CPU operations are only done on registers, so any variable has to be moved to registers when it's operated on

    From the CPU point of view, yes, but many instructions set architectures (including the very common x86 family) have instructions in which one of the operands can be either a register or a memory location. When it's a memory location, the CPU will load it into an internal register, do the operation there, and write it back to memory if it's the destination operand; but from the point of view of the compiler, the CPU is operating directly on memory.

    > (or are some operations possible on memory without going to registers?)

    Some atomic memory operations (like atomic compare and exchange) can be thought of as being done directly on memory (or on the cache which sits above the memory). Some CPUs might even implement it that way (by having an "atomic memory operation" command between the CPU and the cache, instead of doing the operation on the CPU).

  • yubblegum 3 days ago

    It's basically a caching problem. Spilling variables to memory and loading others because they're needed for a computation takes time, so you want to minimize that, just like with a cache.

    Now a compiler has an advantage over a pure cache in that it actually gets to read the code, so it knows the exact number of variables and when they get used, whereas a pure cache has to be an oracle :) So a compiler can try to optimize the cache placements (register assignments) ahead of time by analyzing the code. That's where these algorithms come into play. Knowing nothing else, we can safely guess that all compiler algorithms must be beating something simple like an LRU.

  • mystified5016 3 days ago

    It depends heavily on the target CPU. Some CPUs do have instructions that operate on RAM directly, no register required.

    However, the tradeoff is that accessing RAM is almost always 2-4 times slower than accessing registers. Most of the time it's more efficient to copy from RAM to a register and back.

    Most CPUs are designed to operate on registers, but it's not too uncommon to see simple things like incrementing a variable in RAM without an intermediate copy. There might be more; I don't know a lot about x86.

  • noelwelsh 3 days ago

    That's not 100% correct but it's correct enough to give sufficient intuition to understand the problem.

travisgriggs 3 days ago

Reading the top-voted answer made me feel like I was reading one of Jon "Hannibal" Stokes' CPU praxis write-ups from the early years of Ars. Anyone else remember those?