I used "FS/GS" because GS is used instead of FS on some systems for the same purpose.
The shared library-specific issues are one of the reasons I was suggesting maybe looking into hashing, e.g. perhaps as a fallback solution when the TLS approach fails.
I think GP is talking about something like this [0]. You let it call __tls_get_addr() once in a constructor, take the offset from %fs, store it in a static variable, and use that offset directly. (The static variable doesn't need to be atomic, since it's only written to once, at dlopen() time.)
Is using pthread_key_create / pthread_setspecific / pthread_getspecific / pthread_key_delete better or worse or the same? As I understand it, this is the low level API you were looking for.
Ah, but you can "inline" it by hardcoding its implementation details into your code in the most horrible way imaginable, as suggested by a crafty reader - which, scarily enough, seems to be the fastest and most robust TLS access method from a shared library: https://yosefk.com/blog/cxx-thread-local-storage-performance...
Big fan. I wish you could write more. I am gonna go check the assembly of my own code for this. And I am going to integrate funtrace into my workflow.
Question about your build system comment: my build system doesn't stoop to figuring out if `-fPIC` is needed, but it also doesn't add it unless the user asks. Were you talking about that or build systems that add it automatically?
I don't think build systems add -fPIC automatically, nor remove it automatically. C++ build systems do not stoop to the question of how to best build a C++ program, by and large. They are more task graph executors - either bad ones, like make, or good ones, like Bazel, but mostly task graph executors; the most "C++ support" you will get is native support for scanning #include files (as opposed to doing it yourself like make forces you to.)
Idk about Bazel, but Buck2 (and Buck1, but 2 is better in every way) does handle adding -fPIC for shared libraries and leaving it off when statically linking. Each library has a PIC and a non-PIC version, and the appropriate one is selected depending on whether the library is being linked into an executable or a shared library. (The versions are only built when needed.) I don't remember whether it handles -fPIE similarly smartly, though, if you do want position-independent executables.
> I don't think build systems add -fPIC automatically, nor remove it automatically.
GNU Autotools do, if you tell them that you have a shared library. You can also switch between static and dynamic linking at build time, and if you also use GNU Libtool, it will figure out the right flags for your build platform.
> as opposed to doing it yourself like make forces you to
You definitely don't have to do that. GNU Automake does it by default, but if you are using plain Make, you can also use makedepend or the appropriate flags of your compiler.
> either bad ones, like make
What is wrong/missing with make as a task graph executor?
I want to build a "standard library" for my build system that would stoop to that. I can do that because my build system is not just a task graph executor; it is backed by a full programming language and can add its own libraries to that language. IOW, I can add a `cpp` package to the build system that implements support for how to best build a C++ program.
So if you have a wishlist for that support, I'd love to hear it.
> If I manage to stay out of trouble, it’s rarely because of knowing that much, but more because I’m relatively good at 2 other things: knowing what I don’t know, and knowing what I shouldn’t know.
> [...]
> I don’t know how to generalize the principle to make it explicit and easy to follow.
Coming from mathematics, that is what I would call using the right level of abstraction.
If you want to prove 0 + 0 = 0 and you're getting tied up with stuff like how the termwise sum of two Cauchy sequences converges to the sum of the two limits, then you're not working at the right level of abstraction. You're not supposed to know about Cauchy sequences yet if all you're given is the neutral element for addition.
In some rare cases it can help knowing about sublevels of abstraction. Such as the difference between a general linear space and one equipped with an inner product. Just because you can make an inner product doesn't mean you should, and if you don't you'll find some arguments a lot easier because you're not distracted by stuff like adjoints and orthonormal basis vectors etc. (one side effect is that gradient descent no longer works, and you really ought to know why). You can do similar things by refusing to decide on a coordinate system.
Regarding `cmpb $0, %fs:__tls_guard@tpoff`: the per-function-call overhead is due to the dynamic-initialization-on-first-use requirement:
> Block variables with static or thread (since C++11) storage duration are initialized the first time control passes through their declaration (unless their initialization is zero- or constant-initialization, which can be performed before the block is first entered). On all further calls, the declaration is skipped. --- https://en.cppreference.com/w/cpp/language/storage_duration
> If you know x does not need dynamic initialization, C++20 constinit can make it as efficient as the plain old `__thread`. [[clang::require_constant_initialization]] can be used with older language standards.
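A minimal illustration of the difference (variable names are just for the example):

```c++
// C++20: constinit guarantees constant initialization, so the compiler
// emits no guard variable and no wrapper-function call on access.
thread_local constinit int counter = 0;

// Pre-C++20 spelling of the same guarantee, clang only:
[[clang::require_constant_initialization]]
thread_local int counter2 = 0;
```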
Regarding `data16 lea tls_obj(%rip),%rdi` in the general-dynamic TLS model, yeah it's for linker optimization. The local-dynamic TLS model doesn't have data16 or rex prefixes.
Regarding "Why don’t we just use the same code as before — the movl instruction — with the dynamic linker substituting the right value for tls_obj@tpoff?"
Because -fpic/-fPIC was designed to support dlopen.
The desired efficient GOTTPOFF code sequence is only feasible when the shared object is available at program start, in which case you can guarantee that
"you would need the TLS areas of all the shared libraries to be allocated contiguously:"
With dlopen, the dynamic loader needs a different place for the TLS blocks of newly loaded shared libraries, which unfortunately requires one more indirection.
Regarding "... and I don’t say a word about GL_TLS_GENERATION_OFFSET, for example, and I could."
`GL_TLS_GENERATION_OFFSET` in glibc is for the lazy TLS allocation scheme. I don't want to spend my valuable time on its implementation...
It is almost infeasible to fix on the glibc side.
> the per-function-call overhead is due to dynamic initialization on first use requirement
Thanks - I didn’t realize this was mandated by the standard as opposed to “permitted” as one possibility (similarly to how eg a constructor of a global variable can be called before main or upon first use or anywhere in-between according to the standard). Updated the post with this point.
> The desired efficient GOTTPOFF code sequence is only feasible when the shared object is available at program start, in which case you can guarantee that “you would need the TLS areas of all the shared libraries to be allocated contiguously”
Indeed I didn’t mention -ftls-model=initial-exec originally (I now added it based on reader feedback; it can work when it will work, which for my use case is a toss-up I guess…), but my point is that you could allocate the TLSes contiguously even if dlopen was used, and I describe how you could do it in the post, albeit in a somewhat hand-wavy way. This is totally not how things were done and I presume one reason is that you don’t carve out chunks of the address space for a use case like this as described in my approach - I just think it would be nice if things worked that way.
Actually sounds like it isn't mandated by the standard after all; it's mandated for block thread_locals but not for thread_locals in the global scope:
3.7.2/2 [basic.stc.thread]: A variable with thread storage duration shall be initialized before its first odr-use (3.2) and, if constructed, shall be destroyed on thread exit.
This allows the constructor to be called at any point before the first use, similarly to "normal" globals, though implementations made different tradeoffs in these 2 cases
I never investigated TLS perf nearly this deeply, but in a previous DB project we could only salvage perf by banning thread_local in favor of __thread, so it was impossible for TLS to reference anything but a bare scalar or pointer. We just rolled up all our TLS objects into one struct, heap-allocated that struct in the thread routine, and put a pointer to that struct in TLS. It was easy to write trivial accessors to these objects through the TLS pointer that were always inlined.
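A sketch of the pattern described above, with illustrative names (not the actual project's code):

```c++
// One struct holds all per-thread state; TLS itself is a bare __thread
// pointer, so there is no guard variable and no constructor on access.
struct PerThread {
    long counter = 0;
    char scratch[4096];
};

static __thread PerThread *tls_state;          // scalar pointer only

void thread_routine_prologue() {               // call at the top of each thread
    tls_state = new PerThread();
}

// Trivial accessor that inlines to a %fs-relative load plus a dereference:
static inline long &counter() { return tls_state->counter; }
```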
Why aren't thread-local constructors called when the thread is created rather than doing a test on every variable access? Perhaps they wanted to have the OS-native pthread_create/CreateThread "work out of the box" without having to manually call `construct_thread_locals()` or whatever? That would be an insane tradeoff, especially considering `construct_thread_locals()` can be called automatically when using std::thread.
I think this, and all the other weirdness, is because usage of thread-local variables is pretty low. Probably somebody wanted to have "lightweight" threads doing work as quickly as possible accessing only local variables.
The situation is much more complicated than it sounds, because the question of where the variables are isn't known at compile time. It has to be done at link time. Which may include dynamic link time as a library is loaded. That all sounds fairly horrible.
All to handle static constructors. If you don't have thread-static variables, or they don't have constructors, it's much simpler.
I don't buy the static linking part - we already generate a single function in the linked binary that calls all static constructors from all translation units, the same thing can be done for TLS constructors (what I called construct_thread_locals() in parent), and force the user to call that manually on thread creation, or call it as part of std::thread.
Dynamic linking is a whole other beast, and maybe for thread_locals in a DLL it's acceptable (or the only possible way?) to construct them lazily - if you're using DLL+TLS+global constructors you deserve the pain.
The problem is that dynamic linking is the norm. If you're writing a specific application, that's one thing. But if you're writing some random C++ library then you usually have no idea whether it will be compiled into an executable or a shared library. If TLS is slow in either of those cases, then the only solution is to not use TLS at all.
Because that can be very expensive. If you spawn a thread which does a little work and exits, having that also first need to initialize every thread local from every loaded library on the off chance that it might need that thread local would probably be not worth it.
Creating a thread is already very expensive; in a sane program, your ten TLS constructors shouldn't come even close to the amount of bookkeeping that the kernel has to do to spin up a new thread.
Also creating a thread, doing little work and exiting immediately is a bad use of threads -- I don't think that everyone using thread_local should be punished to better support poorly written programs. Most threads are long-running and most programs spin up a bunch of worker threads at initialization time, I'd argue variable access time is way more important than thread startup time.
In a simple benchmark, creating a thread & waiting on it in a loop (pthread_create+pthread_join) takes 0.03ms = 30000ns, so running a bunch of constructors of a bunch of libraries could actually be a non-insignificant amount of time.
Do agree that threadlocal read speed should be a lot more important, but it'd quite suck if the "acceptable performance" thread usage approaches varied dramatically depending on which libraries you (or something else) have unrelatedly loaded. (though maybe other overheads appear similarly from other sources, keeping this from being the most important slowdown, idk)
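For reference, a minimal version of that create+join benchmark (numbers vary a lot with kernel, hardware, and thread attributes):

```c++
#include <pthread.h>
#include <chrono>
#include <cstdio>

static void *noop(void *) { return nullptr; }

int main() {
    const int N = 10000;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; i++) {
        pthread_t t;
        pthread_create(&t, nullptr, noop, nullptr);
        pthread_join(t, nullptr);
    }
    auto ns = std::chrono::duration<double, std::nano>(
                  std::chrono::steady_clock::now() - t0).count();
    std::printf("%.0f ns per create+join\n", ns / N);
}
```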
If you make use of dynamic linking namespaces and some sort of runtime, you can make it impossible to enter specific libraries in certain threads and avoid that cost. If your application is large enough that you can't keep track of deps, then you may want something like that in your application if you're not willing to split it into multiple processes.
> Creating a thread is already very expensive, in a sane program your ten TLS constructors shouldn't come even close to the amount of bookkeeping that the kernel has to do to spin up a new thread.
This is not true, at least on Linux. With appropriate settings (e.g. a small stack) thread creation can be extremely cheap.
> Is this a deliberate performance tradeoff, benefitting code with lots of thread_locals and starting threads constantly, with each thread using few of the thread_locals, and some thread_locals having slow constructors? But such code isn't great to begin with?
Actually, thread globals only being used in some threads is probably the common case. Remember that threads are not only created by the main executable but often also by libraries. Keeping per-thread overhead low until the TLS variables are actually needed sounds like a good design goal.
> funtrace sidesteps the TLS constructor problem by interposing pthread_create, and initializing its thread_locals in its pthread_create wrapper
Sounds extremely fragile. What if someone calls clone() directly, or another (possibly new) function to create a thread?
I dunno about extremely fragile; I recommend trying funtrace on your own code - pretty sure it will work out of the box! But what's the less fragile alternative to interposing pthread_create that doesn't add a load and a branch to every function call and return, which is where a tracing profiler needs to access TLS?..
In any case, one point of funtrace is that it's small (~1K LOC runtime) and you can tweak it easily, including calling its TLS init code from your threads if you don't create them with pthread_create "like most people" - even without this issue, people do "green threads" with setcontext/getcontext and who knows what else, which will need their own hacks to support, and so I think that an easily hackable runtime is a good thing given how hard it is to do this in a one-size-fits-all fashion.
As a counterexample, LLVM XRay, a tracing profiler from Google which is at least 10x bigger than funtrace (if you don't count the compiler instrumentation they introduced), took almost a decade to gain shared library support, and it's not done yet. So I think "small and hackable" has its advantages.
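For readers who haven't seen interposition before, a bare-bones sketch of the pthread_create-wrapping idea (not funtrace's actual code; error handling omitted):

```c++
// Build this into the tracing .so; the dynamic linker resolves the
// program's pthread_create calls here, so per-thread init runs first.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE                    // for RTLD_NEXT
#endif
#include <dlfcn.h>
#include <pthread.h>
#include <cstdlib>

struct Trampoline { void *(*fn)(void *); void *arg; };

static void *run(void *p) {
    Trampoline t = *static_cast<Trampoline *>(p);
    std::free(p);
    // ... initialize this thread's trace buffer / thread_locals here ...
    return t.fn(t.arg);
}

// noexcept matches glibc's C++ declaration of pthread_create.
extern "C" int pthread_create(pthread_t *tid, const pthread_attr_t *attr,
                              void *(*fn)(void *), void *arg) noexcept {
    static auto real = reinterpret_cast<decltype(&pthread_create)>(
        dlsym(RTLD_NEXT, "pthread_create"));
    auto *t = static_cast<Trampoline *>(std::malloc(sizeof(Trampoline)));
    t->fn = fn;
    t->arg = arg;
    return real(tid, attr, run, t);
}
```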
One way to avoid the extra code from constructors is to make them constexpr. This way the compiler can initialize it with 0 or a value and not generate all that extra code to execute a constructor at runtime.
I'm not sure how many people interested in this article are interested in C++ compile times, but I once measured and wrote an article: https://bolinlang.com/wheres-my-compile-time
I haven't fully dug through the article, just skimmed over it; looks like all good info all around.
I'm also mostly familiar with Windows, and on Windows until recently (a few years ago), dynamically loading a .dll (with "LoadLibrary", the equivalent of "dlopen") caused issues with the .dll's own thread_locals. Microsoft fixed this, but folks have observed slower code.
> dynamically loading (e.g. with "LoadLibrary", e.g. "dlopen") a .dll caused issue with the .dll's own thread_locals.
Do you know what these issues were? I'm curious because I'm working on Pd (https://en.wikipedia.org/wiki/Pure_Data), which uses lots of thread local variables internally when built as a multi-instance library. Libpd itself may be loaded dynamically when embedded in an audio plugin. I'm not aware of any problems so far...
It used to be that .dlls loaded by the .exe on startup (e.g. implicitly listed there) would get their thread local vars correctly (TLS), but dlls loaded later (like /DELAYLOAD or through LoadLibrary) would not. (the workaround was to initialize these through TlsAlloc/TlsFree, and have a hook in DllMain to clean up)
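Roughly, the workaround pattern looked like this (a hedged sketch, not any particular codebase's code):

```c++
#include <windows.h>
#include <cstdlib>

static DWORD g_slot = TLS_OUT_OF_INDEXES;

BOOL WINAPI DllMain(HINSTANCE, DWORD reason, LPVOID) {
    switch (reason) {
    case DLL_PROCESS_ATTACH: g_slot = TlsAlloc(); break;
    case DLL_THREAD_DETACH:  std::free(TlsGetValue(g_slot)); break; // per-thread cleanup
    case DLL_PROCESS_DETACH: TlsFree(g_slot); break;
    }
    return TRUE;
}

// Lazily allocate the per-thread value on first access in each thread:
static int *per_thread_value() {
    void *p = TlsGetValue(g_slot);
    if (!p) {
        p = std::calloc(1, sizeof(int));
        TlsSetValue(g_slot, p);
    }
    return static_cast<int *>(p);
}
```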
Note that MinGW uses libwinpthread, which is known to have slow TLS behavior anyway (I've observed a 100% overhead compared to running the same program under WSL using a linux-native GCC). c.f. https://github.com/msys2/MINGW-packages/discussions/13259
Then last year clang-cl also added ways to disable this (if needed); probably this hit some internal issue and had to be resolved. Maybe "thread_local" has become more widely used (unlike the OS-specific "TlsAlloc").
So we can avoid a lot of this pain by not using constructors for thread_local variables, right? We can roll our own version of this using a thread_local bool, or just interpose thread spawning as you've done.
As far as why it's worse with fPIC/shared: there are a variety of TLS models: General Dynamic, Local Dynamic, Initial Exec, and Local Exec. And they have different constraints / generality. The more general models are slower. IIRC IE/LE won't work with shared library thread_locals, but it's been a while so don't quote me on that.
Generally agree that it seems like the compiler could in theory be doing a better job in some of these circumstances.
> I’m sure there’s some dirty trick or other, based on knowing the guts of libc and other such, which, while dirty, is going to work for a long time, and where you can reasonably safely detect if it stopped working and upgrade it for whatever changes the guts of libc will have undergone. If you have an idea, please share it!
Find the existing TLS allocations, hope there's spare space at the end of the last page, and just map your variables there using %fs-relative accesses?
Always fun to see another high performance tracing implementation. We do some similar things at work (thread-local ringbuffers), though we aren't doing function entry/leave tracing.
> in my microbenchmark I get <10 ns per instrumented call or return
A significant portion of that is rdtsc time, right? Like 50-80%. Writing a few bytes to a ringbuffer prefetched in local cache is very very cheap but rdtsc takes ~4-8 nanos.
> While we're on the subject of snapshots - you can get trace data from a core dump by loading funtrace_gdb.py from gdb
Nice. We trace into a shared memory segment and then map it at start time, emitting a pre-crash trace during (next) program start. Maybe makes more sense for our use case (a specific server that is auto-restarted by orchestration) than a more general tracing system.
> A significant portion of that is rdtsc time, right? Like 50-80%. Writing a few bytes to a ringbuffer prefetched in local cache is very very cheap but rdtsc takes ~4-8 nanos.
Exactly right, though it seems totally insane to me coming from an embedded background where you get the cycle count in less than 1 ns but writing to the buffer would be "the" performance problem (and then you could eg avoid writing short calls as measured at runtime, but on x86 you will have spent too much time on the rdtsc for this to lower the overhead.) There's also RDPMC but it's not much faster and you need permissions(tm) to use it, plus it stops counting on various occasions which I never fully understood.
Regarding prefetching - what do you do prefetching-wise that helps performance?.. All my attempts to do better than the simplest store instructions did nothing to improve performance (I tried prefetchw/__builtin_prefetch, movntq/_mm_stream_pi and vmovntdq/_mm_stream_si128; all of them either didn't help or made things even slower)
This is missing out on considerations re. tls-model=local-dynamic and visibility; making variables non-preemptible by declaring them static or with visibility("hidden"/"internal") significantly reduces TLS cost.
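Concretely, the kind of declarations meant here (whether the compiler actually relaxes the TLS access sequence in response varies, as the replies below show):

```c++
// Non-preemptible TLS: the definition cannot be interposed from another
// DSO, so the compiler is allowed to pick a cheaper TLS model.
__attribute__((visibility("hidden")))
thread_local int hidden_tls;

static thread_local int internal_tls;   // internal linkage: same idea
```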
-ftls-model=local-dynamic barely changes the generated code compared to the default (global-dynamic); you no longer have data16 prefixes around the call to __tls_get_addr (which means this call can't be replaced with something cheaper if you don't put the .o into a .so in the end) but otherwise it's the same.
-fvisibility=hidden / __attribute__ ((visibility ("hidden"))) doesn't seem to do anything.
Are you sure that either of these speeds up TLS access for thread_locals defined in shared libraries?
If I remember correctly, local-dynamic is relevant if you access multiple thread-local variables, as the offset between them will be constant. The visibility attributes should allow the compiler to switch from global-dynamic to local-dynamic automatically. It's been a bit since I looked at all of this.
Also note that these sequences are highly architecture dependent and cost as well as cost differences will vary e.g. for ARM or ARM64.
I updated the post with the point on visibility; in my tests inspired by https://lobste.rs/s/b5dnjh/0_0_0_c_thread_local_storage_perf... I see that clang improves codegen thanks to hidden visibility but g++ does not, and that comment says clang doesn't do it on all platforms.
I stick to the broad conclusion that thread_locals without constructors linked into an executable rather than a shared library are the fastest and most performance-portable by far, but the visibility point is very worth mentioning.
As well as other costs. That we default to global visibility is another one of the ossified defaults the article mentions. One of the first things I look at when trying to improve the performance of a C++ project is its linking setup --- most of the time, people are using the terrible ELF defaults and paying in code size and execution time for something they don't need.
I can confirm that C (GCC and Clang) will generate this:
```
7e2f0: f3 0f 1e fa             endbr64
7e2f4: 64 48 8b 04 25 f8 ff    mov    %fs:0xfffffffffffffff8,%rax
7e2fb: ff ff
7e2fd: c3                      ret
7e2fe: 66 90                   xchg   %ax,%ax
```
But they did not inline the function that does that, even with `__attribute__((always_inline))` and link-time optimization. I need to investigate that.
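For context, a source sketch of the kind that produces the code above (assuming local-exec/initial-exec TLS, e.g. a `__thread` variable defined in the executable):

```c++
static __thread long x;          // no constructor, hence no guard check

long get_x(void) { return x; }   // compiles to: mov %fs:OFFSET, %rax; ret
```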
Is this from a shared library (compiled with -fPIC and linked with -shared) though? That's where the big troubles start and I think C isn't better than C++ here
I am grateful that I only use one multi-call executable (built from a monorepo made partially because of your post, btw) because I think shared libraries in C might have the same sort of problems, just without constructors.
One more entitled developer with a claim that something is slow, but who can't really prove it because he can't measure it, nor does he have any other working code to compare against to show how blazingly fast it could be. Gotcha.
Tip to the author: you can implement a lock-free buffer pool cache for as many threads as you need in ~200 LoC. Another tip: it's not gonna be faster, but you may get more control over the resources.
At the bottom of the article the author provides links to regression reports. In one workload calculating the thread local address was taking 10% of execution time up from 1%. The burden of proof would seem to be on you.
No, that's not how it works. The burden of proof is on the one who is claiming the regression or design flaw, and in this case I am not sure what exactly the author can even claim? He did not run a single experiment where he would show that a non-inlined ctor call is creating runtime pressure on his fancy profiler. Not a single experiment. Not a single data point. This is a pure form of cargo-cult programming, not engineering.
The unfortunate effect of this and similar blogs is that tomorrow you will have to deal with someone at work who happened to read this and similar material on the web and takes it for granted. He will take it further and implement an "optimization", yet when you ask him to demonstrate the problem he was solving you will get a big nothing. Just as in this case.
No, I actually read the article and I don't think I was spectacularly rude. I was rather just straight to the point. And rather than bitching "omg, why, oh why did a compiler turn a ctor into a function call here?!", what I think I gave is a constructive suggestion for how to fix the problem, if there was even one to begin with.
What I see here in the article is pure theorizing and dwelling on details which in 99.9% of cases do _not_ matter. There wasn't a single experiment shown by the author by which he shows that his use case is in that 0.1%.
I was basically done with reading the article when I read the following nonsense
> accessing an extern thread_local with a constructor involves a function call, with the function testing the guard variable of the translation unit where the thread_local is defined.
And then
> But with inlining, the fast path is quite fast on a good processor.
Right, the fast path of an inlined ctor call vs the slow path of a non-inlined ctor call. Really? I call BS on this, especially without the experiments showing the data.
Author here - any tips on dirty tricks somehow using %fs + offset from a shared library, without calling __tls_get_addr and without recompiling the executable (for instance, from a Python extension module) are most welcome! (I have one such trick - mmaping an address produced by ORing some high bit into %fs, and accessing that - but it's seriously dirty and I doubt it would survive outside a limited set of use cases.)
Separately, the post describes one of a few aspects of optimizing funtrace (https://github.com/yosefk/funtrace), which I think is the fastest open-source function tracing profiler for C++ today, and which can be ported relatively easily to Rust/other languages producing native code (the runtime is ~1200 LOC of C++; the trace decoder would need very few changes.) A description of how funtrace works as well as why you'd want a tracing profiler (as opposed to a sampling profiler like perf) and how you'd use it is here: https://yosefk.com/blog/profiling-in-production-with-functio...
With glibc, you can use -ftls-model=initial-exec (or the corresponding variable attribute) to get offset-based TLS. The offset is variable per program (unlike local-exec), but the same for all threads, so it's more efficient. Using too much initial-exec TLS (potentially across multiple shared objects) eventually causes dlopen to fail because the TCB cannot be resized. This is not a problem if the shared objects are loaded through dependencies at process start.
If initial-exec TLS does not work due to the dlopen issue, on x86-64 and recent-enough distributions, you can use -mtls-dialect=gnu2 to get a faster variant of __tls_get_addr that requires less register spilling. Unfortunately glibc and GCC originally did not agree on the ABI: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113874 https://sourceware.org/bugzilla/show_bug.cgi?id=31372 This has been fixed for RHEL 10 (which switched to -mtls-dialect=gnu2 for x86-64 for the whole distribution, thereby exposing the ABI bug during development). As the ABI was fixed on the glibc side in dynamically-linked code, the change is backportable, but it's a bit involved because the first XSAVE-using change upstream was buggy, if I recall correctly. But the backport is definitely something you could request from your distribution.
Note that there was a previous bug in __tls_get_addr (on all architectures that use it), where the fast path was not always used after dlopen: https://sourceware.org/bugzilla/show_bug.cgi?id=19924 This bug introduced way more overhead than just saving registers. I expect that quite a few distributions have backported the fix. This breaks certain interposed mallocs due to a malloc/TLS cyclic dependency, but there is a workaround for that: https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=018f0...
The other issue is just that the C++ TLS-with-constructors design isn't that great. You can work around this in the application by using a plain pointer for TLS access, which starts out as NULL and is initialized after a null check. To free the pointer on thread exit, you can use a separate TLS variable or POSIX thread-specific data (pthread_key_create) to register a destructor, and that will only be accessed on initialization and thread exit.
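A sketch of that workaround (illustrative names; the key point is that a constant-initialized TLS pointer needs no guard):

```c++
#include <pthread.h>

struct State { /* per-thread data */ };

static thread_local State *tls;        // plain pointer: no guard, no ctor
static pthread_key_t key;

static void destroy(void *p) { delete static_cast<State *>(p); }

static State &get_state() {
    if (!tls) {                        // the only check on the fast path
        static pthread_once_t once = PTHREAD_ONCE_INIT;
        pthread_once(&once, [] { pthread_key_create(&key, destroy); });
        tls = new State;
        pthread_setspecific(key, tls); // destroy() runs at thread exit
    }
    return *tls;
}
```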
This sort of question is probably more suited to libc-help: https://sourceware.org/mailman/listinfo/libc-help/
> Using too much initial-exec TLS (potential across multiple shared objects) eventually causes dlopen to fail because the TCB cannot be resized.
This feels like the perfect situation to preallocate a gigabyte or something of virtual memory for extending the TLS, similar to how the stack is. But, testing on my system, looks like the allowed initial-exec TLS size is just ~1700 bytes.
There is the idea floating around to do this for the main thread (especially for audit mode, where the automatic static TLS size discovery does not work due to the way the audit interfaces are defined, it's not merely an implementation limitation). This would allow for adjustment of initial-exec TLS expectations if the critical dlopen calls happen before the process goes multi-threaded. For threads created with pthread_create, the TCB is still at the top of the stack, which is quite efficient for access. Moving that would probably consume a TLB slot. Spacing things far apart due to aggressive address space reservations will impact TLB miss performance.
Remember that you pay this cost for every thread, not just the main thread.
Frankly, if your name isn't `libGL.so` you shouldn't even try to mix initial-exec with dlopen. Just link your libraries normally dammit!
But with virtual memory it wouldn't be much of a cost at all (..on 64-bit systems, that is; things are more sad on 32-bit if one cares about those). Just some kernel-internal data structure configuration to ensure that future memory page allocations don't overlap this one.
dlopen is a requirement for importing native libraries in non-compiled languages; and, regardless, I as a library author don't get to choose whether users will avoid using dlopen and so have to assume worst-case.
Virtual memory still costs, you know, something like 0.2% of virtual memory space in page table entries. 1 GB of VMA per thread is 2MB of real RAM cost per thread. And there's absolutely no need for that kind of space use -- the thread-local variable can just be a pointer to a heap-allocated large object.
Only if mapped. If not mapped, there's no need for page table entries.
In addition to the ways that page table entries can be avoided, the system can use large pages for all the areas you aren't using yet, cutting the overhead to 4KB.
Yeah, a gigabyte is most likely extreme overkill indeed; a megabyte or so would be plenty. Though the goal would be to let these threadlocals be arbitrarily large, like non-initial-exec threadlocals, so that it wouldn't break anything ever.
I don't know how the kernel manages it internally, but there's no need for PROT_NONE preallocated virtual memory to be mapped to actual CPU-accessible pages at least; and `mmap(NULL, 1ULL<<46, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0)` takes ~4 microseconds to map 64 terabytes of virtual memory so it's definitely not 0.002x overhead. (perhaps the overhead amount changes depending on how close to a page level the size is, but it shouldn't be too much regardless)
This'd essentially be turning the preallocated TLS space into a memory allocation arena (and you could actually even just choose to provide an alloc+free interface for programs to dynamically allocate fs-relative offsets to use for custom threadlocals?).
(then there's the general problematic nature of virtual memory; such PROT_NONE never-touched memory still counts towards virtual memory usage, which is annoying; browsers/Java/etc already suffer from this, but it'd be rather ugly for literally all processes to have such. I'd quite like a memory usage counter that includes all memory that is or ever was writable, but not PROT_NONE never-touched; i.e. how much memory the process can eventually require without explicitly requesting more via syscalls, but afaik such a counter just doesn't exist, or at least isn't a standard-displayed thing)
> how much memory the process can eventually require without explicitly requesting more via syscalls
This concept is called "commit charge". Windows MM models it explicitly. Linux ought to as well. I agree it's a more useful concept than just address space allocated.
Interesting! Some searching later, it looks like htop's DATA/M_DRS counter (i.e. the second-to-last number in /proc/<PID>/statm) appears to count something related-ish; i.e. it doesn't count a PROT_NONE mmap, but does count a PROT_READ|PROT_WRITE untouched one. Nothing in statm appears to count untouched writable MAP_SHARED, though; (potentially-)shared mappings do get complicated in general.
Is MADV_FREE memory charged? It contributes to classic RSS, but can be discarded by the kernel if it deems that beneficial.
Some more experimentation later, it seems to be more like just counting PROT_WRITE+MAP_PRIVATE mappings or so; i.e. mprotect(PROT_NONE)ing (or even just PROT_READ) a writable region results in it not being counted, even if the region was modified and thus must actually be persisted. So it can actually get meaningfully lower than RSS. :/
MADV_FREE doesn't affect anything afaict.
Take a look at https://maskray.me/blog/2021-02-14-all-about-thread-local-st...
I think what you want is to force the "initial exec" model using an attribute to get the more efficient code.
IIRC this (or something equivalent) is what libGL does because basically every OpenGL function call needs to read the thread-local variable holding the current GL context.
The downside is that dlopen()ing your library may fail.
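The attribute form, scoped to a single variable (GCC and Clang):

```c++
// Equivalent to compiling the translation unit with
// -ftls-model=initial-exec, but only for this variable:
__attribute__((tls_model("initial-exec")))
static thread_local void *current_context;
```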
Doesn't the "initial exec" model require the user to run the executable with LD_PRELOAD=your.so or linking against your.so (DT_NEEDED)? eg would it work with a Python extension module which is dlopen'd at runtime?
It depends on the amount of TLS and the number of objects that do this, and the C library. In glibc, we have a small static TLS reservation dedicated for use by dlopen of shared objects with initial-exec.
For your specific use case, CPU-local (rather than thread-local) memory might also be an option. Search for rseq on Linux. Relatively fresh feature though, not a lot of experience around it exists yet.
This wouldn't have syntax sugar like thread_local, but presumably you're just calling some inlined helper function?
How do you avoid the problem of threads migrating between CPUs at arbitrary instruction boundaries?
> How do you avoid the problem of threads migrating between CPUs at arbitrary instruction boundaries?
rseq is short for restartable sequences; you mark the beginning and end of a range of instructions, plus an abort path. The kernel checks this during task switch, and if you're within the range, the instruction pointer is changed to the abort path on resumption.
(It's called restartable because the assumption is that the abort path will try again. Or at least, aborting the block of instructions midway through is recoverable.)
The primary limitation is that the "result" of the block in most cases needs to be concentrated into the last instruction of the block (similar to an atomic release write.) Otherwise you'd need to somehow recover from partially executed rseq blocks having changed some visible state but not fully completed.
Known users:
- LTTng tracer
- tcmalloc
Curious if there are other prominent users of rseq.
For a tracing profiler, you want to know which thread a function call or return was made by. LTTng has kernel modules which it can use to trace context switches, and then a per-CPU trace buffer is fine, provided that you get cheap atomic writes which rseq can be used for.
Funtrace on the other hand does support ftrace for tracing context switches (https://yosefk.com/blog/profiling-in-production-with-functio...), but it doesn't require ftrace for tracing function calls made by your threads (the problem with ftrace as well as with LTTng's kernel modules being, of course, permissions; which shouldn't be a problem in any reasonable situation by my standard of "reasonable", but many find themselves in unreasonable situations permissions-wise, sadly.) So I don't think funtrace can use rseq, though I might be missing something.
Presumably you could store the TID in every event, or otherwise check whether the TID has changed since the last time it was logged and push a (timestamp, TID) pair if so. Reading TID should be cheap.
In what sense should reading the TID be cheap? You would need either a syscall (not cheap) or thread-local storage (the subject of TFA.) Avoiding the use of TLS by reading the TID can't really work.
It looks like the TID is stored directly in the pthread struct pointed to by %fs itself, at a fixed offset which you can somewhat-hackily compile into your code. [0]
In the process of investigating this, I also realized that there's a ton of other unique-per-thread pointers accessible from that structure, most notably including the value of %fs itself (which is unfortunately unobservable afaict), the address of the TCB or TLS structures, the stack guard value, etc. Since the goal is just to have a quickly-readable unique-per-thread value, any of those should work.
Windows looks similar, but I haven't investigated as deeply.
[0] https://github.com/andikleen/glibc/blob/b0399147730d478ae451...
[1] https://github.com/andikleen/glibc/blob/b0399147730d478ae451...
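A sketch of the hack being described; the offset below is a made-up placeholder, not a stable ABI, so anything like this needs a startup sanity check against the real syscall:

```c++
#include <sys/syscall.h>
#include <unistd.h>

// glibc stores the TID inside struct pthread, which %fs points at.
// The field's offset differs across glibc versions/builds; 720 is
// purely illustrative.
static inline int tid_from_tcb() {
    int tid;
    __asm__ volatile("movl %%fs:720, %0" : "=r"(tid));
    return tid;
}

// Verify the guessed offset once at startup; fall back to the syscall
// (or cache its result) if the guess is wrong.
static bool tcb_tid_offset_ok() {
    return tid_from_tcb() == static_cast<int>(syscall(SYS_gettid));
}
```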
Be careful, we ran into a segfault with rseq when used in a DSO loaded by another DSO.
Hi, love the article. You mention in the article that a hardware mechanism for tracing should exist -- have you investigated the intel_pt (processor trace) extension? I believe this uses hardware buffers and supports timestamping & cycle counters (at somewhat larger than instruction granularity sadly, although it might issue forced stamps on branches, not sure).
You can also use the PTWRITE instruction to attach metadata to the stream which seems very powerful.
Hope we can see such an extension on AMD as well.
Intel PT is indeed useful (although very, very slow compared to regular sampling profiling), but there's hardly any CPUs that actually implement PTWRITE. (IIRC there's some obscure Xeon or something?)
Typically you get a cycle count every six branches, give or take.
Sampling profilers are indeed very low-overhead; however, they can't help debug tail latency, for which tracing profilers are indispensable:
https://yosefk.com/blog/profiling-in-production-with-functio...
https://danluu.com/perf-tracing/
Regarding the slowdown - magic-trace reports 2-10% slowdowns, which IMO is actually fine even for production (unless this adds up to a huge dollar cost, which for most people it won't), since in return for this you are actually capable of debugging the rare slowdowns which are the worst part of your user experience.
However, the hardware feature that I propose (https://yosefk.com/blog/profiling-in-production-with-functio...) would likely have lower overhead, since it relies on software issuing tracing instructions, eg at each function entry & exit (rather than at any control flow change), and it could be variously selective (eg exclude short functions without loops, and/or configure the hardware to ignore short calls; BTW maybe you can do this with Intel Processor Trace, too, I'm just not really familiar with it.)
I discuss Intel Processor Trace in the writeup where I propose my much simpler hardware support for tracing: https://yosefk.com/blog/profiling-in-production-with-functio...
Like I said there, I'm frankly shocked that all CPUs haven't raced to implement similar features, that magic-trace, which is built on top of Intel Processor Trace, isn't used more widely, and that developers aren't insisting on running under magic-trace in production and requiring deployment on Intel servers for that purpose.
The extension I propose is much simpler, and seems similar to what PTWRITE would do if it were the only feature in Intel Processor Trace. I have a lot of experience in chip architecture, and I believe that every CPU maker can support this easily - much more easily than full feature parity with Intel Processor Trace. I hope they will!
One concern with PTWRITE is that it is somewhat "slow," at least according to this: https://community.intel.com/t5/Processors/Intel-Processor-Tr...
I wonder if this is a general issue relating to memory ordering or out-of-order execution, or whether this can be implemented more efficiently in a different extension.
Thank you for the linked article! Agreed on the huge potential for using these tools in production. The community could definitely benefit (even indirectly) by pushing for this kind of instruction set more widely.
GCC and Clang both provide `_readfsbase_u64()`, which presumably is just a wrapper around `__asm__("rdfsbase %0" : "=r"(result))`. Also, GCC lets you define variables like `int __seg_fs *foo;` if the `__SEG_FS` preprocessor symbol is defined. With Clang, you can use `int __attribute__((address_space(257))) *foo`.
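For concreteness, a sketch of both access forms (x86-64 only; `_readfsbase_u64` needs `-mfsgsbase` plus the CPU and kernel support discussed below; GCC's `__seg_fs` spelling is a C-only extension, so the %fs-relative load is shown in the Clang form):

```
#include <immintrin.h>
#include <cstdint>

uint64_t fs_base() {
    return _readfsbase_u64();  // compiles to RDFSBASE
}

#ifdef __clang__
// Clang: address space 257 means "address relative to the FS base",
// so the load below becomes a single %fs-relative mov.
int load_fs_relative(uintptr_t offset) {
    int __attribute__((address_space(257))) *p =
        (int __attribute__((address_space(257))) *)offset;
    return *p;
}
#endif
```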
Note that Linux support for FSGSBASE arrived ridiculously late, in Linux 5.9. The instructions were first added to CPUs with Ivy Bridge: https://www.intel.com/content/www/us/en/developer/articles/t... AMD support is of similar vintage.
As a result of the late kernel support, it may not be something everyone can turn on. Where it's unavailable, the replacement on GNU/Linux is just to load %fs:0, which the x86-64 ABI requires to hold the FS base (or %gs:0 on i386). However, for initial-exec TLS access, the address of the variable is usually not required to be in a register anyway.
Is __SEG_FS/%fs-relative access slower than loading the base address with RDFSBASE (which may require spilling a register) and then using base+offset access? I haven't seen such reports.
__builtin_thread_pointer() is the non-x86-specific way to get the appropriate register.
To the author: if you have a specific situation that the codegen is failing to optimize well (e.g. the ctor cases that you ran into), you can store the offset to your variable in a non-TLS global, then manually add the FS/GS base to it. Use inline asm if you need to bypass any init checks.
An alternative that might be worth looking into is just hashing the FS/GS into a table index. It will be slower than the well-optimized case, but it will let you opt out of the TLS allocation process altogether. This might be a good thing in some cases for a low-level facility like a function tracer.
I have updated the post with all the suggestions people replied with except this one, because I don't understand it. How do I allocate memory for addressing it with FS/GS? Isn't FS the register pointing to the TLS area - then how is the FS-based access you propose different from how TLS normally works? Isn't GS used "for something" on x86?.. If you could elaborate on this/show some code I would be very grateful!
Note that my question is about shared libraries. If the thread_local is linked into an executable, I guess you could indeed save the offset somewhere and then add the value of %fs to it, though if this is a way to work around the constructor issue, I prefer to not have a constructor. The question is if this sort of direction can help for thread-local storage allocated by a shared library.
> How do I allocate memory for addressing it with FS/GS?
Allocate the variable normally, then compute the offset in one thread (e.g. offset = uintptr_t(&variable) - get_fs()), then access it by adding the offset to FS in any thread (e.g. (vartype *) (offset + get_fs())). The only difference from how it normally works is that you can manually force it to be inlined, sidestepping the codegen problems you described in your post. But if you can avoid those problems by not using constructors instead, that's definitely better.
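A minimal sketch of this, assuming the variable sits at the same offset from the thread pointer in every thread (true for initial-exec TLS; not guaranteed in general for dlopen'd libraries) - `get_fs()` and `tls_var` are illustrative names:

```
#include <cstdint>

static thread_local int tls_var;  // the variable we want cheap access to
static uintptr_t tls_offset;      // ordinary global, written once

static inline uintptr_t get_fs() {
    uintptr_t base;
    __asm__("movq %%fs:0, %0" : "=r"(base));  // x86-64 ABI: %fs:0 holds the FS base
    return base;
}

void compute_offset() {  // run once, in any one thread
    tls_offset = reinterpret_cast<uintptr_t>(&tls_var) - get_fs();
}

static inline int *fast_tls_var() {  // then usable from any thread
    return reinterpret_cast<int *>(get_fs() + tls_offset);
}
```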
I used "FS/GS" because GS is used instead of FS on some systems for the same purpose.
The shared library-specific issues are one of the reasons I was suggesting maybe looking into hashing, e.g. perhaps as a fallback solution when the TLS approach fails.
I think GP is talking about something like this [0]. You let it call __tls_get_addr() once in a constructor, take the offset from %fs, store it in a static variable, and use that offset directly. (The static variable doesn't need to be atomic, since it's only written to once, at dlopen() time.)
[0] https://godbolt.org/z/o6se3je8v
Is using pthread_key_create / pthread_setspecific / pthread_getspecific / pthread_key_delete better or worse or the same? As I understand it, this is the low level API you were looking for.
It's much worse, since you can't inline it; it becomes a shared-library call.
Ah, but you can "inline" it by hardcoding its implementation details into your code in the most horrible way imaginable, as suggested by a crafty reader and which, scarily enough, seems to be the fastest and most robust TLS access method from a shared library: https://yosefk.com/blog/cxx-thread-local-storage-performance...
IIRC pthreads uses the same ABI under the hood, only it's compiled into libc.
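For reference, the pthread-specific-data pattern under discussion looks like this (payload size and names are illustrative) - note that every access goes through a libc call:

```
#include <pthread.h>
#include <cstdlib>

static pthread_key_t key;
static pthread_once_t once = PTHREAD_ONCE_INIT;

static void make_key() {
    pthread_key_create(&key, free);  // free() runs as the destructor at thread exit
}

void *per_thread_state() {
    pthread_once(&once, make_key);
    void *p = pthread_getspecific(key);
    if (!p) {
        p = calloc(1, 64);  // illustrative payload
        pthread_setspecific(key, p);
    }
    return p;
}
```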
Big fan. I wish you could write more. I am gonna go check the assembly of my own code for this. And I am going to integrate funtrace into my workflow.
Question about your build system comment: my build system doesn't stoop to figuring out if `-fPIC` is needed, but it also doesn't add it unless the user asks. Were you talking about that or build systems that add it automatically?
Thanks for your kind words!
I don't think build systems add -fPIC automatically, nor remove it automatically. C++ build systems do not stoop to the question of how to best build a C++ program, by and large. They are more task graph executors - either bad ones, like make, or good ones, like Bazel, but mostly task graph executors; the most "C++ support" you will get is native support for scanning #include files (as opposed to doing it yourself like make forces you to.)
Idk about Bazel, but Buck2 (and Buck1, but 2 is better in every way) does handle adding -fPIC for shared libraries and leaving it off when statically linking. Each library has a PIC and non-PIC version, and the appropriate one is selected depending on whether the library is being linked into an executable or a shared library. (The versions are only built when needed.) I don't remember if it also handles -fPIE similarly smartly, though, if you do want position-independent executables.
> I don't think build systems add -fPIC automatically, nor remove it automatically.
GNU Autotools do, if you tell them that you have a shared library. You can also switch between static/dynamic linking at build time, and if you also use GNU Libtool, it will figure out the flags for your build platform at build time.
> as opposed to doing it yourself like make forces you to
You definitely don't have to do that. GNU Automake does it by default, but if you are using plain Make, you can also use makedepend or the appropriate flags of your compiler.
> either bad ones, like make
What is wrong/missing with make as a task graph executor?
> or good ones, like Bazel
What can Bazel do better?
Makes sense.
I want to build a "standard library" for my build system that would stoop to that. I can do that because my build system is not just a task graph executor; it is backed by a full programming language and can add its own libraries to that language. IOW, I can add a `cpp` package to the build system that implements support for how to best build a C++ program.
So if you have a wishlist for that support, I'd love to hear it.
Calling perf a sampling profiler is somewhat misleading; it can trace and count as well.
Could it be productive to report the slowness to the compiler (or ld) makers? Is this inherently bad or could it be fixed by the toolmakers?
> If I manage to stay out of trouble, it’s rarely because of knowing that much, but more because I’m relatively good at 2 other things: knowing what I don’t know, and knowing what I shouldn’t know.
> [...]
> I don’t know how to generalize the principle to make it explicit and easy to follow.
Coming from mathematics, that is what I would call using the right level of abstraction.
If you want to prove 0 + 0 = 0 and you're getting tied up with stuff like how the sum of two Cauchy sequences should converge to the sum of the two limits, then you're not working at the right level of abstraction. You're not supposed to know about Cauchy sequences yet if all you're given is the neutral element for addition.
In some rare cases it can help knowing about sublevels of abstraction. Such as the difference between a general linear space and one equipped with an inner product. Just because you can make an inner product doesn't mean you should, and if you don't you'll find some arguments a lot easier because you're not distracted by stuff like adjoints and orthonormal basis vectors etc. (one side effect is that gradient descent no longer works, and you really ought to know why). You can do similar things by refusing to decide on a coordinate system.
Regarding `cmpb $0, %fs:__tls_guard@tpoff`: the per-function-call overhead is due to the dynamic-initialization-on-first-use requirement:
> Block variables with static or thread(since C++11) storage duration are initialized the first time control passes through their declaration (unless their initialization is zero- or constant-initialization, which can be performed before the block is first entered). On all further calls, the declaration is skipped. --- https://en.cppreference.com/w/cpp/language/storage_duration
From https://maskray.me/blog/2021-02-14-all-about-thread-local-st...
> If you know x does not need dynamic initialization, C++20 constinit can make it as efficient as the plain old `__thread`. [[clang::require_constant_initialization]] can be used with older language standards.
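A minimal illustration of that point (C++20; the type is made up):

```
struct Counter {
    int n;
    constexpr Counter(int v = 0) : n(v) {}  // constant-initializable
};

thread_local constinit Counter c1{42};  // plain data: no __tls_guard test
// thread_local Counter c2{rand()};     // dynamic init: guard check on access
```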
Regarding `data16 lea tls_obj(%rip),%rdi` in the general-dynamic TLS model, yeah it's for linker optimization. The local-dynamic TLS model doesn't have data16 or rex prefixes.
Regarding "Why don’t we just use the same code as before — the movl instruction — with the dynamic linker substituting the right value for tls_obj@tpoff?"
Because -fpic/-fPIC was designed to support dlopen. The desired efficient GOTTPOFF code sequence is only feasible when the shared object is available at program start, in which case you can guarantee that "you would need the TLS areas of all the shared libraries to be allocated contiguously:"
With dlopen, the dynamic loader needs a different place for the TLS blocks of newly loaded shared libraries, which unfortunately requires one more indirection.
Regarding "... and I don’t say a word about GL_TLS_GENERATION_OFFSET, for example, and I could."
`GL_TLS_GENERATION_OFFSET` in glibc is for the lazy TLS allocation scheme. I don't want to spend my valuable time on its implementation... It is almost infeasible to fix on the glibc side.
> the per-function-call overhead is due to dynamic initialization on first use requirement
Thanks - I didn’t realize this was mandated by the standard as opposed to “permitted” as one possibility (similarly to how eg a constructor of a global variable can be called before main or upon first use or anywhere in-between according to the standard). Updated the post with this point.
> The desired efficient GOTTPOFF code sequence is only feasible when the shared object is available at program start, in which case you can guarantee that “you would need the TLS areas of all the shared libraries to be allocated contiguously”
Indeed I didn’t mention -ftls-model=initial-exec originally (I now added it based on reader feedback; it can work when it will work, which for my use case is a toss-up I guess…), but my point is that you could allocate the TLSes contiguously even if dlopen was used, and I describe how you could do it in the post, albeit in a somewhat hand-wavy way. This is totally not how things were done and I presume one reason is that you don’t carve out chunks of the address space for a use case like this as described in my approach - I just think it would be nice if things worked that way.
Actually sounds like it isn't mandated by the standard after all; it's mandated for block thread_locals but not for thread_locals in the global scope:
3.7.2/2 [basic.stc.thread]: A variable with thread storage duration shall be initialized before its first odr-use (3.2) and, if constructed, shall be destroyed on thread exit.
This allows the constructor to be called at any point before the first use, similarly to "normal" globals, though implementations made different tradeoffs in these 2 cases
I never investigated TLS perf nearly this deeply, but in a previous DB project we could only salvage perf by banning thread_local in favor of __thread, so it was impossible for TLS to reference anything but a bare scalar or pointer. We just rolled up all our TLS objects into one struct, heap-allocated that struct in the thread routine, and put a pointer to that struct in TLS. It was easy to write trivial accessors to these objects through the TLS pointer that were always inlined.
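A sketch of that pattern (names are illustrative, not the actual project code):

```
#include <cstdint>

struct ThreadState {       // all per-thread objects rolled into one struct
    uint64_t counters[8];
    char scratch[256];
};

static __thread ThreadState *tls_state;  // bare pointer: the cheap TLS form

static void *thread_routine(void *) {
    tls_state = new ThreadState{};       // heap-allocate on thread entry
    // ... the thread's work ...
    delete tls_state;
    return nullptr;
}

static inline uint64_t &counter(int i) {  // trivially inlined accessor
    return tls_state->counters[i];
}
```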
The way we do TLS now is too fiddly and accommodating of legacy code. Want to keep TLS simple and fast? Treat it like an arena.
1. On each thread startup, including the main thread, carve out a huge chunk of address space, say 1GB, for that thread's TLS arena
2. On dlopen (or main program startup), allocate each loaded DSO's TLS out of the arena. Fail in the unlikely case that we run out of address space
3. On dlclose, recover committed memory using MADV_FREE on any now-unused parts of the TLS arena (once for each thread)
4. To access a thread local, pull the offset of the variable out of the GOT and offset into the per-thread arena. Nice and simple.
Does this approach waste address space? You bet. Will it work on 32-bit systems? Absolutely not. Is it simple, fast, and robust? Yes.
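A rough sketch of steps 1-4 under the stated assumptions (64-bit only; names and the 1GB figure are illustrative, and a real loader would keep this state in the TCB rather than in __thread variables):

```
#include <cstddef>
#include <sys/mman.h>

constexpr size_t ARENA_SIZE = 1ull << 30;  // 1GB of address space per thread

static __thread char  *arena;      // per-thread arena base
static __thread size_t arena_used;

void arena_init() {  // step 1: on thread startup, reserve address space
    void *p = mmap(nullptr, ARENA_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    arena = (p == MAP_FAILED) ? nullptr : (char *)p;
    arena_used = 0;
}

void *arena_alloc(size_t size) {  // step 2: on dlopen, carve out a DSO's TLS
    if (arena_used + size > ARENA_SIZE) return nullptr;  // out of space
    void *p = arena + arena_used;
    arena_used += size;
    return p;
}

void arena_release(void *p, size_t size) {  // step 3: on dlclose
    madvise(p, size, MADV_FREE);  // return the pages, keep the reservation
}

// Step 4: accessing a thread local is then just arena base + GOT offset.
```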
Are there any example implementations of this?
Why aren't thread-local constructors called when the thread is created rather than doing a test on every variable access? Perhaps they wanted to have the OS-native pthread_create/CreateThread "work out of the box" without having to manually call `construct_thread_locals()` or whatever? That would be an insane tradeoff, especially considering `construct_thread_locals()` can be called automatically when using std::thread.
I think this, and all the other weirdness, is because usage of thread-local variables is pretty low. Probably somebody wanted to have "lightweight" threads doing work as quickly as possible accessing only local variables.
Edit: ah, see https://www.akkadia.org/drepper/tls.pdf which I was clued into by https://news.ycombinator.com/item?id=43079061
The situation is much more complicated than it sounds, because the question of where the variables are isn't known at compile time. It has to be done at link time. Which may include dynamic link time as a library is loaded. That all sounds fairly horrible.
All to handle static constructors. If you don't have thread-static variables, or they don't have constructors, it's much simpler.
I don't buy the static linking part - we already generate a single function in the linked binary that calls all static constructors from all translation units, the same thing can be done for TLS constructors (what I called construct_thread_locals() in parent), and force the user to call that manually on thread creation, or call it as part of std::thread.
Dynamic linking is a whole other beast, and maybe for thread_locals in a DLL it's acceptable (or the only possible way?) to construct them lazily - if you're using DLL+TLS+global constructors you deserve the pain.
The problem is that dynamic linking is the norm. If you're writing a specific application, that's one thing. But if you're writing some random C++ library then you usually have no idea whether it will be compiled into an executable or a shared library. If TLS is slow in either of those cases, then the only solution is to not use TLS at all.
Because that can be very expensive. If you spawn a thread which does a little work and exits, having it also first initialize every thread local from every loaded library, on the off chance that it might need them, would probably not be worth it.
Creating a thread is already very expensive, in a sane program your ten TLS constructors shouldn't come even close to the amount of bookkeeping that the kernel has to do to spin up a new thread.
Also creating a thread, doing little work and exiting immediately is a bad use of threads -- I don't think that everyone using thread_local should be punished to better support poorly written programs. Most threads are long-running and most programs spin up a bunch of worker threads at initialization time, I'd argue variable access time is way more important than thread startup time.
In a simple benchmark, creating a thread & waiting on it in a loop (pthread_create+pthread_join) takes 0.03ms = 30000ns, so running a bunch of constructors from a bunch of libraries could actually be a not-insignificant amount of time.
Do agree that thread-local read speed should be a lot more important, but it'd quite suck if the thread-usage approaches with "acceptable performance" varied dramatically depending on what libraries you (or something else) have unrelatedly loaded. (Though maybe other overheads would appear from other sources similarly, keeping this from being the most important slowdown, idk.)
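A minimal version of the create+join benchmark mentioned above (numbers will vary by system; link with -pthread):

```
#include <chrono>
#include <cstdio>
#include <pthread.h>

static void *noop(void *) { return nullptr; }

int main() {
    const int N = 1000;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i) {
        pthread_t t;
        pthread_create(&t, nullptr, noop, nullptr);
        pthread_join(t, nullptr);
    }
    auto t1 = std::chrono::steady_clock::now();
    std::printf("%.0f ns per create+join\n",
                std::chrono::duration<double, std::nano>(t1 - t0).count() / N);
}
```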
If you make use of dynamic linking namespaces and some sort of runtime, you can make it impossible to enter specific libraries in certain threads and avoid that cost. If your application is large enough that you can't keep track of deps, then you may want something like that in your application if you're not willing to split it into multiple processes.
> Creating a thread is already very expensive, in a sane program your ten TLS constructors shouldn't come even close to the amount of bookkeeping that the kernel has to do to spin up a new thread.
This is not true, at least on Linux. With appropriate settings (e.g. a small stack) thread creation can be extremely cheap.
> Is this a deliberate performance tradeoff, benefitting code with lots of thread_locals and starting threads constantly, with each thread using few of the thread_locals, and some thread_locals having slow constructors? But such code isn't great to begin with?
Actually, thread globals only being used in some threads is probably the common case. Remember that threads are not only created by the main executable but often also by libraries. Keeping per-thread overhead low until the TLS variables are actually needed sounds like a good design goal.
> funtrace sidesteps the TLS constructor problem by interposing pthread_create, and initializing its thread_locals in its pthread_create wrapper
Sounds extremely fragile. What if someone calls clone() directly, or another (possibly new) function to create a thread?
I dunno about extremely fragile; I recommend trying funtrace on your own code - pretty sure it will work out of the box! But what's less fragile than interposing pthread_create that doesn't add a load and a branch to every function call and return, which is where a tracing profiler will need to access TLS?..
In any case, one point of funtrace is that it's small (~1K LOC runtime) and you can tweak it easily, including calling its TLS init code from your threads if you don't create them with pthread_create "like most people" - even without this issue, people do "green threads" with setcontext/getcontext and who knows what else, which will need its own hacks to support, and so I think that an easily hackable runtime is a good thing given how hard it is to do this in a one-size-fits-all fashion.
As a counterexample, LLVM XRay, a tracing profiler from Google which is at least 10x bigger than funtrace (if you don't count the compiler instrumentation they introduced), took almost a decade to gain shared library support, and it's not done yet. So I think "small and hackable" has its advantages.
One way to avoid the extra code from constructors is to make them constexpr. This way the compiler can initialize the variable with 0 or a constant value and not generate all that extra code to run a constructor at runtime.
I'm not sure how many people interested in this article are also interested in C++ compile times, but I once measured this and wrote an article: https://bolinlang.com/wheres-my-compile-time
I haven't fully dug through the article, just skimmed it, but it looks like good info all around.
I'm also mostly familiar with Windows, and on Windows until recently (a few years ago), dynamically loading a .dll (with "LoadLibrary", the analogue of "dlopen") caused issues with the .dll's own thread_locals. Microsoft fixed this, but folks have observed slower code:
https://developercommunity.visualstudio.com/t/5x-performance...
To quote only the observed case there:
> In VS2017 15.9.26 this executes in ~270ms and with VS2019 16.7.1 it takes ~1450ms.
Here are the notes too - https://learn.microsoft.com/en-us/cpp/overview/cpp-conforman...
> dynamically loading (e.g. with "LoadLibrary", e.g. "dlopen") a .dll caused issue with the .dll's own thread_locals.
Do you know what these issues were? I'm curious because I'm working on Pd (https://en.wikipedia.org/wiki/Pure_Data), which uses lots of thread local variables internally when built as a multi-instance library. Libpd itself may be loaded dynamically when embedded in an audio plugin. I'm not aware of any problems so far...
It used to be that .dlls loaded by the .exe on startup (i.e. implicitly listed there) would get their thread-local vars (TLS) correctly, but dlls loaded later (via /DELAYLOAD or through LoadLibrary) would not (the workaround was to manage these through TlsAlloc/TlsFree, with a hook in DllMain to clean up).
But then Microsoft added /Zc:tlsGuards - https://learn.microsoft.com/en-us/cpp/build/reference/zc-tls... - which is now the default that fixes the issue, but with some significant performance penalty (e.g. the "bug" that I've listed).
I guess you can't have it both ways easily...
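For reference, the classic workaround looked roughly like this (a sketch; the slot name and 64-byte payload are illustrative) - an explicit TLS slot plus DllMain cleanup, which works even when the DLL arrives via LoadLibrary:

```
#include <windows.h>

static DWORD g_slot = TLS_OUT_OF_INDEXES;

BOOL WINAPI DllMain(HINSTANCE, DWORD reason, LPVOID) {
    switch (reason) {
    case DLL_PROCESS_ATTACH:
        g_slot = TlsAlloc();
        break;
    case DLL_THREAD_DETACH:  // per-thread cleanup hook
        if (void *p = TlsGetValue(g_slot))
            HeapFree(GetProcessHeap(), 0, p);
        break;
    case DLL_PROCESS_DETACH:
        TlsFree(g_slot);
        break;
    }
    return TRUE;
}

void *dll_state() {  // lazy per-thread initialization
    void *p = TlsGetValue(g_slot);
    if (!p) {
        p = HeapAlloc(GetProcessHeap(), HEAP_ZERO_MEMORY, 64);
        TlsSetValue(g_slot, p);
    }
    return p;
}
```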
On the clang/clang-cl side, there is https://clang.llvm.org/docs/ClangCommandLineReference.html#c...
to support this.
So check your compiler version and options :)
Also the notes posted here about CRT mixing might apply to you (not sure though) - https://learn.microsoft.com/en-us/cpp/porting/binary-compat-...
I work in the gamedev world, and plugins, FFI, delay-loaded dlls etc. are a constant pain that one needs to watch out for and solve issues around.
So this was only a MSVC bug? Most people compile Pd with MinGW, which would explain why we never ran into this issue.
Do you happen to have a link to the original MSVC bug report (i.e. the wrong thread locals, not the performance regression)?
Note that MinGW uses libwinpthread, which is known to have slow TLS behavior anyway (I've observed a 100% overhead compared to running the same program under WSL using a linux-native GCC). c.f. https://github.com/msys2/MINGW-packages/discussions/13259
I haven't looked into it, but going through the release notes for tlsGuards showed this - though not directly a bug report
https://learn.microsoft.com/en-us/cpp/overview/cpp-conforman...
and also the implementation in "clang" (for "clang-cl" being conformant with MSVC) - https://reviews.llvm.org/D115456#3217595
then last year clang-cl also added ways to disable this (if needed); probably this hit some internal issue and had to be resolved. Maybe "thread_local" has become more widely used (unlike the OS-specific "TlsAlloc").
Thanks! Fortunately, this issue does not affect us because our thread locals are all zero initialized integers or pointers.
So we can avoid a lot of this pain by not using constructors for thread_local variables, right? We can roll our own version of this using a thread_local bool, or just interpose thread spawning as you've done.
As far as why it's worse with fPIC/shared: there are a variety of TLS models: General Dynamic, Local Dynamic, Initial Exec, and Local Exec. And they have different constraints / generality. The more general models are slower. IIRC IE/LE won't work with shared library thread_locals, but it's been a while so don't quote me on that.
Generally agree that it seems like the compiler could in theory be doing a better job in some of these circumstances.
> I’m sure there’s some dirty trick or other, based on knowing the guts of libc and other such, which, while dirty, is going to work for a long time, and where you can reasonably safely detect if it stopped working and upgrade it for whatever changes the guts of libc will have undergone. If you have an idea, please share it!
Find the existing TLS allocations, hope there's spare space at the end of the last page, and just map your variables there using %fs-relative accesses?
Always fun to see another high performance tracing implementation. We do some similar things at work (thread-local ringbuffers), though we aren't doing function entry/leave tracing.
> in my microbenchmark I get <10 ns per instrumented call or return
A significant portion of that is rdtsc time, right? Like 50-80%. Writing a few bytes to a ringbuffer prefetched in local cache is very very cheap but rdtsc takes ~4-8 nanos.
> While we're on the subject of snapshots - you can get trace data from a core dump by loading funtrace_gdb.py from gdb
Nice. We trace into a shared memory segment and then map it at start time, emitting a pre-crash trace during (next) program start. Maybe makes more sense for our use case (a specific server that is auto-restarted by orchestration) than a more general tracing system.
> A significant portion of that is rdtsc time, right? Like 50-80%. Writing a few bytes to a ringbuffer prefetched in local cache is very very cheap but rdtsc takes ~4-8 nanos.
Exactly right, though it seems totally insane to me coming from an embedded background where you get the cycle count in less than 1 ns but writing to the buffer would be "the" performance problem (and then you could eg avoid writing short calls as measured at runtime, but on x86 you will have spent too much time on the rdtsc for this to lower the overhead.) There's also RDPMC but it's not much faster and you need permissions(tm) to use it, plus it stops counting on various occasions which I never fully understood.
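For reference, the timestamp read in question is a single intrinsic (a sketch; funtrace's actual code may differ):

```
#include <x86intrin.h>
#include <cstdint>

static inline uint64_t now_cycles() {
    return __rdtsc();  // non-serializing; several ns on typical x86-64 parts
}
```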
Regarding prefetching - what do you do prefetching-wise that helps performance?.. All my attempts to do better than the simplest store instructions did nothing to improve performance (I tried prefetchw/__builtin_prefetch, movntq/_mm_stream_pi and vmovntdq/_mm_stream_si128; all of them either didn't help or made things even slower).
> Regarding prefetching - what do you do perfetching-wise that helps performance?
Absolutely nothing -- the CPU internally just does a very good job predicting the ringbuffer write pattern, for obvious reasons.
An old joke:
I made an elegant program that counts from zero to zero. It took me a week.
This is missing out on considerations re. tls-model=local-dynamic and visibility; making variables non-preemptible by declaring them static or visibility("hidden"/"internal") significantly reduces TLS cost.
https://www.akkadia.org/drepper/tls.pdf is the go-to reference for TLS models, cf. first paragraph of each subsection of section 4.
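A sketch of what "non-preemptible" looks like in source (the effect on codegen varies by compiler and target, as discussed below):

```
static thread_local int tls_a;  // internal linkage: cannot be interposed
__attribute__((visibility("hidden"))) thread_local int tls_b;  // hidden visibility
// or compile the whole TU with -fvisibility=hidden and/or -ftls-model=local-dynamic
```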
-ftls-model=local-dynamic barely changes the generated code compared to the default (global-dynamic); you no longer have data16 prefixes around the call to __tls_get_addr (which means this call can't be replaced with something cheaper if you don't put the .o into a .so in the end) but otherwise it's the same.
-fvisibility=hidden / __attribute__ ((visibility ("hidden"))) doesn't seem to do anything.
Are you sure that either of these speeds up TLS access for thread_locals defined in shared libraries?
If I remember correctly, local-dynamic is relevant if you access multiple thread-local variables, as the offset between them will be constant. The visibility attributes should allow the compiler to switch from global-dynamic to local-dynamic automatically. It's been a bit since I looked at all of this.
Also note that these sequences are highly architecture dependent and cost as well as cost differences will vary e.g. for ARM or ARM64.
I updated the post with the point on visibility; in my tests inspired by https://lobste.rs/s/b5dnjh/0_0_0_c_thread_local_storage_perf... I see that clang improves codegen thanks to hidden visibility but g++ does not, and that comment says clang doesn't do it on all platforms.
I stick to the broad conclusion that thread_locals without constructors linked into an executable rather than a shared library are the fastest and most performance-portable by far, but the visibility point is very worth mentioning.
As well as other costs. That we default to global visibility is another one of the ossified defaults the article mentions. One of the first things I look at when trying to improve the performance of a C++ project is its linking setup --- most of the time, people are using the terrible ELF defaults and paying in code size and execution time for something they don't need.
It's because of stuff like this that it's so important to regularly break ABI.
I can confirm that C (GCC and Clang) will generate this:
But they did not inline the function that does that, even with `__attribute__((always_inline))` and link-time optimization. I need to investigate that.
Is this from a shared library (compiled with -fPIC and linked with -shared) though? That's where the big troubles start, and I think C isn't better than C++ here
No, this is an executable without `-fPIC`.
I am grateful that I only use one multi-call executable (built from a monorepo made partially because of your post, btw) because I think shared libraries in C might have the same sort of problems, just without constructors.
One more entitled developer claiming that something is slow but who can't really prove it, because he can't measure it, nor does he have any other working code to compare against to show how blazingly fast it could be. Gotcha.
Tip to the author: you can implement a lock-free buffer pool cache for as many threads as you need in ~200 LoC. Another tip: it's not gonna be faster, but you may get more control over the resources.
At the bottom of the article the author provides links to regression reports. In one workload calculating the thread local address was taking 10% of execution time up from 1%. The burden of proof would seem to be on you.
No, that's not how it works. The burden of proof is on the one who is claiming the regression or design flaw, and in this case I am not sure what exactly the author can even claim. He did not run a single experiment showing that a non-inlined ctor call is creating runtime pressure on his fancy profiler. Not a single experiment. Not a single data point. This is a pure form of cargo-cult programming, not engineering.
The unfortunate effect of this and similar blogs is that tomorrow you will have to deal with someone at your work who happened to read this and similar material on the web and will take it for granted. He will take it further and implement an "optimization", yet when you ask him to demonstrate the problem he was solving you will get a big nothing. Just as in this case.
You didn't read the article, did you. Not to mention being spectacularly rude about it.
No, I actually read the article, and I don't think I was spectacularly rude. I was rather just straight to the point. And rather than bitching "omg, why, oh why did the compiler turn a ctor into a function call here?!", what I think I gave is a constructive suggestion for how to fix the problem, if there even was one to begin with.
What I see here in the article is pure theorizing and dwelling over details which in 99.9% of cases do _not_ matter. There wasn't a single experiment shown by the author demonstrating that his use case is in the remaining 0.1%.
I was basically done with reading the article when I read the following nonsense
> accessing an extern thread_local with a constructor involves a function call, with the function testing the guard variable of the translation unit where the thread_local is defined.
And then
> But with inlining, the fast path is quite fast on a good processor.
Right, the fast path of an inlined ctor call vs the slow path of a non-inlined ctor call. Really? I call BS on this, especially without experiments showing the data.
Maybe you should write your own blog post that counters the author’s so that others can learn from your experience.
I have no such aspirations. I comment here when I have time, to learn something, not to prove something to somebody else.