# Android ELF TLS (Draft)

Internal links:
 * [go/android-elf-tls](http://go/android-elf-tls)
 * [One-pager](https://docs.google.com/document/d/1leyPTnwSs24P2LGiqnU6HetnN5YnDlZkihigi6qdf_M)
 * Tracking bugs: http://b/110100012, http://b/78026329

[TOC]

# Overview

ELF TLS is a system for automatically allocating thread-local variables with cooperation among the
compiler, linker, dynamic loader, and libc.

Thread-local variables are declared in C and C++ with a specifier, e.g.:

```cpp
thread_local int tls_var;
```

At run-time, TLS variables are allocated on a module-by-module basis, where a module is a shared
object or executable. At program startup, TLS for all initially-loaded modules comprises the "Static
TLS Block". TLS variables within the Static TLS Block exist at fixed offsets from an
architecture-specific thread pointer (TP) and can be accessed very efficiently -- typically just a
few instructions. TLS variables belonging to dlopen'ed shared objects, on the other hand, may be
allocated lazily, and accessing them typically requires a function call.

# Thread-Specific Memory Layout

Ulrich Drepper's ELF TLS document specifies two ways of organizing memory pointed at by the
architecture-specific thread-pointer ([`__get_tls()`] in Bionic):

![TLS Variant 1 Layout](img/tls-variant1.png)

![TLS Variant 2 Layout](img/tls-variant2.png)

Variant 1 places the static TLS block after the TP, whereas variant 2 places it before the TP.
According to Drepper, variant 2 was motivated by backwards compatibility, and variant 1 was designed
for Itanium. The choice has effects on the toolchain, loader, and libc. In particular, when linking
an executable, the linker needs to know where an executable's TLS segment is relative to the TP so
it can correctly relocate TLS accesses.
Both variants are incompatible with Bionic's current thread-specific data layout, but variant 1 is
more problematic than variant 2.

Each thread has a "Dynamic Thread Vector" (DTV) with a pointer to each module's TLS block (or NULL
if it hasn't been allocated yet). If the executable has a TLS segment, then it will always be module
1, and its storage will always be immediately after (or before) the TP. In variant 1, the TP is
expected to point immediately at the DTV pointer, whereas in variant 2, the DTV pointer's offset
from TP is implementation-defined.

The DTV's "generation" field is used to lazily update/reallocate the DTV when new modules are loaded
or unloaded.

[`__get_tls()`]: https://android.googlesource.com/platform/bionic/+/7245c082658182c15d2a423fe770388fec707cbc/libc/private/__get_tls.h

# Access Models

When a C/C++ file references a TLS variable, the toolchain generates instructions to find its
address using a TLS "access model". The access models trade generality against efficiency. The four
models are:

 * GD: General Dynamic (aka Global Dynamic)
 * LD: Local Dynamic
 * IE: Initial Exec
 * LE: Local Exec

A TLS variable may be in a different module than the reference.

## General Dynamic (or Global Dynamic) (GD)

A GD access can refer to a TLS variable anywhere. To access a variable `tls_var` using the
"traditional" non-TLSDESC design described in Drepper's TLS document, the compiler emits a call to a
`__tls_get_addr` function provided by libc.

For example, if we have this C code in a shared object:

```cpp
extern thread_local char tls_var;
char* get_tls_var() {
  return &tls_var;
}
```

The toolchain generates code like this:

```cpp
struct TlsIndex {
  long module; // starts counting at 1
  long offset;
};

char* get_tls_var() {
  static TlsIndex tls_var_idx = { // allocated in the .got
    R_TLS_DTPMOD(tls_var), // dynamic TP module ID
    R_TLS_DTPOFF(tls_var), // dynamic TP offset
  };
  return (char*)__tls_get_addr(&tls_var_idx);
}
```

`R_TLS_DTPMOD` is a dynamic relocation to the index of the module containing `tls_var`, and
`R_TLS_DTPOFF` is a dynamic relocation to the offset of `tls_var` within its module's `PT_TLS`
segment.

`__tls_get_addr` looks up `TlsIndex::module`'s entry in the DTV and adds `TlsIndex::offset` to
the module's TLS block. Before it can do this, it ensures that the module's TLS block is allocated.
A simple approach is to allocate memory lazily:

1. If the current thread's DTV generation count is less than the current global TLS generation, then
   `__tls_get_addr` may reallocate the DTV or free blocks for unloaded modules.

2. If the DTV's entry for the given module is `NULL`, then `__tls_get_addr` allocates the module's
   memory.

If an allocation fails, `__tls_get_addr` calls `abort` (like emutls).

musl, on the other hand, preallocates TLS memory in `pthread_create` and in `dlopen`, and each can
report out-of-memory.

## Local Dynamic (LD)

LD is a specialization of GD that's useful when a function has references to two or more TLS
variables that are all part of the same module as the reference. Instead of a call to
`__tls_get_addr` for each variable, the compiler calls `__tls_get_addr` once to get the current
module's TLS block, then adds each variable's DTPOFF to the result.

For example, suppose we have this C code:

```cpp
static thread_local int x;
static thread_local int y;
int sum() {
  return x + y;
}
```

The toolchain generates code like this:

```cpp
int sum() {
  static TlsIndex tls_module_idx = { // allocated in the .got
    // a dynamic relocation against symbol 0 => current module ID
    R_TLS_DTPMOD(NULL),
    0,
  };
  char* base = (char*)__tls_get_addr(&tls_module_idx);
  // These R_TLS_DTPOFF() relocations are resolved at link-time.
  int* px = (int*)(base + R_TLS_DTPOFF(x));
  int* py = (int*)(base + R_TLS_DTPOFF(y));
  return *px + *py;
}
```

(XXX: LD might be important for C++ `thread_local` variables -- even a single `thread_local`
variable with a dynamic initializer has an associated TLS guard variable.)

## Initial Exec (IE)

If the variable is part of the Static TLS Block (i.e. the executable or an initially-loaded shared
object), then its offset from the TP is known at load-time. The variable can be accessed with a few
loads.

Example: a C file for an executable:

```cpp
// tls_var could be defined in the executable, or it could be defined
// in a shared object the executable links against.
extern thread_local char tls_var;
char* get_addr() { return &tls_var; }
```

Compiles to:

```cpp
// allocated in the .got, resolved at load-time with a dynamic reloc.
// Unlike DTPOFF, which is relative to the start of the module’s block,
// TPOFF is directly relative to the thread pointer.
static long tls_var_gotoff = R_TLS_TPOFF(tls_var);

char* get_addr() {
  return (char*)__get_tls() + tls_var_gotoff;
}
```

## Local Exec (LE)

LE is a specialization of IE. If the variable is not just part of the Static TLS Block, but is also
part of the executable (and referenced from the executable), then a GOT access can be avoided.
The IE example compiles to:

```cpp
char* get_addr() {
  // R_TLS_TPOFF() is resolved at (static) link-time
  return (char*)__get_tls() + R_TLS_TPOFF(tls_var);
}
```

## Selecting an Access Model

The compiler selects an access model for each variable reference using these factors:
 * The absence of `-fpic` implies an executable, so use IE/LE.
 * Code compiled with `-fpic` could be in a shared object, so use GD/LD.
 * The per-file default can be overridden with `-ftls-model=<model>`.
 * Specifiers on the variable (`static`, `extern`, ELF visibility attributes).
 * A variable can be annotated with `__attribute__((tls_model(...)))`. Clang may still use a more
   efficient model than the one specified.

# Shared Objects with Static TLS

Shared objects are sometimes compiled with `-ftls-model=initial-exec` (i.e. "static TLS") for better
performance. On Ubuntu, for example, `libc.so.6` and `libOpenGL.so.0` are compiled this way. Shared
objects using static TLS can't be loaded with `dlopen` unless libc has reserved enough surplus
memory in the static TLS block. glibc reserves a kilobyte or two (`TLS_STATIC_SURPLUS`) with the
intent that only a few core system libraries would use static TLS. Non-core libraries also sometimes
use it, which can break `dlopen` if the surplus area is exhausted. See:
 * https://bugzilla.redhat.com/show_bug.cgi?id=1124987
 * web search: [`"dlopen: cannot load any more object with static TLS"`][glibc-static-tls-error]

Neither musl nor the Bionic TLS prototype currently allocate any surplus TLS memory.

In general, supporting surplus TLS memory probably requires maintaining a thread list so that
`dlopen` can initialize the new static TLS memory in all existing threads. A thread list could be
omitted if the loader only allowed zero-initialized TLS segments and didn't reclaim memory on
`dlclose`.
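
To make the surplus mechanism concrete, here is a minimal sketch of a fixed-size surplus area that
is carved up as static-TLS libraries are loaded. It is an illustration only: the `StaticTlsSurplus`
class, its method names, and the capacity used in the note below are hypothetical, and it follows
the simplified scheme from the previous paragraph (memory is never reclaimed) rather than glibc's
actual implementation.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model of a fixed surplus area reserved inside the static TLS
// block. This is not glibc's or Bionic's code.
class StaticTlsSurplus {
 public:
  explicit StaticTlsSurplus(std::size_t capacity) : capacity_(capacity), used_(0) {}

  // Tries to reserve `size` bytes aligned to `align` (which must be a power of
  // two). On success, stores the offset within the surplus area in `*offset`
  // and returns true. Returns false when the surplus is exhausted, which is
  // the situation where dlopen of a static-TLS library would have to fail.
  bool reserve(std::size_t size, std::size_t align, std::size_t* offset) {
    assert(align != 0 && (align & (align - 1)) == 0);
    std::size_t start = (used_ + align - 1) & ~(align - 1);  // round up to align
    if (start > capacity_ || size > capacity_ - start) return false;
    used_ = start + size;
    *offset = start;
    return true;
  }

 private:
  std::size_t capacity_;
  std::size_t used_;  // only grows: nothing is reclaimed on dlclose
};
```

With an assumed 2048-byte surplus, a 1024-byte segment followed by a 512-byte segment fits, and a
further 1024-byte request fails. That failure corresponds to the "cannot load any more object with
static TLS" error linked above.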

As long as a shared object is one of the initially-loaded modules, a better option is to use
TLSDESC.

[glibc-static-tls-error]: https://www.google.com/search?q=%22dlopen:+cannot+load+any+more+object+with+static+TLS%22

# TLS Descriptors (TLSDESC)

The code fragments above match the "traditional" TLS design from Drepper's document. For the GD and
LD models, there is a newer, more efficient design that uses "TLS descriptors". Each TLS variable
reference has a corresponding descriptor, which contains a resolver function address and an argument
to pass to the resolver.

For example, if we have this C code in a shared object:

```cpp
extern thread_local char tls_var;
char* get_tls_var() {
  return &tls_var;
}
```

The toolchain generates code like this:

```cpp
struct TlsDescriptor { // NB: arm32 reverses these fields
  long (*resolver)(long);
  long arg;
};

char* get_tls_var() {
  // allocated in the .got, uses a dynamic relocation
  static TlsDescriptor desc = R_TLS_DESC(tls_var);
  return (char*)__get_tls() + desc.resolver(desc.arg);
}
```

The dynamic loader fills in the TLS descriptors. For a reference to a variable allocated in the
Static TLS Block, it can use a simple resolver function:

```cpp
long static_tls_resolver(long arg) {
  return arg;
}
```

The loader writes `tls_var@TPOFF` into the descriptor's argument.

To support modules loaded with `dlopen`, the loader must use a resolver function that calls
`__tls_get_addr`.
In principle, this simple implementation would work:

```cpp
long dynamic_tls_resolver(TlsIndex* arg) {
  return (long)__tls_get_addr(arg) - (long)__get_tls();
}
```

There are optimizations that complicate the design a little:
 * Unlike `__tls_get_addr`, the resolver function has a special calling convention that preserves
   almost all registers, reducing register pressure in the caller
   ([example](https://godbolt.org/g/gywcxk)).
 * In general, the resolver function must call `__tls_get_addr`, so it must save and restore all
   registers.
 * To keep the fast path fast, the resolver inlines the fast path of `__tls_get_addr`.
 * By storing the module's initial generation alongside the TlsIndex, the resolver function doesn't
   need to use an atomic or synchronized access of the global TLS generation counter.

The resolver must be written in assembly, but in C, the function looks like so:

```cpp
struct TlsDescDynamicArg {
  unsigned long first_generation;
  TlsIndex idx;
};

struct TlsDtv { // DTV == dynamic thread vector
  unsigned long generation;
  char* modules[];
};

long dynamic_tls_resolver(TlsDescDynamicArg* arg) {
  TlsDtv* dtv = __get_dtv();
  char* addr;
  if (dtv->generation >= arg->first_generation &&
      dtv->modules[arg->idx.module] != nullptr) {
    addr = dtv->modules[arg->idx.module] + arg->idx.offset;
  } else {
    addr = __tls_get_addr(&arg->idx);
  }
  return (long)addr - (long)__get_tls();
}
```

The loader needs to allocate a table of `TlsDescDynamicArg` objects for each TLS module with dynamic
TLSDESC relocations.

The static linker can still relax a TLSDESC-based access to an IE/LE access.

The traditional TLS design is implemented everywhere, but the TLSDESC design has less toolchain
support:
 * GCC and the BFD linker support both designs on all supported Android architectures (arm32, arm64,
   x86, x86-64).
 * GCC can select the design on the command line using `-mtls-dialect=<dialect>` (`trad`-vs-`desc`
   on arm64, otherwise `gnu`-vs-`gnu2`). Clang always uses the default mode.
 * GCC and Clang default to TLSDESC on arm64 and the traditional design on other architectures.
 * Gold and LLD support for TLSDESC is spotty (except when targeting arm64).

# Linker Relaxations

The (static) linker frequently has more information about the location of a referenced TLS variable
than the compiler, so it can "relax" TLS accesses to more efficient models. For example, if an
object file compiled with `-fpic` is linked into an executable, the linker could relax GD accesses
to IE or LE. To relax a TLS access, the linker looks for an expected sequence of instructions and
static relocations, then replaces the sequence with a different one of equal size. It may need to
add or remove no-op instructions.

## Current Support for GD->LE Relaxations Across Linkers

Versions tested:
 * BFD and Gold linkers: version 2.30
 * LLD version 6.0.0 (upstream)

Linker support for GD->LE relaxation with `-mtls-dialect=gnu/trad` (traditional):

Architecture    | BFD | Gold | LLD
--------------- | --- | ---- | ---
arm32           | no  | no   | no
arm64 (unusual) | yes | yes  | no
x86             | yes | yes  | yes
x86_64          | yes | yes  | yes

Linker support for GD->LE relaxation with `-mtls-dialect=gnu2/desc` (TLSDESC):

Architecture          | BFD | Gold               | LLD
--------------------- | --- | ------------------ | ------------------
arm32 (experimental)  | yes | unsupported relocs | unsupported relocs
arm64                 | yes | yes                | yes
x86 (experimental)    | yes | yes                | unsupported relocs
x86_64 (experimental) | yes | yes                | unsupported relocs

arm32 linkers can't relax traditional TLS accesses. BFD can relax an arm32 TLSDESC access, but LLD
can't link code using TLSDESC at all, except on arm64, where it's used by default.

# dlsym

Calling `dlsym` on a TLS variable returns the address of the current thread's variable.

# Debugger Support

## gdb

gdb uses a libthread_db plugin library to retrieve thread-related information from a target. This
library is typically a shared object, but for Android, we link our own `libthread_db.a` into
gdbserver. We will need to implement at least 2 APIs in `libthread_db.a` to find TLS variables, and
gdb provides APIs for looking up symbols, reading or writing memory, and retrieving the current
thread pointer (e.g. `ps_get_thread_area`).
 * Reference: [gdb_proc_service.h]: APIs gdb provides to libthread_db
 * Reference: [Currently unimplemented TLS functions in Android's libthread_db][libthread_db.c]

[gdb_proc_service.h]: https://android.googlesource.com/toolchain/gdb/+/a7e49fd02c21a496095c828841f209eef8ae2985/gdb-8.0.1/gdb/gdb_proc_service.h#41
[libthread_db.c]: https://android.googlesource.com/platform/ndk/+/e1f0ad12fc317c0ca3183529cc9625d3f084d981/sources/android/libthread_db/libthread_db.c#115

## LLDB

LLDB more-or-less implemented Linux TLS debugging in [r192922][rL192922] ([D1944]) for x86 and
x86-64. [arm64 support came later][D5073]. However, the Linux TLS functionality no longer does
anything: the `GetThreadPointer` function is no longer implemented. Code for reading the thread
pointer was removed in [D10661] ([this function][r240543]). (arm32 was apparently never supported.)

[rL192922]: https://reviews.llvm.org/rL192922
[D1944]: https://reviews.llvm.org/D1944
[D5073]: https://reviews.llvm.org/D5073
[D10661]: https://reviews.llvm.org/D10661
[r240543]: https://github.com/llvm-mirror/lldb/commit/79246050b0f8d6b54acb5366f153d07f235d2780#diff-52dee3d148892cccfcdab28bc2165548L962

## Threading Library Metadata

Both debuggers need metadata from the threading library (`libc.so` / `libpthread.so`) to find TLS
variables.
From [LLDB r192922][rL192922]'s commit message:

> ... All OSes use basically the same algorithm (a per-module lookup table) as detailed in Ulrich
> Drepper's TLS ELF ABI document, so we can easily write code to decode it ourselves. The only
> question therefore is the exact field layouts required. Happily, the implementors of libpthread
> expose the structure of the DTV via metadata exported as symbols from the .so itself, designed
> exactly for this kind of thing. So this patch simply reads that metadata in, and re-implements
> libthread_db's algorithm itself. We thereby get cross-platform TLS lookup without either requiring
> third-party libraries, while still being independent of the version of libpthread being used.

LLDB uses these variables:

Name                              | Notes
--------------------------------- | ---------------------------------------------------------------------------------------
`_thread_db_pthread_dtvp`         | Offset from TP to DTV pointer (0 for variant 1, implementation-defined for variant 2)
`_thread_db_dtv_dtv`              | Size of a DTV slot (typically/always sizeof(void*))
`_thread_db_dtv_t_pointer_val`    | Offset within a DTV slot to the pointer to the allocated TLS block (typically/always 0)
`_thread_db_link_map_l_tls_modid` | Offset of a `link_map` field containing the module's 1-based TLS module ID

The metadata variables are local symbols in glibc's `libpthread.so` symbol table (but not its
dynamic symbol table). Debuggers can access them, but applications can't.

The debugger lookup process is straightforward:
 * Find the `link_map` object and module-relative offset for a TLS variable.
 * Use `_thread_db_link_map_l_tls_modid` to find the TLS variable's module ID.
 * Read the target thread pointer.
 * Use `_thread_db_pthread_dtvp` to find the thread's DTV.
 * Use `_thread_db_dtv_dtv` and `_thread_db_dtv_t_pointer_val` to find the desired module's block
   within the DTV.
 * Add the module-relative offset to the module pointer.

This process doesn't appear robust in the face of lazy DTV initialization -- presumably it could
read past the end of an out-of-date DTV or access an unloaded module. To be robust, it needs to
compare a module's initial generation count against the DTV's generation count. (XXX: Does gdb have
these sorts of problems with glibc's libpthread?)

## Reading the Thread Pointer with Ptrace

There are ptrace interfaces for reading the thread pointer for each of arm32, arm64, x86, and x86-64
(XXX: check 32-vs-64-bit for inferiors, debuggers, and kernels):
 * arm32: `PTRACE_GET_THREAD_AREA`
 * arm64: `PTRACE_GETREGSET`, `NT_ARM_TLS`
 * x86_32: `PTRACE_GET_THREAD_AREA`
 * x86_64: use `PTRACE_PEEKUSER` to read the `{fs,gs}_base` fields of `user_regs_struct`

# C/C++ Specifiers

C/C++ TLS variables are declared with a specifier:

Specifier       | Notes
--------------- | -----------------------------------------------------------------------------------------------------------------------------
`__thread`      | - non-standard, but ubiquitous in GCC and Clang<br/> - cannot have dynamic initialization or destruction
`_Thread_local` | - a keyword standardized in C11<br/> - cannot have dynamic initialization or destruction
`thread_local`  | - C11: a macro for `_Thread_local` via `threads.h`<br/> - C++11: a keyword, allows dynamic initialization and/or destruction

The dynamic initialization and destruction of C++ `thread_local` variables is layered on top of ELF
TLS (or emutls), so this design document mostly ignores it. Like emutls, ELF TLS variables either
have a static initializer or are zero-initialized.

Aside: Because a `__thread` variable cannot have dynamic initialization, `__thread` is more
efficient in C++ than `thread_local` when the compiler cannot see the definition of a declared TLS
variable.
The compiler assumes the variable could have a dynamic initializer and generates code, at
each access, to call a function to initialize the variable.

# Graceful Failure on Old Platforms

ELF TLS isn't implemented on older Android platforms, so dynamic executables and shared objects
using it generally won't work on them. Ideally, the older platforms would reject these binaries
rather than experience memory corruption at run-time.

Static executables aren't a problem--the necessary runtime support is part of the executable, so TLS
just works.

XXX: Shared objects are less of a problem.
 * On arm32, x86, and x86_64, the loader [should reject a TLS relocation]. (XXX: I haven't verified
   this.)
 * On arm64, the primary TLS relocation (R_AARCH64_TLSDESC) is [confused with an obsolete
   R_AARCH64_TLS_DTPREL32 relocation][R_AARCH64_TLS_DTPREL32] and is [quietly ignored].
 * Android P [added compatibility checks] for TLS symbols and `DT_TLSDESC_{GOT|PLT}` entries.

XXX: A dynamic executable using ELF TLS would have a PT_TLS segment and no other distinguishing
marks, so running it on an older platform would result in memory corruption. Should we add something
to these executables that only newer platforms recognize? (e.g. maybe an entry in .dynamic, a
reference to a symbol only a new libc.so has...)

[should reject a TLS relocation]: https://android.googlesource.com/platform/bionic/+/android-8.1.0_r48/linker/linker.cpp#2852
[R_AARCH64_TLS_DTPREL32]: https://android-review.googlesource.com/c/platform/bionic/+/723696
[quietly ignored]: https://android.googlesource.com/platform/bionic/+/android-8.1.0_r48/linker/linker.cpp#2784
[added compatibility checks]: https://android-review.googlesource.com/c/platform/bionic/+/648760

# Bionic Prototype Notes

There is an [ELF TLS prototype] uploaded on Gerrit.
It implements:
 * Static TLS Block allocation for static and dynamic executables
 * TLS for dynamically loaded and unloaded modules (`__tls_get_addr`)
 * TLSDESC for arm64 only

Missing:
 * `dlsym` of a TLS variable
 * debugger support

[ELF TLS prototype]: https://android-review.googlesource.com/q/topic:%22elf-tls-prototype%22+(status:open%20OR%20status:merged)

## Loader/libc Communication

The loader exposes a list of TLS modules ([`struct TlsModules`][TlsModules]) to `libc.so` using the
`__libc_shared_globals` variable (see `tls_modules()` in [linker_tls.cpp][tls_modules-linker] and
[elf_tls.cpp][tls_modules-libc]). `__tls_get_addr` in libc.so acquires the `TlsModules::mutex` and
iterates its module list to lazily allocate and free TLS blocks.

[TlsModules]: https://android-review.googlesource.com/c/platform/bionic/+/723698/1/libc/bionic/elf_tls.h#53
[tls_modules-linker]: https://android-review.googlesource.com/c/platform/bionic/+/723698/1/linker/linker_tls.cpp#45
[tls_modules-libc]: https://android-review.googlesource.com/c/platform/bionic/+/723698/1/libc/bionic/elf_tls.cpp#49

## TLS Allocator

The prototype currently allocates a `pthread_internal_t` object and static TLS in a single mmap'ed
region, along with a thread's stack if it needs one allocated. It doesn't place TLS memory on a
preallocated stack (either the main thread's stack or one provided with `pthread_attr_setstack`).

The DTV and blocks for dlopen'ed modules are instead allocated using the Bionic loader's
`LinkerMemoryAllocator`, adapted to avoid the STL and to provide `memalign`. The prototype tries to
achieve async-signal safety by blocking signals and acquiring a lock.
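
The "block signals, then lock" pattern can be sketched as an RAII guard. This is only an
illustration of the general technique and is not taken from the prototype: the
`ScopedSignalBlockedLock` class and the `is_signal_blocked` helper are hypothetical names, and
`std::mutex` stands in for whatever lock the allocator really uses.

```cpp
#include <cassert>
#include <mutex>
#include <pthread.h>
#include <signal.h>

// Hypothetical RAII guard (not the prototype's code): blocks every blockable
// signal on the calling thread, then takes the lock. While the guard is alive,
// a signal handler cannot run on this thread and re-enter the allocator.
class ScopedSignalBlockedLock {
 public:
  explicit ScopedSignalBlockedLock(std::mutex& mutex) : mutex_(mutex) {
    sigset_t all;
    sigfillset(&all);
    // Block signals before locking, so no handler can interrupt this thread
    // between acquiring the lock and releasing it.
    int rc = pthread_sigmask(SIG_BLOCK, &all, &old_mask_);
    assert(rc == 0);
    (void)rc;
    mutex_.lock();
  }

  ~ScopedSignalBlockedLock() {
    // Unlock first, then restore the caller's original signal mask.
    mutex_.unlock();
    pthread_sigmask(SIG_SETMASK, &old_mask_, nullptr);
  }

  ScopedSignalBlockedLock(const ScopedSignalBlockedLock&) = delete;
  ScopedSignalBlockedLock& operator=(const ScopedSignalBlockedLock&) = delete;

 private:
  std::mutex& mutex_;
  sigset_t old_mask_;
};

// Demonstration helper: reports whether `signo` is currently blocked on the
// calling thread. Passing a null set to pthread_sigmask only queries the mask.
bool is_signal_blocked(int signo) {
  sigset_t current;
  pthread_sigmask(SIG_BLOCK, nullptr, &current);
  return sigismember(&current, signo) == 1;
}
```

The ordering matters: if the lock were taken first, a handler arriving before the mask change could
call back into the allocator on the same thread and deadlock on the already-held lock.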

There are three "entry points" to dynamically locate a TLS variable's address:
 * libc.so: `__tls_get_addr`
 * loader: TLSDESC dynamic resolver
 * loader: dlsym

The loader's entry points need to call `__tls_get_addr`, which needs to allocate memory. Currently,
the prototype uses a [special function pointer] to call libc.so's `__tls_get_addr` from the loader.
(This should probably be removed.)

The prototype currently allows for arbitrarily-large TLS variable alignment. IIRC, different
implementations (glibc, musl, FreeBSD) vary in their level of respect for TLS alignment. It looks
like the Bionic loader ignores segments' alignment and aligns loaded libraries to 256 KiB. See
`ReserveAligned`.

[special function pointer]: https://android-review.googlesource.com/c/platform/bionic/+/723698/1/libc/private/bionic_globals.h#52

## Async-Signal Safety

The prototype's `__tls_get_addr` might be async-signal safe. Making it AS-safe is a good idea if
it's feasible. musl's function is AS-safe, but glibc's isn't (or wasn't). Google had a patch to make
glibc AS-safe back in 2012-2013. See:
 * https://sourceware.org/glibc/wiki/TLSandSignals
 * https://sourceware.org/ml/libc-alpha/2012-06/msg00335.html
 * https://sourceware.org/ml/libc-alpha/2013-09/msg00563.html

## Out-of-Memory Handling (abort)

The prototype lazily allocates TLS memory for dlopen'ed modules (see `__tls_get_addr`), and an
out-of-memory error on a TLS access aborts the process. musl, on the other hand, preallocates TLS
memory on `pthread_create` and `dlopen`, so either function can return out-of-memory. Both functions
probably need to acquire the same lock.

Maybe Bionic should do the same as musl? Perhaps musl's robustness argument doesn't hold for Bionic,
though, because Bionic (at least the linker) probably already aborts on OOM. musl doesn't support
`dlclose`/unloading, so it might have an easier time.

On the other hand, maybe lazy allocation is a feature, because not all threads will use a dlopen'ed
solib's TLS variables. Drepper makes this argument in his TLS document:

> In addition the run-time support should avoid creating the thread-local storage if it is not
> necessary. For instance, a loaded module might only be used by one thread of the many which make
> up the process. It would be a waste of memory and time to allocate the storage for all threads. A
> lazy method is wanted. This is not much extra burden since the requirement to handle dynamically
> loaded objects already requires recognizing storage which is not yet allocated. This is the only
> alternative to stopping all threads and allocating storage for all threads before letting them run
> again.

FWIW: emutls also aborts on out-of-memory.

## ELF TLS Not Usable in libc

The dynamic loader currently can't use ELF TLS, so any part of libc linked into the loader (i.e.
most of it) also can't use ELF TLS. It might be possible to lift this restriction, perhaps with
specialized `__tls_get_addr` and TLSDESC resolver functions.

# Open Issues

## Bionic Memory Layout Conflicts with Common TLS Layout

Bionic already allocates thread-specific data in a way that conflicts with TLS variants 1 and 2:
![Bionic TLS Layout in Android P](img/bionic-tls-layout-in-p.png)

TLS variant 1 allocates everything after the TP to ELF TLS (except the first two words), and variant
2 allocates everything before the TP. Bionic currently allocates memory before and after the TP to
the `pthread_internal_t` struct.

The `bionic_tls.h` header is marked with a warning:

```cpp
/** WARNING WARNING WARNING
 **
 ** This header file is *NOT* part of the public Bionic ABI/API
 ** and should not be used/included by user-serviceable parts of
 ** the system (e.g. applications).
 **
 ** It is only provided here for the benefit of the system dynamic
 ** linker and the OpenGL sub-system (which needs to access the
 ** pre-allocated slot directly for performance reason).
 **/
```

There are issues with rearranging this memory:

 * `TLS_SLOT_STACK_GUARD` is used for `-fstack-protector`. The location (word #5) was initially used
   by GCC on x86 (and x86-64), where it is compatible with x86's TLS variant 2. We [modified Clang
   to use this slot for arm64 in 2016][D18632], though, and the slot isn't compatible with ARM's
   variant 1 layout. This change shipped in NDK r14, and the NDK's build systems (ndk-build and the
   CMake toolchain file) enable `-fstack-protector-strong` by default.

 * `TLS_SLOT_TSAN` is used for more than just TSAN -- it's also used by [HWASAN and
   Scudo](https://reviews.llvm.org/D53906#1285002).

 * The Go runtime allocates a thread-local "g" variable on Android by creating a pthread key and
   searching for its TP-relative offset, which it assumes is nonnegative:
    * On arm32/arm64, it creates a pthread key, sets it to a magic value, then scans forward from
      the thread pointer looking for it. [The scan count was bumped to 384 to fix a reported
      breakage happening with Android N.](https://go-review.googlesource.com/c/go/+/38636) (XXX: I
      suspect the actual platform breakage happened with Android M's [lock-free pthread key
      work][bionic-lockfree-keys].)
    * On x86/x86-64, it uses a fixed offset from the thread pointer (TP+0xf8 or TP+0x1d0) and
      creates pthread keys until one of them hits the fixed offset.
    * CLs:
       * arm32: https://codereview.appspot.com/106380043
       * arm64: https://go-review.googlesource.com/c/go/+/17245
       * x86: https://go-review.googlesource.com/c/go/+/16678
       * x86-64: https://go-review.googlesource.com/c/go/+/15991
    * Moving the pthread keys before the thread pointer breaks Go-based apps.
    * It's unclear how many Android apps use Go. There are at least two with 1,000,000+ installs.
    * [Some motivation for Go's design][golang-post], [runtime/HACKING.md][go-hacking]
    * [On x86/x86-64 Darwin, Go uses a TLS slot reserved for both Go and Wine][go-darwin-x86] (On
      [arm32][go-darwin-arm32]/[arm64][go-darwin-arm64] Darwin, Go scans for pthread keys like it
      does on Android.)

 * Android's "native bridge" system allows the Zygote to load an app solib of a non-native ABI. (For
   example, it could be used to load an arm32 solib into an x86 Zygote.) The solib is translated
   into the host architecture. TLS accesses in the app solib (whether ELF TLS, Bionic slots, or
   `pthread_internal_t` fields) become host accesses. Laying out TLS memory differently across
   architectures could complicate this translation.

 * A `pthread_t` is practically just a `pthread_internal_t*`, and some apps directly access the
   `pthread_internal_t::tid` field. Past examples: http://b/17389248, [aosp/107467]. Reorganizing
   the initial `pthread_internal_t` fields could break those apps.

It seems easy to fix the incompatibility for variant 2 (x86 and x86_64) by splitting out the Bionic
slots into a new data structure. Variant 1 is a harder problem.

The TLS prototype currently uses a patched LLD that uses a variant 1 TLS layout with a 16-word TCB
on all architectures.

Aside: gcc's arm64ilp32 target uses a 32-bit unsigned offset for a TLS IE access
(https://godbolt.org/z/_NIXjF). If Android ever supports this target, and in a configuration with
variant 2 TLS, we might need to change the compiler to emit a sign-extending load.
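
To see why the extension mode matters for the aside above, the following sketch models the address
computation both ways. It assumes, for illustration only, that the 32-bit GOT slot is loaded and
then added to the thread pointer in a 64-bit register. The function names and the thread pointer
value are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Models a zero-extending 32-bit load of the GOT slot followed by a 64-bit add.
uint64_t tls_addr_zero_extended(uint64_t tp, uint32_t got_slot) {
  return tp + static_cast<uint64_t>(got_slot);
}

// Models a sign-extending 32-bit load of the same GOT slot.
uint64_t tls_addr_sign_extended(uint64_t tp, uint32_t got_slot) {
  int64_t offset = static_cast<int32_t>(got_slot);
  return tp + static_cast<uint64_t>(offset);
}

// Worked example with a negative TP-relative offset, as variant 2 would need.
void check_negative_offset_example() {
  const uint64_t tp = 0x10000000;                    // hypothetical thread pointer
  const uint32_t slot = static_cast<uint32_t>(-16);  // offset -16 stored as 0xFFFFFFF0

  // Sign extension yields TP - 16, the intended address.
  assert(tls_addr_sign_extended(tp, slot) == 0x0FFFFFF0ULL);
  // Zero extension yields TP + 2^32 - 16, which is outside the 32-bit range.
  assert(tls_addr_zero_extended(tp, slot) == 0x10FFFFFF0ULL);

  // For nonnegative offsets (variant 1), both loads agree.
  assert(tls_addr_sign_extended(tp, 16) == tls_addr_zero_extended(tp, 16));
}
```

The low 32 bits of the two results are identical, so the difference only shows up when the full
64-bit register value is used as an address.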

[D18632]: https://reviews.llvm.org/D18632
[bionic-lockfree-keys]: https://android-review.googlesource.com/c/platform/bionic/+/134202
[golang-post]: https://groups.google.com/forum/#!msg/golang-nuts/EhndTzcPJxQ/i-w7kAMfBQAJ
[go-hacking]: https://github.com/golang/go/blob/master/src/runtime/HACKING.md
[go-darwin-x86]: https://github.com/golang/go/issues/23617
[go-darwin-arm32]: https://github.com/golang/go/blob/15c106d99305411b587ec0d9e80c882e538c9d47/src/runtime/cgo/gcc_darwin_arm.c
[go-darwin-arm64]: https://github.com/golang/go/blob/15c106d99305411b587ec0d9e80c882e538c9d47/src/runtime/cgo/gcc_darwin_arm64.c
[aosp/107467]: https://android-review.googlesource.com/c/platform/bionic/+/107467

### Workaround: Use Variant 2 on arm32/arm64

Pros: simplifies Bionic

Cons:
 * arm64: requires either subtle reinterpretation of a TLS relocation or addition of a new
   relocation
 * arm64: a new TLS relocation reduces compiler/assembler compatibility with non-Android targets

The point of variant 2 was backwards-compatibility, and ARM Android needs to remain
backwards-compatible, so we could use variant 2 for ARM. Problems:

 * When linking an executable, the static linker needs to know how TLS is allocated because it
   writes TP-relative offsets for IE/LE-model accesses. Clang doesn't tell the linker to target
   Android, so it could pass a `--tls-variant2` flag to configure lld.

 * On arm64, there are different sets of static LE relocations accommodating different ranges of
   offsets from TP:

   Size | TP offset range   | Static LE relocation types
   ---- | ----------------- | ---------------------------------------
     12 | 0 <= x < 2^12     | `R_AARCH64_TLSLE_ADD_TPREL_LO12`
      " | "                 | `R_AARCH64_TLSLE_LDST8_TPREL_LO12`
      " | "                 | `R_AARCH64_TLSLE_LDST16_TPREL_LO12`
      " | "                 | `R_AARCH64_TLSLE_LDST32_TPREL_LO12`
      " | "                 | `R_AARCH64_TLSLE_LDST64_TPREL_LO12`
      " | "                 | `R_AARCH64_TLSLE_LDST128_TPREL_LO12`
     16 | -2^16 <= x < 2^16 | `R_AARCH64_TLSLE_MOVW_TPREL_G0`
     24 | 0 <= x < 2^24     | `R_AARCH64_TLSLE_ADD_TPREL_HI12`
      " | "                 | `R_AARCH64_TLSLE_ADD_TPREL_LO12_NC`
      " | "                 | `R_AARCH64_TLSLE_LDST8_TPREL_LO12_NC`
      " | "                 | `R_AARCH64_TLSLE_LDST16_TPREL_LO12_NC`
      " | "                 | `R_AARCH64_TLSLE_LDST32_TPREL_LO12_NC`
      " | "                 | `R_AARCH64_TLSLE_LDST64_TPREL_LO12_NC`
      " | "                 | `R_AARCH64_TLSLE_LDST128_TPREL_LO12_NC`
     32 | -2^32 <= x < 2^32 | `R_AARCH64_TLSLE_MOVW_TPREL_G1`
      " | "                 | `R_AARCH64_TLSLE_MOVW_TPREL_G0_NC`
     48 | -2^48 <= x < 2^48 | `R_AARCH64_TLSLE_MOVW_TPREL_G2`
      " | "                 | `R_AARCH64_TLSLE_MOVW_TPREL_G1_NC`
      " | "                 | `R_AARCH64_TLSLE_MOVW_TPREL_G0_NC`

   GCC for arm64 defaults to the 24-bit model and has an `-mtls-size=SIZE` option for setting other
   supported sizes. (It supports 12, 24, 32, and 48.) Clang has only implemented the 24-bit model,
   but that could change. (Clang [briefly used][D44355] load/store relocations, but it was reverted
   because no linker supported them: [BFD], [Gold], [LLD]).

   The 16-, 32-, and 48-bit models use a `movn/movz` instruction to set the highest 16 bits to a
   positive or negative value, then `movk` to set the remaining 16-bit chunks. In principle, these
   relocations should be able to accommodate a negative TP offset.

   The 24-bit model uses `add` to set the high 12 bits, then places the low 12 bits into another
   `add` or a load/store instruction.

Maybe we could modify the `R_AARCH64_TLSLE_ADD_TPREL_HI12` relocation to allow a negative TP offset
by converting the relocated `add` instruction to a `sub`. Alternatively, we could add a new
`R_AARCH64_TLSLE_SUB_TPREL_HI12` relocation, and Clang would use a different TLS LE instruction
sequence when targeting Android/arm64.

 * LLD's arm64 relaxations from GD and IE to LE would need to use `movn` instead of `movk` for
   Android.

 * Binaries linked with the flag crash on non-Bionic, and binaries without the flag crash on Bionic.
   We might want to mark the binaries somehow to indicate the non-standard TLS ABI. Suggestions:
    * Use an `--android-tls-variant2` flag (or `--bionic-tls-variant2`, we're trying to make [Bionic
      run on the host](http://b/31559095))
    * Add a `PT_ANDROID_TLS_TPOFF` segment?
    * Add a [`.note.gnu.property`](https://reviews.llvm.org/D53906#1283425) with a
      "`GNU_PROPERTY_TLS_TPOFF`" property value?

[D44355]: https://reviews.llvm.org/D44355
[BFD]: https://sourceware.org/bugzilla/show_bug.cgi?id=22970
[Gold]: https://sourceware.org/bugzilla/show_bug.cgi?id=22969
[LLD]: https://bugs.llvm.org/show_bug.cgi?id=36727

### Workaround: Reserve an Extra-Large TCB on ARM

Pros: Minimal linker change, no change to TLS relocations.

Cons: The reserved amount becomes an arbitrary but immutable part of the Android ABI.

Add an lld option: `--android-tls[-tcb=SIZE]`

As with the first workaround, we'd probably want to mark the binary to indicate the non-standard
TP-to-TLS-segment offset.

Reservation amount:
 * We would reserve at least 6 words to cover the stack guard.
 * Reserving 16 covers all the existing Bionic slots and gives a little room for expansion.
   (If we ever needed more than 16 slots, we could allocate the space before TP.)
 * 16 isn't enough for the pthread keys, so the Go runtime is still a problem.
 * Reserving 138 words is enough for existing slots and pthread keys.

### Workaround: Use Variant 1 Everywhere with an Extra-Large TCB

Pros:
 * memory layout is the same on all architectures, avoids native bridge complications
 * x86/x86-64 relocations probably handle positive offsets without issue

Cons:
 * The reserved amount is still arbitrary.

### Workaround: No LE Model in Android Executables

Pros:
 * Keeps options open. We can allow LE later if we want.
 * Bionic's existing memory layout doesn't change, and arm32 and 32-bit x86 have the same layout
 * Fixes everything but static executables

Cons:
 * more intrusive toolchain changes (affects both Clang and LLD)
 * statically-linked executables still need another workaround
 * somewhat larger/slower executables (they must use IE, not LE)

The layout conflict is apparently only a problem because an executable assumes that its TLS segment
is located at a statically-known offset from the TP (i.e. it uses the LE model). An initially-loaded
shared object can still use the efficient IE access model, but its TLS segment offset is known at
load-time, not link-time. If we can guarantee that Android's executables also use the IE model, not
LE, then the Bionic loader can place the executable's TLS segment at any offset from the TP, leaving
the existing thread-specific memory layout untouched.

This workaround doesn't help with statically-linked executables, but they're probably less of a
problem, because the linker and `libc.a` are usually packaged together.

A likely problem: LD is normally relaxed to LE, not to IE. We'd either have to disable LD usage in
the compiler (bad for performance) or add LD->IE relaxation.
This relaxation requires that IE code
sequences be no larger than LD code sequences, which may not be the case on some architectures.
(XXX: In some past testing, it looked feasible for TLSDESC but not the traditional design.)

To implement:
 * Clang would need to stop generating LE accesses.
 * LLD would need to relax GD and LD to IE instead of LE.
 * LLD should abort if it sees a TLS LE relocation.
 * LLD must not statically resolve an executable's IE relocation in the GOT. (It might assume that
   it knows its value.)
 * Perhaps LLD should mark executables specially, because a normal ELF linker's output would quietly
   trample on `pthread_internal_t`. We need something like `DF_STATIC_TLS`, but instead of
   indicating IE in an solib, we want to indicate the lack of LE in an executable.

### (Non-)workaround for Go: Allocate a Slot with Go's Magic Values

The Go runtime allocates its thread-local "g" variable by searching for a hard-coded magic constant
(`0x23581321` for arm32 and `0x23581321345589` for arm64). As long as it finds its constant at a
small positive offset from TP (within the first 384 words), it will think it has found the pthread
key it allocated.

As a temporary compatibility hack, we might try to keep these programs running by reserving a TLS
slot with this magic value. This hack doesn't appear to work, however. The runtime finds its pthread
key, but apps segfault. Perhaps the Go runtime expects its "g" variable to be zero-initialized ([one
example][go-tlsg-zero]). With this hack, it's never zero, but with its current allocation strategy,
it is typically zero. After [Bionic's pthread key system was rewritten to be
lock-free][bionic-lockfree-keys] for Android M, though, it's not guaranteed, because a key could be
recycled.
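
The scan described in this section can be modeled with a plain array standing in for the memory
after the thread pointer. This is a simplified sketch, not the Go runtime's code: the function name
`find_magic_word` and the macro names are hypothetical, and only the 384-word limit and the arm64
magic constant come from the text above.

```c
#include <stddef.h>
#include <stdint.h>

#define GO_SCAN_WORDS 384                   /* scan limit described above */
#define GO_MAGIC_ARM64 0x23581321345589ULL  /* arm64 magic described above */

/* Hypothetical stand-in for the scan: search forward from the simulated thread
 * pointer and return the index of the first word equal to `magic`, or -1 if
 * none of the first GO_SCAN_WORDS words (or `nwords`, if smaller) matches. */
static ptrdiff_t find_magic_word(const uint64_t *tp, size_t nwords, uint64_t magic) {
    size_t limit = nwords < GO_SCAN_WORDS ? nwords : GO_SCAN_WORDS;
    for (size_t i = 0; i < limit; i++) {
        if (tp[i] == magic) {
            return (ptrdiff_t)i;
        }
    }
    return -1;
}
```

In this model, a reserved slot that already holds the magic value at a lower index than the real
pthread key is the one the scan returns, and a key placed at or beyond word 384 (or before the
thread pointer) is never found.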

[go-tlsg-zero]: https://go.googlesource.com/go/+/5bc1fd42f6d185b8ff0201db09fb82886978908b/src/runtime/asm_arm64.s#980

### Workaround for Go: place pthread keys after the executable's TLS

Most Android executables do not use any `thread_local` variables. In the current prototype, with the
AOSP hikey960 build, only `/system/bin/netd` has a TLS segment, and it's only 32 bytes. As long as
`/system/bin/app_process{32,64}` limits its use of TLS memory, then the pthread keys could be
allocated after `app_process`' TLS segment, and Go will still find them.

Go scans 384 words from the thread pointer. If there are at most 16 Bionic slots and 130 pthread
keys (2 words per key), then `app_process` can use at most 108 words of TLS memory.

Drawback: In principle, this might make pthread key accesses slower, because Bionic can't assume
that pthread keys are at a fixed offset from the thread pointer anymore. It must load an offset from
somewhere (a global variable, another TLS slot, ...). `__get_thread()` already uses a TLS slot to
find `pthread_internal_t`, though, rather than assume a fixed offset. (XXX: I think it could be
optimized.)

## TODO: Memory Layout Querying APIs (Proposed)

 * https://sourceware.org/glibc/wiki/ThreadPropertiesAPI
 * http://b/30609580

## TODO: Sanitizers

XXX: Maybe a sanitizer would want to intercept allocations of TLS memory, and that could be hard if
the loader is allocating it.
 * It looks like glibc's ld.so re-relocates itself after loading a program, so a program's symbols
   can interpose calls in the loader: https://sourceware.org/ml/libc-alpha/2014-01/msg00501.html

# References

General (and x86/x86-64):
 * Ulrich Drepper's TLS document, ["ELF Handling For Thread-Local Storage."][drepper] Describes the
   overall ELF TLS design and ABI details for x86 and x86-64 (as well as several other architectures
   that Android doesn't target).
 * Alexandre Oliva's TLSDESC proposal with details for x86 and x86-64: ["Thread-Local Storage
   Descriptors for IA32 and AMD64/EM64T."][tlsdesc-x86]
 * [x86 and x86-64 SystemV psABIs][psabi-x86].

arm32:
 * Alexandre Oliva's TLSDESC proposal for arm32: ["Thread-Local Storage Descriptors for the ARM
   platform."][tlsdesc-arm]
 * ["Addenda to, and Errata in, the ABI for the ARM® Architecture."][arm-addenda] Section 3,
   "Addendum: Thread Local Storage" has details for arm32 non-TLSDESC ELF TLS.
 * ["Run-time ABI for the ARM® Architecture."][arm-rtabi] Documents `__aeabi_read_tp`.
 * ["ELF for the ARM® Architecture."][arm-elf] Lists TLS relocations (traditional and TLSDESC).

arm64:
 * [2015 LLVM bugtracker comment][llvm22408] with an excerpt from an unnamed ARM draft specification
   describing arm64 code sequences necessary for linker relaxation
 * ["ELF for the ARM® 64-bit Architecture (AArch64)."][arm64-elf] Lists TLS relocations (traditional
   and TLSDESC).

[drepper]: https://www.akkadia.org/drepper/tls.pdf
[tlsdesc-x86]: https://www.fsfla.org/~lxoliva/writeups/TLS/RFC-TLSDESC-x86.txt
[psabi-x86]: https://github.com/hjl-tools/x86-psABI/wiki/X86-psABI
[tlsdesc-arm]: https://www.fsfla.org/~lxoliva/writeups/TLS/RFC-TLSDESC-ARM.txt
[arm-addenda]: http://infocenter.arm.com/help/topic/com.arm.doc.ihi0045e/IHI0045E_ABI_addenda.pdf
[arm-rtabi]: http://infocenter.arm.com/help/topic/com.arm.doc.ihi0043d/IHI0043D_rtabi.pdf
[arm-elf]: http://infocenter.arm.com/help/topic/com.arm.doc.ihi0044f/IHI0044F_aaelf.pdf
[llvm22408]: https://bugs.llvm.org/show_bug.cgi?id=22408#c10
[arm64-elf]: http://infocenter.arm.com/help/topic/com.arm.doc.ihi0056b/IHI0056B_aaelf64.pdf