commit 06fbefd100 (first included in
release 1.1.17) introduced this regression.
patch by Adrian Bunk. it fixes the regression in all cases, but
spuriously prevents use of the clz instruction on very old compiler
versions that don't define __ARM_ARCH. this may be fixed in a more
general way at some point in the future. it also omits thumb1 logic
since building as thumb1 code is currently not supported.
counts leading zero bits of a 64bit int, undefined on zero input.
(has nothing to do with atomics, added to atomic.h so target specific
helper functions are together.)
there is a logarithmic generic implementation and another in terms of
a 32bit a_clz_32 on targets where that's available.
three problems are addressed:
- use of pc arithmetic, which was difficult if not impossible to make
correct in thumb mode on all models, so that relative rather than
absolute pointers to the backends could be used. this was designed
back when there was no coherent model for the early stages of the
dynamic linker before relocations, and is no longer necessary.
- assumption that data (the relative pointers to the backends) can be
accessed at a constant displacement from the code. this will not be
possible on future fdpic subarchs (for cortex-m), so move
responsibility for loading the backend code address to the caller.
- hard-coded arm opcodes using the .word directive. instead, use the
.arch directive to work around the assembler's refusal to assemble
instructions not available (or in some cases, available but just
considered deprecated) in the target isa level. the obscure v6t2
arch is used for v6 code so as to (1) allow generation of thumb2
output if -mthumb is active, and (2) avoid warnings/errors for mcr
barriers that clang would produce if we just set arch to v7-a.
in addition, the __aeabi_read_tp function is moved out of the inner
workings and implemented as an asm wrapper around a C function, so
that asm code does not need to read global data. the asm wrapper
serves to satisfy the ABI calling convention requirements for this
function.
"Q" input constraint was used for the written object, instead of "=Q"
output constraint. this should not cause problems because "memory"
is on the clobber list, but "=Q" better documents the intent and more
consistent with the actual asm code.
this changes the generated code, because different registers are used,
but other than the register names nothing should change.
contrary to commit 89e149d275, big
endian arm does need the instruction bytes in big endian order. rather
than trying to use a special encoding that works as arm or thumb,
simply encode the simplest/canonical undefined instructions dependent
on whether __thumb__ is defined.
the .byte directive encodes a guaranteed-undefined instruction, the
same one Linux fills the kuser helper page with when it's disabled.
the udf mnemonic and and .insn directives are not supported by old
binutils versions, and larger-than-byte integer directives would
produce the wrong output on big-endian.
switch to ll/sc model so that new atomic.h can provide optimized
versions of all the atomic primitives without needing an ll/sc loop
written in asm for each one.
all isa levels which use ldrex/strex now use the inline ll/sc model
even if the type of barrier to use is not known until runtime (v6).
the cas model is only used for arm v5 and earlier, and it has been
optimized to make the call via inline asm with custom constraints
rather than as a C function call.
rather than having each arch provide its own atomic.h, there is a new
shared atomic.h in src/internal which pulls arch-specific definitions
from arc/$(ARCH)/atomic_arch.h. the latter can be extremely minimal,
defining only a_cas or new ll/sc type primitives which the shared
atomic.h will use to construct everything else.
this commit avoids making heavy changes to the individual archs'
atomic implementations. definitions which are identical or
near-identical to what the new shared atomic.h would produce have been
removed, but otherwise the changes made are just hooking up the
arch-specific files to the new infrastructure. major changes to take
advantage of the new system will come in subsequent commits.
previously, builds for pre-armv6 targets hard-coded use of the "kuser
helper" system for atomics and thread-pointer access, resulting in
binaries that fail to run (crash) on systems where this functionality
has been disabled (as a security/hardening measure) in the kernel.
additionally, builds for armv6 hard-coded an outdated/deprecated
memory barrier instruction which may require emulation (extremely
slow) on future models.
this overhaul replaces the behavior for all pre-armv7 builds (both of
the above cases) to perform runtime detection of the appropriate
mechanisms for barrier, atomic compare-and-swap, and thread pointer
access. detection is based on information provided by the kernel in
auxv: presence of the HWCAP_TLS bit for AT_HWCAP and the architecture
version encoded in AT_PLATFORM. direct use of the instructions is
preferred when possible, since probing for the existence of the kuser
helper page would be difficult and would incur runtime cost.
for builds targeting armv7 or later, the runtime detection code is not
compiled at all, and much more efficient versions of the non-cas
atomic operations are provided by using ldrex/strex directly rather
than wrapping cas.
conceptually, a_spin needs to be at least a compiler barrier, so the
compiler will not optimize out loops (and the load on each iteration)
while spinning. it should also be a memory barrier, or the spinning
thread might keep spinning without noticing stores from other threads,
thus delaying for longer than it should.
ideally, an optimal a_spin implementation that avoids unnecessary
cache/memory contention should be chosen for each arch, but for now,
the easiest thing is to perform a useless a_cas on the calling
thread's stack.
the a_cas_l, a_swap_l, a_swap_p, and a_store_l operations were
probably used a long time ago when only i386 and x86_64 were
supported. as other archs were added, support for them was
inconsistent, and they are obviously not in use at present. having
them around potentially confuses readers working on new ports, and the
type-punning hacks and inconsistent use of types in their definitions
is not a style I wish to perpetuate in the source tree, so removing
them seems appropriate.
armv7/thumb2 provides a way to do atomics in thumb mode, but for armv6
we need a call to arm mode.
this commit is based on a patch by Stephen Thomas which fixed the
armv7 cases but not the armv6 ones.
all of this should be revisited if/when runtime selection of thread
pointer access and atomics are added.
aside from potentially offering better performance, this change is
needed since the old coprocessor-based approach to barriers is
deprecated in arm v7, and some compilers/assemblers issue errors when
using the deprecated instruction for v7 targets.
the "m" constraint could give a memory reference with an offset that's
not compatible with ldrex/strex, so the arm-specific "Q" constraint is
needed instead.
this is perhaps not the optimal implementation; a_cas still compiles
to nested loops due to the different interface contracts of the kuser
helper cas function (whose contract this patch implements) and the
a_cas function (whose contract mimics the x86 cmpxchg). fixing this
may be possible, but it's more complicated and thus deferred until a
later time.
aside from improving performance and code size, this patch also
provides a means of producing binaries which can run on hardened
kernels where the kuser helpers have been disabled. however, at
present this requires producing binaries for armv6k or later, which
will not run on older cpus. a real solution to the problem of kernels
that omit the kuser helpers would be runtime detection, so that
universal binaries which run on all arm cpu models can also be
compatible with all kernel hardening profiles. robust detection
however is a much harder problem, and will be addressed at a later
time.
atomic store was lacking a barrier, which was fine for legacy arm with
no real smp and kernel-emulated cas, but unsuitable for more modern
systems. the kernel provides another "kuser" function, at 0xffff0fa0,
which could be used for the barrier, but using that would drop support
for kernels 2.6.12 through 2.6.14 unless an extra conditional were
added to check for barrier availability. just using the barrier in the
kernel cas is easier, and, based on my reading of the assembly code in
the kernel, does not appear to be significantly slower.
at the same time, other atomic operations are adapted to call the
kernel cas function directly rather than using a_cas; due to small
differences in their interface contracts, this makes the generated
code much simpler.
this port assumes eabi calling conventions, eabi linux syscall
convention, and presence of the kernel helpers at 0xffff0f?0 needed
for threads support. otherwise it makes very few assumptions, and the
code should work even on armv4 without thumb support, as well as on
systems with thumb interworking. the bits headers declare this a
little endian system, but as far as i can tell the code should work
equally well on big endian.
some small details are probably broken; so far, testing has been
limited to qemu/aboriginal linux.