i did some testing trying to switch malloc to use the new internal
lock with priority inheritance, and my malloc contention test got
20-100 times slower. if priority inheritance futexes are this slow,
it's simply too high a price to pay for avoiding priority inversion.
maybe we can consider them somewhere down the road once the kernel
folks get their act together on this (and perferably don't link it to
glibc's inefficient lock API)...
as such, i've switch __lock to use malloc's implementation of
lightweight locks, and updated all the users of the code to use an
array with a waiter count for their locks. this should give optimal
performance in the vast majority of cases, and it's simple.
malloc is still using its own internal copy of the lock code because
it seems to yield measurably better performance with -O3 when it's
inlined (20% or more difference in the contention stress test).
why does this affect behavior? well, the linker seems to traverse
archive files starting from its current position when resolving
symbols. since calloc.c comes alphabetically (and thus in sequence in
the archive file) between __simple_malloc.c and malloc.c, attempts to
resolve the "malloc" symbol for use by calloc.c were pulling in the
full malloc.c implementation rather than the __simple_malloc.c
implementation.
as of now, lite_malloc.c and malloc.c are adjacent in the archive and
in the correct order, so malloc.c should never be used to resolve
"malloc" unless it's already needed to resolve another symbol ("free"
or "realloc").
this change is made with some reluctance, but i think it's for the
best. correct programs must handle either behavior, so there is little
advantage to having malloc(0) return NULL. and i managed to actually
make the malloc code slightly smaller with this change.
do not allow allocations that overflow ptrdiff_t; fix some overflow
checks that were not quite right but didn't matter due to address
layout implementation.