Library Jemalloc updated to 5.0.1 (#721)

2026-01-18 03:15:41 +00:00 · 2017-11-26 17:46:07 +01:00
parent 5ef601b7cf
commit d87134b8ea
121 changed files with 29264 additions and 9335 deletions
--- a/modules/worldengine/deps/jemalloc/ChangeLog
+++ b/modules/worldengine/deps/jemalloc/ChangeLog
@@ -1,10 +1,727 @@
 Following are change highlights associated with official releases.  Important
-bug fixes are all mentioned, but internal enhancements are omitted here for
-brevity (even though they are more fun to write about).  Much more detail can be
-found in the git revision history:
+bug fixes are all mentioned, but some internal enhancements are omitted here for
+brevity.  Much more detail can be found in the git revision history:

    https://github.com/jemalloc/jemalloc

+* 5.0.1 (July 1, 2017)
+
+  This bugfix release fixes several issues, most of which are obscure enough
+  that typical applications are not impacted.
+
+  Bug fixes:
+  - Update decay->nunpurged before purging, in order to avoid potential update
+    races and subsequent incorrect purging volume.  (@interwq)
+  - Only abort on dlsym(3) error if the failure impacts an enabled feature (lazy
+    locking and/or background threads).  This mitigates an initialization
+    failure bug for which we still do not have a clear reproduction test case.
+    (@interwq)
+  - Modify tsd management so that it neither crashes nor leaks if a thread's
+    only allocation activity is to call free() after TLS destructors have been
+    executed.  This behavior was observed when operating with GNU libc, and is
+    unlikely to be an issue with other libc implementations.  (@interwq)
+  - Mask signals during background thread creation.  This prevents signals from
+    being inadvertently delivered to background threads.  (@jasone,
+    @davidgoldblatt, @interwq)
+  - Avoid inactivity checks within background threads, in order to prevent
+    recursive mutex acquisition.  (@interwq)
+  - Fix extent_grow_retained() to use the specified hooks when the
+    arena.<i>.extent_hooks mallctl is used to override the default hooks.
+    (@interwq)
+  - Add missing reentrancy support for custom extent hooks which allocate.
+    (@interwq)
+  - Post-fork(2), re-initialize the list of tcaches associated with each arena
+    to contain no tcaches except the forking thread's.  (@interwq)
+  - Add missing post-fork(2) mutex reinitialization for extent_grow_mtx.  This
+    fixes potential deadlocks after fork(2).  (@interwq)
+  - Enforce minimum autoconf version (currently 2.68), since 2.63 is known to
+    generate corrupt configure scripts.  (@jasone)
+  - Ensure that the configured page size (--with-lg-page) is no larger than the
+    configured huge page size (--with-lg-hugepage).  (@jasone)
+
+* 5.0.0 (June 13, 2017)
+
+  Unlike all previous jemalloc releases, this release does not use naturally
+  aligned "chunks" for virtual memory management, and instead uses page-aligned
+  "extents".  This change has few externally visible effects, but the internal
+  impacts are... extensive.  Many other internal changes combine to make this
+  the most cohesively designed version of jemalloc so far, with ample
+  opportunity for further enhancements.
+
+  Continuous integration is now an integral aspect of development thanks to the
+  efforts of @davidtgoldblatt, and the dev branch tends to remain reasonably
+  stable on the tested platforms (Linux, FreeBSD, macOS, and Windows).  As a
+  side effect the official release frequency may decrease over time.
+
+  New features:
+  - Implement optional per-CPU arena support; threads choose which arena to use
+    based on current CPU rather than on fixed thread-->arena associations.
+    (@interwq)
+  - Implement two-phase decay of unused dirty pages.  Pages transition from
+    dirty-->muzzy-->clean, where the first phase transition relies on
+    madvise(... MADV_FREE) semantics, and the second phase transition discards
+    pages such that they are replaced with demand-zeroed pages on next access.
+    (@jasone)
+  - Increase decay time resolution from seconds to milliseconds.  (@jasone)
+  - Implement opt-in per CPU background threads, and use them for asynchronous
+    decay-driven unused dirty page purging.  (@interwq)
+  - Add mutex profiling, which collects a variety of statistics useful for
+    diagnosing overhead/contention issues.  (@interwq)
+  - Add C++ new/delete operator bindings.  (@djwatson)
+  - Support manually created arena destruction, such that all data and metadata
+    are discarded.  Add MALLCTL_ARENAS_DESTROYED for accessing merged stats
+    associated with destroyed arenas.  (@jasone)
+  - Add MALLCTL_ARENAS_ALL as a fixed index for use in accessing
+    merged/destroyed arena statistics via mallctl.  (@jasone)
+  - Add opt.abort_conf to optionally abort if invalid configuration options are
+    detected during initialization.  (@interwq)
+  - Add opt.stats_print_opts, so that e.g. JSON output can be selected for the
+    stats dumped during exit if opt.stats_print is true.  (@jasone)
+  - Add --with-version=VERSION for use when embedding jemalloc into another
+    project's git repository.  (@jasone)
+  - Add --disable-thp to support cross compiling.  (@jasone)
+  - Add --with-lg-hugepage to support cross compiling.  (@jasone)
+  - Add mallctl interfaces (various authors):
+    + background_thread
+    + opt.abort_conf
+    + opt.retain
+    + opt.percpu_arena
+    + opt.background_thread
+    + opt.{dirty,muzzy}_decay_ms
+    + opt.stats_print_opts
+    + arena.<i>.initialized
+    + arena.<i>.destroy
+    + arena.<i>.{dirty,muzzy}_decay_ms
+    + arena.<i>.extent_hooks
+    + arenas.{dirty,muzzy}_decay_ms
+    + arenas.bin.<i>.slab_size
+    + arenas.nlextents
+    + arenas.lextent.<i>.size
+    + arenas.create
+    + stats.background_thread.{num_threads,num_runs,run_interval}
+    + stats.mutexes.{ctl,background_thread,prof,reset}.
+      {num_ops,num_spin_acq,num_wait,max_wait_time,total_wait_time,max_num_thds,
+      num_owner_switch}
+    + stats.arenas.<i>.{dirty,muzzy}_decay_ms
+    + stats.arenas.<i>.uptime
+    + stats.arenas.<i>.{pmuzzy,base,internal,resident}
+    + stats.arenas.<i>.{dirty,muzzy}_{npurge,nmadvise,purged}
+    + stats.arenas.<i>.bins.<j>.{nslabs,reslabs,curslabs}
+    + stats.arenas.<i>.bins.<j>.mutex.
+      {num_ops,num_spin_acq,num_wait,max_wait_time,total_wait_time,max_num_thds,
+      num_owner_switch}
+    + stats.arenas.<i>.lextents.<j>.{nmalloc,ndalloc,nrequests,curlextents}
+    + stats.arenas.i.mutexes.{large,extent_avail,extents_dirty,extents_muzzy,
+      extents_retained,decay_dirty,decay_muzzy,base,tcache_list}.
+      {num_ops,num_spin_acq,num_wait,max_wait_time,total_wait_time,max_num_thds,
+      num_owner_switch}
+
+  Portability improvements:
+  - Improve reentrant allocation support, such that deadlock is less likely if
+    e.g. a system library call in turn allocates memory.  (@davidtgoldblatt,
+    @interwq)
+  - Support static linking of jemalloc with glibc.  (@djwatson)
+
+  Optimizations and refactors:
+  - Organize virtual memory as "extents" of virtual memory pages, rather than as
+    naturally aligned "chunks", and store all metadata in arbitrarily distant
+    locations.  This reduces virtual memory external fragmentation, and will
+    interact better with huge pages (not yet explicitly supported).  (@jasone)
+  - Fold large and huge size classes together; only small and large size classes
+    remain.  (@jasone)
+  - Unify the allocation paths, and merge most fast-path branching decisions.
+    (@davidtgoldblatt, @interwq)
+  - Embed per thread automatic tcache into thread-specific data, which reduces
+    conditional branches and dereferences.  Also reorganize tcache to increase
+    fast-path data locality.  (@interwq)
+  - Rewrite atomics to closely model the C11 API, convert various
+    synchronization from mutex-based to atomic, and use the explicit memory
+    ordering control to resolve various hypothetical races without increasing
+    synchronization overhead.  (@davidtgoldblatt)
+  - Extensively optimize rtree via various methods:
+    + Add multiple layers of rtree lookup caching, since rtree lookups are now
+      part of fast-path deallocation.  (@interwq)
+    + Determine rtree layout at compile time.  (@jasone)
+    + Make the tree shallower for common configurations.  (@jasone)
+    + Embed the root node in the top-level rtree data structure, thus avoiding
+      one level of indirection.  (@jasone)
+    + Further specialize leaf elements as compared to internal node elements,
+      and directly embed extent metadata needed for fast-path deallocation.
+      (@jasone)
+    + Ignore leading always-zero address bits (architecture-specific).
+      (@jasone)
+  - Reorganize headers (ongoing work) to make them hermetic, and disentangle
+    various module dependencies.  (@davidtgoldblatt)
+  - Convert various internal data structures such as size class metadata from
+    boot-time-initialized to compile-time-initialized.  Propagate resulting data
+    structure simplifications, such as making arena metadata fixed-size.
+    (@jasone)
+  - Simplify size class lookups when constrained to size classes that are
+    multiples of the page size.  This speeds lookups, but the primary benefit is
+    complexity reduction in code that was the source of numerous regressions.
+    (@jasone)
+  - Lock individual extents when possible for localized extent operations,
+    rather than relying on a top-level arena lock.  (@davidtgoldblatt, @jasone)
+  - Use first fit layout policy instead of best fit, in order to improve
+    packing.  (@jasone)
+  - If munmap(2) is not in use, use an exponential series to grow each arena's
+    virtual memory, so that the number of disjoint virtual memory mappings
+    remains low.  (@jasone)
+  - Implement per arena base allocators, so that arenas never share any virtual
+    memory pages.  (@jasone)
+  - Automatically generate private symbol name mangling macros.  (@jasone)
+
+  Incompatible changes:
+  - Replace chunk hooks with an expanded/normalized set of extent hooks.
+    (@jasone)
+  - Remove ratio-based purging.  (@jasone)
+  - Remove --disable-tcache.  (@jasone)
+  - Remove --disable-tls.  (@jasone)
+  - Remove --enable-ivsalloc.  (@jasone)
+  - Remove --with-lg-size-class-group.  (@jasone)
+  - Remove --with-lg-tiny-min.  (@jasone)
+  - Remove --disable-cc-silence.  (@jasone)
+  - Remove --enable-code-coverage.  (@jasone)
+  - Remove --disable-munmap (replaced by opt.retain).  (@jasone)
+  - Remove Valgrind support.  (@jasone)
+  - Remove quarantine support.  (@jasone)
+  - Remove redzone support.  (@jasone)
+  - Remove mallctl interfaces (various authors):
+    + config.munmap
+    + config.tcache
+    + config.tls
+    + config.valgrind
+    + opt.lg_chunk
+    + opt.purge
+    + opt.lg_dirty_mult
+    + opt.decay_time
+    + opt.quarantine
+    + opt.redzone
+    + opt.thp
+    + arena.<i>.lg_dirty_mult
+    + arena.<i>.decay_time
+    + arena.<i>.chunk_hooks
+    + arenas.initialized
+    + arenas.lg_dirty_mult
+    + arenas.decay_time
+    + arenas.bin.<i>.run_size
+    + arenas.nlruns
+    + arenas.lrun.<i>.size
+    + arenas.nhchunks
+    + arenas.hchunk.<i>.size
+    + arenas.extend
+    + stats.cactive
+    + stats.arenas.<i>.lg_dirty_mult
+    + stats.arenas.<i>.decay_time
+    + stats.arenas.<i>.metadata.{mapped,allocated}
+    + stats.arenas.<i>.{npurge,nmadvise,purged}
+    + stats.arenas.<i>.huge.{allocated,nmalloc,ndalloc,nrequests}
+    + stats.arenas.<i>.bins.<j>.{nruns,reruns,curruns}
+    + stats.arenas.<i>.lruns.<j>.{nmalloc,ndalloc,nrequests,curruns}
+    + stats.arenas.<i>.hchunks.<j>.{nmalloc,ndalloc,nrequests,curhchunks}
+
+  Bug fixes:
+  - Improve interval-based profile dump triggering to dump only one profile when
+    a single allocation's size exceeds the interval.  (@jasone)
+  - Use prefixed function names (as controlled by --with-jemalloc-prefix) when
+    pruning backtrace frames in jeprof.  (@jasone)
+
+* 4.5.0 (February 28, 2017)
+
+  This is the first release to benefit from much broader continuous integration
+  testing, thanks to @davidtgoldblatt.  Had we had this testing infrastructure
+  in place for prior releases, it would have caught all of the most serious
+  regressions fixed by this release.
+
+  New features:
+  - Add --disable-thp and the opt.thp mallctl to provide opt-out mechanisms for
+    transparent huge page integration.  (@jasone)
+  - Update zone allocator integration to work with macOS 10.12.  (@glandium)
+  - Restructure *CFLAGS configuration, so that CFLAGS behaves typically, and
+    EXTRA_CFLAGS provides a way to specify e.g. -Werror during building, but not
+    during configuration.  (@jasone, @ronawho)
+
+  Bug fixes:
+  - Fix DSS (sbrk(2)-based) allocation.  This regression was first released in
+    4.3.0.  (@jasone)
+  - Handle race in per size class utilization computation.  This functionality
+    was first released in 4.0.0.  (@interwq)
+  - Fix lock order reversal during gdump.  (@jasone)
+  - Fix/refactor tcache synchronization.  This regression was first released in
+    4.0.0.  (@jasone)
+  - Fix various JSON-formatted malloc_stats_print() bugs.  This functionality
+    was first released in 4.3.0.  (@jasone)
+  - Fix huge-aligned allocation.  This regression was first released in 4.4.0.
+    (@jasone)
+  - When transparent huge page integration is enabled, detect what state pages
+    start in according to the kernel's current operating mode, and only convert
+    arena chunks to non-huge during purging if that is not their initial state.
+    This functionality was first released in 4.4.0.  (@jasone)
+  - Fix lg_chunk clamping for the --enable-cache-oblivious --disable-fill case.
+    This regression was first released in 4.0.0.  (@jasone, @428desmo)
+  - Properly detect sparc64 when building for Linux.  (@glaubitz)
+
+* 4.4.0 (December 3, 2016)
+
+  New features:
+  - Add configure support for *-*-linux-android.  (@cferris1000, @jasone)
+  - Add the --disable-syscall configure option, for use on systems that place
+    security-motivated limitations on syscall(2).  (@jasone)
+  - Add support for Debian GNU/kFreeBSD.  (@thesam)
+
+  Optimizations:
+  - Add extent serial numbers and use them where appropriate as a sort key that
+    is higher priority than address, so that the allocation policy prefers older
+    extents.  This tends to improve locality (decrease fragmentation) when
+    memory grows downward.  (@jasone)
+  - Refactor madvise(2) configuration so that MADV_FREE is detected and utilized
+    on Linux 4.5 and newer.  (@jasone)
+  - Mark partially purged arena chunks as non-huge-page.  This improves
+    interaction with Linux's transparent huge page functionality.  (@jasone)
+
+  Bug fixes:
+  - Fix size class computations for edge conditions involving extremely large
+    allocations.  This regression was first released in 4.0.0.  (@jasone,
+    @ingvarha)
+  - Remove overly restrictive assertions related to the cactive statistic.  This
+    regression was first released in 4.1.0.  (@jasone)
+  - Implement a more reliable detection scheme for os_unfair_lock on macOS.
+    (@jszakmeister)
+
+* 4.3.1 (November 7, 2016)
+
+  Bug fixes:
+  - Fix a severe virtual memory leak.  This regression was first released in
+    4.3.0.  (@interwq, @jasone)
+  - Refactor atomic and prng APIs to restore support for 32-bit platforms that
+    use pre-C11 toolchains, e.g. FreeBSD's mips.  (@jasone)
+
+* 4.3.0 (November 4, 2016)
+
+  This is the first release that passes the test suite for multiple Windows
+  configurations, thanks in large part to @glandium setting up continuous
+  integration via AppVeyor (and Travis CI for Linux and OS X).
+
+  New features:
+  - Add "J" (JSON) support to malloc_stats_print().  (@jasone)
+  - Add Cray compiler support.  (@ronawho)
+
+  Optimizations:
+  - Add/use adaptive spinning for bootstrapping and radix tree node
+    initialization.  (@jasone)
+
+  Bug fixes:
+  - Fix large allocation to search starting in the optimal size class heap,
+    which can substantially reduce virtual memory churn and fragmentation.  This
+    regression was first released in 4.0.0.  (@mjp41, @jasone)
+  - Fix stats.arenas.<i>.nthreads accounting.  (@interwq)
+  - Fix and simplify decay-based purging.  (@jasone)
+  - Make DSS (sbrk(2)-related) operations lockless, which resolves potential
+    deadlocks during thread exit.  (@jasone)
+  - Fix over-sized allocation of radix tree leaf nodes.  (@mjp41, @ogaun,
+    @jasone)
+  - Fix over-sized allocation of arena_t (plus associated stats) data
+    structures.  (@jasone, @interwq)
+  - Fix EXTRA_CFLAGS to not affect configuration.  (@jasone)
+  - Fix a Valgrind integration bug.  (@ronawho)
+  - Disallow 0x5a junk filling when running in Valgrind.  (@jasone)
+  - Fix a file descriptor leak on Linux.  This regression was first released in
+    4.2.0.  (@vsarunas, @jasone)
+  - Fix static linking of jemalloc with glibc.  (@djwatson)
+  - Use syscall(2) rather than {open,read,close}(2) during boot on Linux.  This
+    works around other libraries' system call wrappers performing reentrant
+    allocation.  (@kspinka, @Whissi, @jasone)
+  - Fix OS X default zone replacement to work with OS X 10.12.  (@glandium,
+    @jasone)
+  - Fix cached memory management to avoid needless commit/decommit operations
+    during purging, which resolves permanent virtual memory map fragmentation
+    issues on Windows.  (@mjp41, @jasone)
+  - Fix TSD fetches to avoid (recursive) allocation.  This is relevant to
+    non-TLS and Windows configurations.  (@jasone)
+  - Fix malloc_conf overriding to work on Windows.  (@jasone)
+  - Forcibly disable lazy-lock on Windows (was forcibly *enabled*).  (@jasone)
+
+* 4.2.1 (June 8, 2016)
+
+  Bug fixes:
+  - Fix bootstrapping issues for configurations that require allocation during
+    tsd initialization (e.g. --disable-tls).  (@cferris1000, @jasone)
+  - Fix gettimeofday() version of nstime_update().  (@ronawho)
+  - Fix Valgrind regressions in calloc() and chunk_alloc_wrapper().  (@ronawho)
+  - Fix potential VM map fragmentation regression.  (@jasone)
+  - Fix opt_zero-triggered in-place huge reallocation zeroing.  (@jasone)
+  - Fix heap profiling context leaks in reallocation edge cases.  (@jasone)
+
+* 4.2.0 (May 12, 2016)
+
+  New features:
+  - Add the arena.<i>.reset mallctl, which makes it possible to discard all of
+    an arena's allocations in a single operation.  (@jasone)
+  - Add the stats.retained and stats.arenas.<i>.retained statistics.  (@jasone)
+  - Add the --with-version configure option.  (@jasone)
+  - Support --with-lg-page values larger than actual page size.  (@jasone)
+
+  Optimizations:
+  - Use pairing heaps rather than red-black trees for various hot data
+    structures.  (@djwatson, @jasone)
+  - Streamline fast paths of rtree operations.  (@jasone)
+  - Optimize the fast paths of calloc() and [m,d,sd]allocx().  (@jasone)
+  - Decommit unused virtual memory if the OS does not overcommit.  (@jasone)
+  - Specify MAP_NORESERVE on Linux if [heuristic] overcommit is active, in order
+    to avoid unfortunate interactions during fork(2).  (@jasone)
+
+  Bug fixes:
+  - Fix chunk accounting related to triggering gdump profiles.  (@jasone)
+  - Link against librt for clock_gettime(2) if glibc < 2.17.  (@jasone)
+  - Scale leak report summary according to sampling probability.  (@jasone)
+
+* 4.1.1 (May 3, 2016)
+
+  This bugfix release resolves a variety of mostly minor issues, though the
+  bitmap fix is critical for 64-bit Windows.
+
+  Bug fixes:
+  - Fix the linear scan version of bitmap_sfu() to shift by the proper amount
+    even when sizeof(long) is not the same as sizeof(void *), as on 64-bit
+    Windows.  (@jasone)
+  - Fix hashing functions to avoid unaligned memory accesses (and resulting
+    crashes).  This is relevant at least to some ARM-based platforms.
+    (@rkmisra)
+  - Fix fork()-related lock rank ordering reversals.  These reversals were
+    unlikely to cause deadlocks in practice except when heap profiling was
+    enabled and active.  (@jasone)
+  - Fix various chunk leaks in OOM code paths.  (@jasone)
+  - Fix malloc_stats_print() to print opt.narenas correctly.  (@jasone)
+  - Fix MSVC-specific build/test issues.  (@rustyx, @yuslepukhin)
+  - Fix a variety of test failures that were due to test fragility rather than
+    core bugs.  (@jasone)
+
+* 4.1.0 (February 28, 2016)
+
+  This release is primarily about optimizations, but it also incorporates a lot
+  of portability-motivated refactoring and enhancements.  Many people worked on
+  this release, to an extent that even with the omission here of minor changes
+  (see git revision history), and of the people who reported and diagnosed
+  issues, so much of the work was contributed that starting with this release,
+  changes are annotated with author credits to help reflect the collaborative
+  effort involved.
+
+  New features:
+  - Implement decay-based unused dirty page purging, a major optimization with
+    mallctl API impact.  This is an alternative to the existing ratio-based
+    unused dirty page purging, and is intended to eventually become the sole
+    purging mechanism.  New mallctls:
+    + opt.purge
+    + opt.decay_time
+    + arena.<i>.decay
+    + arena.<i>.decay_time
+    + arenas.decay_time
+    + stats.arenas.<i>.decay_time
+    (@jasone, @cevans87)
+  - Add --with-malloc-conf, which makes it possible to embed a default
+    options string during configuration.  This was motivated by the desire to
+    specify --with-malloc-conf=purge:decay , since the default must remain
+    purge:ratio until the 5.0.0 release.  (@jasone)
+  - Add MS Visual Studio 2015 support.  (@rustyx, @yuslepukhin)
+  - Make *allocx() size class overflow behavior defined.  The maximum
+    size class is now less than PTRDIFF_MAX to protect applications against
+    numerical overflow, and all allocation functions are guaranteed to indicate
+    errors rather than potentially crashing if the request size exceeds the
+    maximum size class.  (@jasone)
+  - jeprof:
+    + Add raw heap profile support.  (@jasone)
+    + Add --retain and --exclude for backtrace symbol filtering.  (@jasone)
+
+  Optimizations:
+  - Optimize the fast path to combine various bootstrapping and configuration
+    checks and execute more streamlined code in the common case.  (@interwq)
+  - Use linear scan for small bitmaps (used for small object tracking).  In
+    addition to speeding up bitmap operations on 64-bit systems, this reduces
+    allocator metadata overhead by approximately 0.2%.  (@djwatson)
+  - Separate arena_avail trees, which substantially speeds up run tree
+    operations.  (@djwatson)
+  - Use memoization (boot-time-computed table) for run quantization.  Separate
+    arena_avail trees reduced the importance of this optimization.  (@jasone)
+  - Attempt mmap-based in-place huge reallocation.  This can dramatically speed
+    up incremental huge reallocation.  (@jasone)
+
+  Incompatible changes:
+  - Make opt.narenas unsigned rather than size_t.  (@jasone)
+
+  Bug fixes:
+  - Fix stats.cactive accounting regression.  (@rustyx, @jasone)
+  - Handle unaligned keys in hash().  This caused problems for some ARM systems.
+    (@jasone, @cferris1000)
+  - Refactor arenas array.  In addition to fixing a fork-related deadlock, this
+    makes arena lookups faster and simpler.  (@jasone)
+  - Move retained memory allocation out of the default chunk allocation
+    function, to a location that gets executed even if the application installs
+    a custom chunk allocation function.  This resolves a virtual memory leak.
+    (@buchgr)
+  - Fix a potential tsd cleanup leak.  (@cferris1000, @jasone)
+  - Fix run quantization.  In practice this bug had no impact unless
+    applications requested memory with alignment exceeding one page.
+    (@jasone, @djwatson)
+  - Fix LinuxThreads-specific bootstrapping deadlock.  (Cosmin Paraschiv)
+  - jeprof:
+    + Don't discard curl options if timeout is not defined.  (@djwatson)
+    + Detect failed profile fetches.  (@djwatson)
+  - Fix stats.arenas.<i>.{dss,lg_dirty_mult,decay_time,pactive,pdirty} for
+    --disable-stats case.  (@jasone)
+
+* 4.0.4 (October 24, 2015)
+
+  This bugfix release fixes another xallocx() regression.  No other regressions
+  have come to light in over a month, so this is likely a good starting point
+  for people who prefer to wait for "dot one" releases with all the major issues
+  shaken out.
+
+  Bug fixes:
+  - Fix xallocx(..., MALLOCX_ZERO to zero the last full trailing page of large
+    allocations that have been randomly assigned an offset of 0 when
+    --enable-cache-oblivious configure option is enabled.
+
+* 4.0.3 (September 24, 2015)
+
+  This bugfix release continues the trend of xallocx() and heap profiling fixes.
+
+  Bug fixes:
+  - Fix xallocx(..., MALLOCX_ZERO) to zero all trailing bytes of large
+    allocations when --enable-cache-oblivious configure option is enabled.
+  - Fix xallocx(..., MALLOCX_ZERO) to zero trailing bytes of huge allocations
+    when resizing from/to a size class that is not a multiple of the chunk size.
+  - Fix prof_tctx_dump_iter() to filter out nodes that were created after heap
+    profile dumping started.
+  - Work around a potentially bad thread-specific data initialization
+    interaction with NPTL (glibc's pthreads implementation).
+
+* 4.0.2 (September 21, 2015)
+
+  This bugfix release addresses a few bugs specific to heap profiling.
+
+  Bug fixes:
+  - Fix ixallocx_prof_sample() to never modify nor create sampled small
+    allocations.  xallocx() is in general incapable of moving small allocations,
+    so this fix removes buggy code without loss of generality.
+  - Fix irallocx_prof_sample() to always allocate large regions, even when
+    alignment is non-zero.
+  - Fix prof_alloc_rollback() to read tdata from thread-specific data rather
+    than dereferencing a potentially invalid tctx.
+
+* 4.0.1 (September 15, 2015)
+
+  This is a bugfix release that is somewhat high risk due to the amount of
+  refactoring required to address deep xallocx() problems.  As a side effect of
+  these fixes, xallocx() now tries harder to partially fulfill requests for
+  optional extra space.  Note that a couple of minor heap profiling
+  optimizations are included, but these are better thought of as performance
+  fixes that were integral to disovering most of the other bugs.
+
+  Optimizations:
+  - Avoid a chunk metadata read in arena_prof_tctx_set(), since it is in the
+    fast path when heap profiling is enabled.  Additionally, split a special
+    case out into arena_prof_tctx_reset(), which also avoids chunk metadata
+    reads.
+  - Optimize irallocx_prof() to optimistically update the sampler state.  The
+    prior implementation appears to have been a holdover from when
+    rallocx()/xallocx() functionality was combined as rallocm().
+
+  Bug fixes:
+  - Fix TLS configuration such that it is enabled by default for platforms on
+    which it works correctly.
+  - Fix arenas_cache_cleanup() and arena_get_hard() to handle
+    allocation/deallocation within the application's thread-specific data
+    cleanup functions even after arenas_cache is torn down.
+  - Fix xallocx() bugs related to size+extra exceeding HUGE_MAXCLASS.
+  - Fix chunk purge hook calls for in-place huge shrinking reallocation to
+    specify the old chunk size rather than the new chunk size.  This bug caused
+    no correctness issues for the default chunk purge function, but was
+    visible to custom functions set via the "arena.<i>.chunk_hooks" mallctl.
+  - Fix heap profiling bugs:
+    + Fix heap profiling to distinguish among otherwise identical sample sites
+      with interposed resets (triggered via the "prof.reset" mallctl).  This bug
+      could cause data structure corruption that would most likely result in a
+      segfault.
+    + Fix irealloc_prof() to prof_alloc_rollback() on OOM.
+    + Make one call to prof_active_get_unlocked() per allocation event, and use
+      the result throughout the relevant functions that handle an allocation
+      event.  Also add a missing check in prof_realloc().  These fixes protect
+      allocation events against concurrent prof_active changes.
+    + Fix ixallocx_prof() to pass usize_max and zero to ixallocx_prof_sample()
+      in the correct order.
+    + Fix prof_realloc() to call prof_free_sampled_object() after calling
+      prof_malloc_sample_object().  Prior to this fix, if tctx and old_tctx were
+      the same, the tctx could have been prematurely destroyed.
+  - Fix portability bugs:
+    + Don't bitshift by negative amounts when encoding/decoding run sizes in
+      chunk header maps.  This affected systems with page sizes greater than 8
+      KiB.
+    + Rename index_t to szind_t to avoid an existing type on Solaris.
+    + Add JEMALLOC_CXX_THROW to the memalign() function prototype, in order to
+      match glibc and avoid compilation errors when including both
+      jemalloc/jemalloc.h and malloc.h in C++ code.
+    + Don't assume that /bin/sh is appropriate when running size_classes.sh
+      during configuration.
+    + Consider __sparcv9 a synonym for __sparc64__ when defining LG_QUANTUM.
+    + Link tests to librt if it contains clock_gettime(2).
+
+* 4.0.0 (August 17, 2015)
+
+  This version contains many speed and space optimizations, both minor and
+  major.  The major themes are generalization, unification, and simplification.
+  Although many of these optimizations cause no visible behavior change, their
+  cumulative effect is substantial.
+
+  New features:
+  - Normalize size class spacing to be consistent across the complete size
+    range.  By default there are four size classes per size doubling, but this
+    is now configurable via the --with-lg-size-class-group option.  Also add the
+    --with-lg-page, --with-lg-page-sizes, --with-lg-quantum, and
+    --with-lg-tiny-min options, which can be used to tweak page and size class
+    settings.  Impacts:
+    + Worst case performance for incrementally growing/shrinking reallocation
+      is improved because there are far fewer size classes, and therefore
+      copying happens less often.
+    + Internal fragmentation is limited to 20% for all but the smallest size
+      classes (those less than four times the quantum).  (1B + 4 KiB)
+      and (1B + 4 MiB) previously suffered nearly 50% internal fragmentation.
+    + Chunk fragmentation tends to be lower because there are fewer distinct run
+      sizes to pack.
+  - Add support for explicit tcaches.  The "tcache.create", "tcache.flush", and
+    "tcache.destroy" mallctls control tcache lifetime and flushing, and the
+    MALLOCX_TCACHE(tc) and MALLOCX_TCACHE_NONE flags to the *allocx() API
+    control which tcache is used for each operation.
+  - Implement per thread heap profiling, as well as the ability to
+    enable/disable heap profiling on a per thread basis.  Add the "prof.reset",
+    "prof.lg_sample", "thread.prof.name", "thread.prof.active",
+    "opt.prof_thread_active_init", "prof.thread_active_init", and
+    "thread.prof.active" mallctls.
+  - Add support for per arena application-specified chunk allocators, configured
+    via the "arena.<i>.chunk_hooks" mallctl.
+  - Refactor huge allocation to be managed by arenas, so that arenas now
+    function as general purpose independent allocators.  This is important in
+    the context of user-specified chunk allocators, aside from the scalability
+    benefits.  Related new statistics:
+    + The "stats.arenas.<i>.huge.allocated", "stats.arenas.<i>.huge.nmalloc",
+      "stats.arenas.<i>.huge.ndalloc", and "stats.arenas.<i>.huge.nrequests"
+      mallctls provide high level per arena huge allocation statistics.
+    + The "arenas.nhchunks", "arenas.hchunk.<i>.size",
+      "stats.arenas.<i>.hchunks.<j>.nmalloc",
+      "stats.arenas.<i>.hchunks.<j>.ndalloc",
+      "stats.arenas.<i>.hchunks.<j>.nrequests", and
+      "stats.arenas.<i>.hchunks.<j>.curhchunks" mallctls provide per size class
+      statistics.
+  - Add the 'util' column to malloc_stats_print() output, which reports the
+    proportion of available regions that are currently in use for each small
+    size class.
+  - Add "alloc" and "free" modes for for junk filling (see the "opt.junk"
+    mallctl), so that it is possible to separately enable junk filling for
+    allocation versus deallocation.
+  - Add the jemalloc-config script, which provides information about how
+    jemalloc was configured, and how to integrate it into application builds.
+  - Add metadata statistics, which are accessible via the "stats.metadata",
+    "stats.arenas.<i>.metadata.mapped", and
+    "stats.arenas.<i>.metadata.allocated" mallctls.
+  - Add the "stats.resident" mallctl, which reports the upper limit of
+    physically resident memory mapped by the allocator.
+  - Add per arena control over unused dirty page purging, via the
+    "arenas.lg_dirty_mult", "arena.<i>.lg_dirty_mult", and
+    "stats.arenas.<i>.lg_dirty_mult" mallctls.
+  - Add the "prof.gdump" mallctl, which makes it possible to toggle the gdump
+    feature on/off during program execution.
+  - Add sdallocx(), which implements sized deallocation.  The primary
+    optimization over dallocx() is the removal of a metadata read, which often
+    suffers an L1 cache miss.
+  - Add missing header includes in jemalloc/jemalloc.h, so that applications
+    only have to #include <jemalloc/jemalloc.h>.
+  - Add support for additional platforms:
+    + Bitrig
+    + Cygwin
+    + DragonFlyBSD
+    + iOS
+    + OpenBSD
+    + OpenRISC/or1k
+
+  Optimizations:
+  - Maintain dirty runs in per arena LRUs rather than in per arena trees of
+    dirty-run-containing chunks.  In practice this change significantly reduces
+    dirty page purging volume.
+  - Integrate whole chunks into the unused dirty page purging machinery.  This
+    reduces the cost of repeated huge allocation/deallocation, because it
+    effectively introduces a cache of chunks.
+  - Split the arena chunk map into two separate arrays, in order to increase
+    cache locality for the frequently accessed bits.
+  - Move small run metadata out of runs, into arena chunk headers.  This reduces
+    run fragmentation, smaller runs reduce external fragmentation for small size
+    classes, and packed (less uniformly aligned) metadata layout improves CPU
+    cache set distribution.
+  - Randomly distribute large allocation base pointer alignment relative to page
+    boundaries in order to more uniformly utilize CPU cache sets.  This can be
+    disabled via the --disable-cache-oblivious configure option, and queried via
+    the "config.cache_oblivious" mallctl.
+  - Micro-optimize the fast paths for the public API functions.
+  - Refactor thread-specific data to reside in a single structure.  This assures
+    that only a single TLS read is necessary per call into the public API.
+  - Implement in-place huge allocation growing and shrinking.
+  - Refactor rtree (radix tree for chunk lookups) to be lock-free, and make
+    additional optimizations that reduce maximum lookup depth to one or two
+    levels.  This resolves what was a concurrency bottleneck for per arena huge
+    allocation, because a global data structure is critical for determining
+    which arenas own which huge allocations.
+
+  Incompatible changes:
+  - Replace --enable-cc-silence with --disable-cc-silence to suppress spurious
+    warnings by default.
+  - Assure that the constness of malloc_usable_size()'s return type matches that
+    of the system implementation.
+  - Change the heap profile dump format to support per thread heap profiling,
+    rename pprof to jeprof, and enhance it with the --thread=<n> option.  As a
+    result, the bundled jeprof must now be used rather than the upstream
+    (gperftools) pprof.
+  - Disable "opt.prof_final" by default, in order to avoid atexit(3), which can
+    internally deadlock on some platforms.
+  - Change the "arenas.nlruns" mallctl type from size_t to unsigned.
+  - Replace the "stats.arenas.<i>.bins.<j>.allocated" mallctl with
+    "stats.arenas.<i>.bins.<j>.curregs".
+  - Ignore MALLOC_CONF in set{uid,gid,cap} binaries.
+  - Ignore MALLOCX_ARENA(a) in dallocx(), in favor of using the
+    MALLOCX_TCACHE(tc) and MALLOCX_TCACHE_NONE flags to control tcache usage.
+
+  Removed features:
+  - Remove the *allocm() API, which is superseded by the *allocx() API.
+  - Remove the --enable-dss options, and make dss non-optional on all platforms
+    which support sbrk(2).
+  - Remove the "arenas.purge" mallctl, which was obsoleted by the
+    "arena.<i>.purge" mallctl in 3.1.0.
+  - Remove the unnecessary "opt.valgrind" mallctl; jemalloc automatically
+    detects whether it is running inside Valgrind.
+  - Remove the "stats.huge.allocated", "stats.huge.nmalloc", and
+    "stats.huge.ndalloc" mallctls.
+  - Remove the --enable-mremap option.
+  - Remove the "stats.chunks.current", "stats.chunks.total", and
+    "stats.chunks.high" mallctls.
+
+  Bug fixes:
+  - Fix the cactive statistic to decrease (rather than increase) when active
+    memory decreases.  This regression was first released in 3.5.0.
+  - Fix OOM handling in memalign() and valloc().  A variant of this bug existed
+    in all releases since 2.0.0, which introduced these functions.
+  - Fix an OOM-related regression in arena_tcache_fill_small(), which could
+    cause cache corruption on OOM.  This regression was present in all releases
+    from 2.2.0 through 3.6.0.
+  - Fix size class overflow handling for malloc(), posix_memalign(), memalign(),
+    calloc(), and realloc() when profiling is enabled.
+  - Fix the "arena.<i>.dss" mallctl to return an error if "primary" or
+    "secondary" precedence is specified, but sbrk(2) is not supported.
+  - Fix fallback lg_floor() implementations to handle extremely large inputs.
+  - Ensure the default purgeable zone is after the default zone on OS X.
+  - Fix latent bugs in atomic_*().
+  - Fix the "arena.<i>.dss" mallctl to handle read-only calls.
+  - Fix tls_model configuration to enable the initial-exec model when possible.
+  - Mark malloc_conf as a weak symbol so that the application can override it.
+  - Correctly detect glibc's adaptive pthread mutexes.
+  - Fix the --without-export configure option.
+
 * 3.6.0 (March 31, 2014)

  This version contains a critical bug fix for a regression present in 3.5.0 and
@@ -21,7 +738,7 @@ found in the git revision history:
    backtracing to be reliable.
  - Use dss allocation precedence for huge allocations as well as small/large
    allocations.
-  - Fix test assertion failure message formatting.  This bug did not manifect on
+  - Fix test assertion failure message formatting.  This bug did not manifest on
    x86_64 systems because of implementation subtleties in va_list.
  - Fix inconsequential test failures for hash and SFMT code.

@@ -516,7 +1233,7 @@ found in the git revision history:
  - Make it possible for the application to manually flush a thread's cache, via
    the "tcache.flush" mallctl.
  - Base maximum dirty page count on proportion of active memory.
-  - Compute various addtional run-time statistics, including per size class
+  - Compute various additional run-time statistics, including per size class
    statistics for large objects.
  - Expose malloc_stats_print(), which can be called repeatedly by the
    application.