Mark–Scavenge: Waiting for Trash to Take Itself Out

Jonas Norlinder, Erik Österlund, Tobias Wrigstad David Black-Schaffer (guest) on November 22, 2024

This blog post summarises a new garbage collection algorithm called Mark-Scavenge, which highlights how using reachability as a proxy for liveness in moving GCs leads to unnecessary data movement and how we can address this. This work is from the latest paper within the research collaboration between Oracle and Uppsala University. The full paper can be found on the ACM website.

Modern garbage collection assumes that the weak generational hypothesis holds and that most objects die young [4]. To that end, the heap is partitioned by age into regions. If the weak generational hypothesis holds, most objects in a young region will become garbage at almost any given time. Thus, within the young generation, a garbage collection algorithm that operates on live objects will be more efficient than one that operates on garbage objects, as the number of live objects will be substantially fewer. Consider the example below where the GC needs to reclaim memory.

Figure 1: Reclaiming Memory Through Garbage Collection

Since regions A and C have four live objects and region B has only two, selecting the sparsely populated region B for evacuation is the most efficient, as you only get memory back if you empty the whole region. In this example, we evacuate (move) objects e and f to region D, and after updating the incoming pointers, the entire region B is now free. So far so good, but what does it mean when a GC considers something to be live? According to The Garbage Collection Handbook [1]:

An object is said to be live if it will be accessed at some time in the future execution of the mutator.

Alas, such information about an object’s “true liveness” is only available to those who can predict the future, and for correctness, they must never mispredict an object as dead when it is not. Instead of such perfect predictions, GCs typically approximate liveness using pointer reachability [1]. However, using reachability may result in moving objects to reclaim memory can lead to “wasted work”. The wasted work stems from GC moving objects that while reachable will never be touched again by the system.

In OpenJDK, our object of study, all GCs approximate liveness as reachability. Serial and G1, for example, do this in the context of scavenging. Scavenging means traversing the heap from the roots and relocating reachable objects upon discovery. ZGC uses separate Mark and Evacuate phases to discover and copy objects. An object is marked as live on discovery and later copied if found live during marking.

We set out to study to what extent using reachability as liveness causes the copying of effectively-garbage objects. As far as we know, we are the first to publish a study of this kind. Our initial hypothesis was that collectors based on separate Mark and Evacuate phases would do worse because liveness is computed long before the liveness information is acted upon. This delay, notably absent in scavenging, might cause the liveness-information to become stale. On the other hand, a computed live map, after marking, permits only evacuating sparsely populated regions, meaning that evacuation can be selective, focusing on pages that give a higher return on investment (e.g., copying a handful of objects to make the entire page free). Ultimately, it was only possible to understand the amount of wasted work by running experiments.

Figure 2: Mark-Evacuate | Scavenge Phases

We started measuring wasted work in ZGC because it had the most fitting infrastructure we needed: load-barriers. Without the load-barrier, we could not easily have detected object accesses – i.e., whether the program used the objects eventually. We instrumented ZGC to identify young objects that were reachable during Mark phase but never subsequently accessed before becoming unreachable in the next GC cycle. This identifies GC relocation which is wasted work. We then measured how many of these objects were moved by the GC. We refer to this as wasted work: the movement of actually-dead objects.

We ran the DaCapo benchmark suite [2] and SPECjbb2015 [3] using our modified ZGC and collected information about the true liveness (use) of evacuated objects. Each bar in the 4th group – ZGC (barrier) – in the plot below represents a particular benchmark and measures the percentage of unnecessary objects evacuated because they were never again touched and became unreachable by the next GC. The data shows a very high rate of objects being evacuated unnecessarily due to the pointer reachability over-approximation of liveness and the delay between marking and evacuation.

Figure 3: Benchmarks with Our Modified ZGC

We repeated the G1 and Serial GC experiments with a less precise measurement due to the absence of a load barrier in these GCs. The results are similar, and not dramatically different from the precise load-barrier based measurements (ZGC barrier). While objects are evacuated immediately upon discovery in scavenging, all reachable objects are evacuated, leading to considerable wasted work. Our data shows that despite the time between marking discovery and copying in scavenging being shorter than in Mark–Evacuate, it does not necessarily reduce wasted work. Our conclusion is naively using reachability as a proxy for liveness in the context of a moving GC leads to considerable wasted work.

The more concurrent a GC is, the greater its freedom to schedule when it does work. However, a concurrent GC needs to predict and match the program’s allocation rate because it does not stop the program during garbage collection. However, to withstand misprediction, it also requires a “headroom” – a buffer or available memory – that can accommodate allocation spikes while the GC is working to free up memory. The headroom is typically unused, but it is a necessary safety precaution.

We propose a new GC algorithm to reduce wasted work by delaying evacuation until the next GC cycle to increase the chances of objects becoming unreachable. We call this algorithm Mark–Scavenge. Mark-Scavenge performs reachability discovery similarly to Mark-Evacuate and selects only sparsely populated regions for evacuation. Unlike Mark-Evacuate, there is no immediate evacuation after marking. This means that the only new memory that becomes available after marking are the regions that turned out to be completely empty. Our insight is that the headroom can serve allocations after marking. New headroom can be created on demand from sparsely populated regions using standard evacuation since we have liveness information for these regions. If this does not happen, objects in these regions have had an extra GC cycle to become unreachable, and if that happens, the GC does not need to evacuate those objects as they are dead. If an object is still reachable in the next GC cycle, it is evacuated in a scavenge fashion and relocated upon discovery. Thus, Mark–Scavenge combines the best elements of Scavenging and Mark–Evacuate: the ability to selectively evacuate and no delay between discovery and evacuation after – this is new to Mark–Scavenge – a delay.

Figure 4: Mark-Scavenge - Delay Evacuation Further for Less Wasted Work

We implemented Mark-Scavenge by replacing Mark-Evacuate for the young collection in generational ZGC. The plot below shows how Mark-Scavenge manages to eliminate a lot of the wasted work.

Figure 5: Benchmarks on Wasted Work

The most striking result is (up to) a 91% reduction in the relocation of dead objects. If the GC is run with as much memory as is needed to avoid allocation stalls, the cost of the wasted work is largely hidden. However, if we run on a heavily loaded machine (e.g., the allocation rate is temporarily spiked), the performance benefits become significant: 2-4%. Our best results are from SPECjbb2015, where critical-jOPS (latency) is improved by 2–4% and max-jOPS (throughput) is improved by 2–8%. For a configuration that stresses the machine, the latency for Cassandra is reduced by 20% in the 99th, 99.9th, and 99.99th percentiles.

Our results show that equating liveness with reachability causes significant copying of garbage data in moving collectors. On heavily loaded machines, this negatively impacts performance, as demonstrated by our Mark-Scavenge implementation in ZGC. The insights of this work will enable garbage collection engineers and researchers to improve the state-of-the-art further.

References

[1] Jones et al., The Garbage Collection Handbook (2023)

[2] https://github.com/dacapobench/dacapobench/releases/tag/v23.11-chopin

[3] https://www.spec.org/jbb2015/

[4] Ungar, David. “Generation scavenging: A non-disruptive high performance storage reclamation algorithm”