FFM vs. Unsafe: Safety (Sometimes) Has a Cost

Background

The Foreign Function & Memory API (a.k.a. the FFM API) was finalized in Java 22 and allows interaction with native memory and native functions directly from Java. When interacting with native memory, the FFM API offers safe access, as opposed to the unchecked access provided by sun.misc.Unsafe.
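As a minimal sketch of what such access looks like (assuming Java 22 or later; the class name here is purely illustrative), native memory can be allocated and accessed like this:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class FfmHello {
    public static void main(String[] args) {
        // Allocate 100 bytes of native memory, freed when the arena closes
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment segment = arena.allocate(100);
            segment.set(ValueLayout.JAVA_INT, 0, 42);      // write an int at offset 0
            int value = segment.get(ValueLayout.JAVA_INT, 0);
            System.out.println(value);                      // prints 42
        }
    }
}
```

The segment's lifetime is tied to the arena: once the try-with-resources block ends, the native memory is deterministically freed.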

Safety

In the FFM API, native (and heap) memory is modeled by the class MemorySegment, which offers 64-bit addressing and offsets as well as structured memory access via various MemoryLayout classes. Access via a MemorySegment offers several safety mechanisms, including checking:

  • Bounds
  • Liveness
  • Alignment
  • Read-only state

These mechanisms are not offered via Unsafe memory access. Hence, with Unsafe, it is possible to:

  • Address out of bounds, reading/writing arbitrary memory and/or crashing the VM
  • Address memory that has already been freed or has been re-allocated
  • Use faulty offsets within an otherwise valid memory region

Furthermore, Unsafe does not offer read-only views of memory regions.
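These checks can be observed directly. The following sketch (assuming Java 22+; the class and variable names are illustrative) triggers the bounds, read-only, and liveness checks in turn:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class SafetyDemo {
    public static void main(String[] args) {
        MemorySegment segment;
        try (Arena arena = Arena.ofConfined()) {
            segment = arena.allocate(16);
            segment.set(ValueLayout.JAVA_INT, 0, 42);      // in bounds: OK
            try {
                segment.get(ValueLayout.JAVA_INT, 16);     // 4 bytes past the end
            } catch (IndexOutOfBoundsException e) {
                System.out.println("bounds check: " + e.getClass().getSimpleName());
            }
            MemorySegment readOnly = segment.asReadOnly();
            try {
                readOnly.set(ValueLayout.JAVA_INT, 0, 7);  // write to read-only view
            } catch (UnsupportedOperationException e) {
                System.out.println("read-only check: " + e.getClass().getSimpleName());
            }
        }
        try {
            segment.get(ValueLayout.JAVA_INT, 0);          // arena already closed
        } catch (IllegalStateException e) {
            System.out.println("liveness check: " + e.getClass().getSimpleName());
        }
    }
}
```

With Unsafe, each of these accesses would simply read or write whatever happens to be at the computed address, with no exception to catch.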

What Is the Performance Cost of Safety?

For a “stray” access to a MemorySegment, we cannot expect performance to be on par with Unsafe because, inevitably, the mandated checks mentioned above need to be performed. However, the assumption is that, as a segment gets accessed multiple times (e.g., in a loop) with a predictable access pattern, the cost of these checks gets amortized. As it turns out, the JIT compiler can hoist these checks outside loops and other similar constructs. This means, for example, that we only need to perform one check even though we are iterating a hundred or even a thousand times.

The benchmark below illustrates the benefits of hoisting. Here, an int element is accessed a single time, then 10, 100, and 1,000 times in a loop, using foreign memory access (“fma”) and Unsafe, with both reads and writes:

Benchmark                                        Mode  Cnt    Score   Error  Units
FMASerDeOffHeap.fmaReadSingle                    avgt   10    1.482 ± 0.020  ns/op
FMASerDeOffHeap.fmaReadLoop_10                   avgt   10    2.978 ± 0.051  ns/op
FMASerDeOffHeap.fmaReadLoop_100                  avgt   10   15.385 ± 0.218  ns/op
FMASerDeOffHeap.fmaReadLoop_1000                 avgt   10  116.588 ± 3.114  ns/op

FMASerDeOffHeap.fmaWriteSingle                   avgt   10    1.646 ± 0.024  ns/op
FMASerDeOffHeap.fmaWriteLoop_10                  avgt   10    3.289 ± 0.024  ns/op
FMASerDeOffHeap.fmaWriteLoop_100                 avgt   10   10.085 ± 0.561  ns/op
FMASerDeOffHeap.fmaWriteLoop_1000                avgt   10   32.705 ± 0.448  ns/op


FMASerDeOffHeap.unsafeReadSingle                 avgt   10    0.569 ± 0.012  ns/op
FMASerDeOffHeap.unsafeReadLoop_10                avgt   10    1.747 ± 0.023  ns/op
FMASerDeOffHeap.unsafeReadLoop_100               avgt   10   13.087 ± 0.099  ns/op
FMASerDeOffHeap.unsafeReadLoop_1000              avgt   10  117.363 ± 0.081  ns/op

FMASerDeOffHeap.unsafeWriteSingle                avgt   10    0.563 ± 0.016  ns/op
FMASerDeOffHeap.unsafeWriteLoop_10               avgt   10    1.169 ± 0.027  ns/op
FMASerDeOffHeap.unsafeWriteLoop_100              avgt   10    6.148 ± 0.528  ns/op
FMASerDeOffHeap.unsafeWriteLoop_1000             avgt   10   30.940 ± 0.147  ns/op

At first glance, it might seem strange that writes are generally faster than reads, but this is a consequence of the benchmark setup, wherein the JIT compiler can use auto vectorization for writes but not for reads. The relative difference between FMA and Unsafe is still relevant, though.

Moving on, a single read/write using FFM is almost 3x slower than Unsafe. As we move through the looping variants, the situation improves: it takes between 10 and 100 iterations to break even in the read case, and between 100 and 1,000 iterations to break even in the write case. This difference is again caused by auto vectorization: as the code that writes is vectorized, there is less code for the CPU to execute (multiple elements are written in a single SIMD instruction), which means the “fixed” costs introduced by FFM take longer to amortize.

Can we do better? Well, the problem in the benchmarks is that we’re loading the memory segment from a field, memSegment (see below). As such, the JIT compiler cannot “see” what the segment size will be and use that information to eliminate bounds checks in the compiled code.

@Benchmark
public void fmaReadLoop_1000(Blackhole blackhole) {
    // memSegment is a field, so its size is opaque to the JIT compiler
    for (int i = 0; i < 4000; i += 4) {
        blackhole.consume(memSegment.get(ValueLayout.JAVA_INT_UNALIGNED, i));
    }
}

But what if we created a memory segment “on the fly” based on the unsafe (or original segment) address? This trick has been discussed in the past; it looks something like this:

@Benchmark
public void fmaReadLoop_1000(Blackhole blackhole) {
    // Re-derive a fixed-size segment so the JIT compiler sees a constant size
    MemorySegment memSegment = MemorySegment.ofAddress(bufferUnsafe).reinterpret(4000);
    for (int i = 0; i < 4000; i += 4) {
        blackhole.consume(memSegment.get(ValueLayout.JAVA_INT_UNALIGNED, i));
    }
}

On the surface, this looks the same as before (we still have to execute all the checks!). But there’s a crucial difference: the JIT compiler can now see that memSegment will always be backed by the global arena (because of MemorySegment::ofAddress), and that its size will always be 4,000 (because of MemorySegment::reinterpret). The JIT compiler can use this information to eliminate the cost of some of the checks, as demonstrated by the benchmark below:

Benchmark                                        Mode  Cnt    Score   Error  Units
FMASerDeOffHeapReinterpret.fmaReadSingle         avgt   10    0.588 ± 0.016  ns/op
FMASerDeOffHeapReinterpret.fmaReadLoop_10        avgt   10    1.762 ± 0.025  ns/op
FMASerDeOffHeapReinterpret.fmaReadLoop_100       avgt   10   13.370 ± 0.028  ns/op
FMASerDeOffHeapReinterpret.fmaReadLoop_1000      avgt   10  124.499 ± 1.051  ns/op

FMASerDeOffHeapReinterpret.fmaWriteSingle        avgt   10    0.548 ± 0.002  ns/op
FMASerDeOffHeapReinterpret.fmaWriteLoop_10       avgt   10    1.180 ± 0.010  ns/op
FMASerDeOffHeapReinterpret.fmaWriteLoop_100      avgt   10    6.278 ± 0.301  ns/op
FMASerDeOffHeapReinterpret.fmaWriteLoop_1000     avgt   10   38.298 ± 0.792  ns/op


FMASerDeOffHeapReinterpret.unsafeReadSingle      avgt   10    0.564 ± 0.005  ns/op
FMASerDeOffHeapReinterpret.unsafeReadLoop_10     avgt   10    1.661 ± 0.013  ns/op
FMASerDeOffHeapReinterpret.unsafeReadLoop_100    avgt   10   12.514 ± 0.023  ns/op
FMASerDeOffHeapReinterpret.unsafeReadLoop_1000   avgt   10  115.906 ± 4.542  ns/op

FMASerDeOffHeapReinterpret.unsafeWriteSingle     avgt   10    0.577 ± 0.005  ns/op
FMASerDeOffHeapReinterpret.unsafeWriteLoop_10    avgt   10    1.114 ± 0.003  ns/op
FMASerDeOffHeapReinterpret.unsafeWriteLoop_100   avgt   10    6.028 ± 0.140  ns/op
FMASerDeOffHeapReinterpret.unsafeWriteLoop_1000  avgt   10   30.631 ± 0.928  ns/op

As you can observe, the FFM version and the Unsafe version are now much closer to each other (though, again, there are still some differences between reads and writes).

Is It Worth It?

Removing the safety belt and the roll cage from a racing car would undoubtedly make it go slightly faster, but this is almost never done and is forbidden by the rules of all major racing classes. We should think about software development in the same way: we should almost never use Unsafe to eke out the last few percent of performance while sacrificing safety.

So, where does this leave the whole FFM vs. Unsafe comparison? Unfortunately, it’s a bit hard to do a straight comparison here because, in a way, we’re comparing apples with oranges. Unsafe (by virtue of being… unsafe) just accesses memory and does not perform any additional checks. FFM, on the contrary, is an API designed around safe memory access, and this safety has a cost, especially in the stray access case.

While the OpenJDK community will keep working towards improving the performance of FFM as much as possible, it is unrealistic to expect that the stray access case will be on par with Unsafe. That said, in realistic use cases, this has rarely been an issue. In real code, off-heap memory access typically comes in two shades. There are cases where a stray access is surrounded by a lot of other code. And then there are cases, like Apache Lucene, where the same segment is accessed in loops over and over (sometimes even using the Vector API). Optimizing the first case is not too interesting: in such cases, the performance of a single memory access is often irrelevant. Optimizing the second case, on the other hand, is very important, and the benchmarks above show that, as you keep looping over the same segment, FFM quickly reaches parity with Unsafe (we would, of course, love to reduce the “break-even point” over time).

There are, of course, pathological cases where the access pattern is not predictable and cannot be speculated upon (think of an off-heap binary search or something like that). In such cases, the additional cost of the checks may begin to accumulate. For these situations, tricks like the one shown above (using reinterpret) might be very useful to get back to a performance profile that is closer to Unsafe. But you should reach for those tricks sparingly - it is likely that, in most cases, no such trick is needed - because either the performance of memory access is not critical enough or because access occurs in a loop that the JIT compiler can already optimize well.
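To make the pathological case concrete, here is a hypothetical off-heap binary search (the class and method names are assumptions, not taken from the benchmarks above) that applies the reinterpret trick so the JIT compiler sees a constant segment size despite the unpredictable access pattern:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class OffHeapBinarySearch {
    // Classic binary search over a sorted segment of ints
    static int binarySearch(MemorySegment segment, int key) {
        int lo = 0;
        int hi = (int) (segment.byteSize() / Integer.BYTES) - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int v = segment.getAtIndex(ValueLayout.JAVA_INT, mid);
            if (v < key) lo = mid + 1;
            else if (v > key) hi = mid - 1;
            else return mid;
        }
        return -1;
    }

    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment segment = arena.allocate(ValueLayout.JAVA_INT, 1000);
            for (int i = 0; i < 1000; i++) {
                segment.setAtIndex(ValueLayout.JAVA_INT, i, i * 2);  // sorted data
            }
            // Re-derive a fixed-size view so the size is a constant the JIT can see
            MemorySegment view = MemorySegment.ofAddress(segment.address())
                                              .reinterpret(1000 * Integer.BYTES);
            System.out.println(binarySearch(view, 500));  // value 500 lives at index 250
        }
    }
}
```

Note that reinterpret is a restricted method: recent JDKs print a warning (and may eventually refuse the call) unless native access is enabled for the calling module, which is another reason to reach for this trick only when profiling shows it matters.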

The ability to auto vectorize access under non-predictable access patterns (like binary searches) might more than recoup the cost of using the FFM API over Unsafe. Therefore, such algorithms should be carefully crafted and evaluated, as they can be both faster and safer than more naïve algorithms using Unsafe. There are schemes that significantly improve performance while still using safe access, for example “branch-less” binary search algorithms. But more on that in another post.

What’s the Next Step?

If you use Unsafe for memory access and are running on a pre-JDK 22 release, download JDK 25 and see how the safety of your current applications can be improved by migrating to the FFM API.