FFM vs. Unsafe: Safety (Sometimes) Has a Cost
Maurizio Cimadamore, Per-Ake Minborg on June 12, 2025

Background
The Foreign Function & Memory API (a.k.a. the FFM API) was finalized in Java 22 and allows interaction with native memory and native functions directly from Java. When interacting with native memory, the FFM API offers a safe API, as opposed to access via Unsafe.
Safety
In the FFM API, native (and heap) memory is modeled by the class MemorySegment, which offers 64-bit addressing and offsets as well as structured memory access via various MemoryLayout classes. Access via a MemorySegment offers several safety mechanisms (each is illustrated in the sketch after this list), including checking:
- Bounds
- Liveness
- Alignment
- Read-only state
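Here is a minimal sketch of these checks in action (the class name and sizes are illustrative, not from the original post; each commented-out line would throw the indicated exception):

import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class SafetyDemo {
    public static void main(String[] args) {
        MemorySegment segment;
        try (Arena arena = Arena.ofConfined()) {
            segment = arena.allocate(ValueLayout.JAVA_INT, 10); // 40 bytes
            segment.set(ValueLayout.JAVA_INT, 0, 42);           // in bounds: OK

            // Bounds: reading past the end throws IndexOutOfBoundsException
            // segment.get(ValueLayout.JAVA_INT, 40);

            // Alignment: an unaligned int read throws IllegalArgumentException
            // segment.get(ValueLayout.JAVA_INT, 2);

            // Read-only: writing through a read-only view throws
            // UnsupportedOperationException
            MemorySegment readOnly = segment.asReadOnly();
            // readOnly.set(ValueLayout.JAVA_INT, 0, 13);
        }
        // Liveness: the arena is closed, so this throws IllegalStateException
        // segment.get(ValueLayout.JAVA_INT, 0);
    }
}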
These mechanisms are not offered via Unsafe memory access. Hence, with Unsafe, it is possible to:
- Address out of bounds, reading/writing arbitrary memory and/or crashing the VM
- Address memory that has already been freed or has been re-allocated
- Use faulty offsets within an otherwise valid memory region
Furthermore, Unsafe does not offer read-only views of memory regions.
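For contrast, here is a sketch of the equivalent unchecked access via sun.misc.Unsafe, obtained reflectively as usual (note that these memory-access methods are deprecated for removal in recent JDKs). None of the commented-out lines would be caught; they would silently corrupt memory or crash the VM:

import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class UnsafeDemo {
    public static void main(String[] args) throws Exception {
        Field field = Unsafe.class.getDeclaredField("theUnsafe");
        field.setAccessible(true);
        Unsafe unsafe = (Unsafe) field.get(null);

        long address = unsafe.allocateMemory(40);
        unsafe.putInt(address, 42); // no bounds, liveness, or alignment checks

        // Out of bounds: reads arbitrary memory and may crash the VM
        // unsafe.getInt(address + 1_000_000);

        unsafe.freeMemory(address);

        // Use-after-free: no error is raised; behavior is undefined
        // unsafe.getInt(address);
    }
}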
What is the Performance Cost of Safety?
For a “stray” access to a MemorySegment, we cannot expect performance to be on par with Unsafe because, inevitably, the mandated checks mentioned above need to be performed. However, the assumption is that, as a segment gets accessed multiple times (e.g., in a loop) with a predictable access pattern, the cost of these checks gets amortized. As it turns out, the JIT compiler can hoist these checks outside loops and other similar constructs. This means, for example, that we only need to perform one check even though we are iterating a hundred or even a thousand times.
The benchmark below illustrates the benefits of hoisting. Here, an int element is accessed a single time, 10 times, 100 times, and 1,000 times in a loop, using foreign memory access (“fma”) and Unsafe, with both reads and writes:
Benchmark Mode Cnt Score Error Units
FMASerDeOffHeap.fmaReadSingle avgt 10 1.482 ± 0.020 ns/op
FMASerDeOffHeap.fmaReadLoop_10 avgt 10 2.978 ± 0.051 ns/op
FMASerDeOffHeap.fmaReadLoop_100 avgt 10 15.385 ± 0.218 ns/op
FMASerDeOffHeap.fmaReadLoop_1000 avgt 10 116.588 ± 3.114 ns/op
FMASerDeOffHeap.fmaWriteSingle avgt 10 1.646 ± 0.024 ns/op
FMASerDeOffHeap.fmaWriteLoop_10 avgt 10 3.289 ± 0.024 ns/op
FMASerDeOffHeap.fmaWriteLoop_100 avgt 10 10.085 ± 0.561 ns/op
FMASerDeOffHeap.fmaWriteLoop_1000 avgt 10 32.705 ± 0.448 ns/op
FMASerDeOffHeap.unsafeReadSingle avgt 10 0.569 ± 0.012 ns/op
FMASerDeOffHeap.unsafeReadLoop_10 avgt 10 1.747 ± 0.023 ns/op
FMASerDeOffHeap.unsafeReadLoop_100 avgt 10 13.087 ± 0.099 ns/op
FMASerDeOffHeap.unsafeReadLoop_1000 avgt 10 117.363 ± 0.081 ns/op
FMASerDeOffHeap.unsafeWriteSingle avgt 10 0.563 ± 0.016 ns/op
FMASerDeOffHeap.unsafeWriteLoop_10 avgt 10 1.169 ± 0.027 ns/op
FMASerDeOffHeap.unsafeWriteLoop_100 avgt 10 6.148 ± 0.528 ns/op
FMASerDeOffHeap.unsafeWriteLoop_1000 avgt 10 30.940 ± 0.147 ns/op
At first glance, it might seem strange that writes are generally faster than reads, but this is a consequence of the benchmark setup, wherein the JIT compiler can use auto-vectorization for writes but not for reads. The relative difference between FMA and Unsafe is still relevant, though.
Moving on, a single read/write using FFM is almost 3x slower than Unsafe. As we move through the looping variants, the situation improves: it takes between 10 and 100 iterations to break even in the read case, and between 100 and 1,000 iterations to break even in the write case. This difference is again caused by auto-vectorization: as the write code is vectorized, there is less code for the CPU to execute (multiple elements are written in a single SIMD instruction), which means the “fixed” costs introduced by FFM take longer to amortize.
Can we do better? Well, the problem in the benchmarks is that we’re loading the memory segment from a field, memSegment (see below). As such, the JIT compiler cannot “see” what the segment size will be and use that information to eliminate bounds checks in the compiled code.
@Benchmark
public void fmaReadLoop_1000(Blackhole blackhole) {
    // memSegment is an instance field, so its size is opaque to the JIT compiler
    for (int i = 0; i < 4000; i += 4) {
        blackhole.consume(memSegment.get(ValueLayout.JAVA_INT_UNALIGNED, i));
    }
}
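For reference, the Unsafe counterpart presumably looks something like the following sketch, where UNSAFE and bufferUnsafe stand for the benchmark’s Unsafe instance and base address (assumed names, not from the original post):

@Benchmark
public void unsafeReadLoop_1000(Blackhole blackhole) {
    // No checks: the raw address plus offset is dereferenced directly
    for (int i = 0; i < 4000; i += 4) {
        blackhole.consume(UNSAFE.getInt(bufferUnsafe + i));
    }
}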
But what if we created a memory segment “on the fly” based on the unsafe address (or on the original segment’s address)? This trick has been discussed in the past as well – something like this:
@Benchmark
public void fmaReadLoop_1000(Blackhole blackhole) {
    // Re-wrap the raw address: the resulting segment has a global scope and a constant size
    MemorySegment memSegment = MemorySegment.ofAddress(bufferUnsafe).reinterpret(4000);
    for (int i = 0; i < 4000; i += 4) {
        blackhole.consume(memSegment.get(ValueLayout.JAVA_INT_UNALIGNED, i));
    }
}
On the surface, this looks the same as before (we still have to execute all the checks!). But there’s a crucial difference: the JIT compiler can now see that memSegment will always be backed by the global arena (because of MemorySegment::ofAddress), and that its size will always be 4,000 (because of MemorySegment::reinterpret). The JIT compiler can then use this information to eliminate the cost of some of the checks, as demonstrated by the benchmark below:
Benchmark Mode Cnt Score Error Units
FMASerDeOffHeapReinterpret.fmaReadSingle avgt 10 0.588 ± 0.016 ns/op
FMASerDeOffHeapReinterpret.fmaReadLoop_10 avgt 10 1.762 ± 0.025 ns/op
FMASerDeOffHeapReinterpret.fmaReadLoop_100 avgt 10 13.370 ± 0.028 ns/op
FMASerDeOffHeapReinterpret.fmaReadLoop_1000 avgt 10 124.499 ± 1.051 ns/op
FMASerDeOffHeapReinterpret.fmaWriteSingle avgt 10 0.548 ± 0.002 ns/op
FMASerDeOffHeapReinterpret.fmaWriteLoop_10 avgt 10 1.180 ± 0.010 ns/op
FMASerDeOffHeapReinterpret.fmaWriteLoop_100 avgt 10 6.278 ± 0.301 ns/op
FMASerDeOffHeapReinterpret.fmaWriteLoop_1000 avgt 10 38.298 ± 0.792 ns/op
FMASerDeOffHeapReinterpret.unsafeReadSingle avgt 10 0.564 ± 0.005 ns/op
FMASerDeOffHeapReinterpret.unsafeReadLoop_10 avgt 10 1.661 ± 0.013 ns/op
FMASerDeOffHeapReinterpret.unsafeReadLoop_100 avgt 10 12.514 ± 0.023 ns/op
FMASerDeOffHeapReinterpret.unsafeReadLoop_1000 avgt 10 115.906 ± 4.542 ns/op
FMASerDeOffHeapReinterpret.unsafeWriteSingle avgt 10 0.577 ± 0.005 ns/op
FMASerDeOffHeapReinterpret.unsafeWriteLoop_10 avgt 10 1.114 ± 0.003 ns/op
FMASerDeOffHeapReinterpret.unsafeWriteLoop_100 avgt 10 6.028 ± 0.140 ns/op
FMASerDeOffHeapReinterpret.unsafeWriteLoop_1000 avgt 10 30.631 ± 0.928 ns/op
As you can observe, the FFM version and the Unsafe version are now much closer to each other (again, with some remaining differences between reads and writes).
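Keep in mind that this trick is itself a small step away from safety: the view produced by MemorySegment::ofAddress has a global scope, so liveness is no longer tracked, and reinterpret is a restricted method whose use may trigger a runtime warning unless --enable-native-access is specified. Below is a hedged sketch of applying the trick to a segment allocated through an Arena (names and sizes are illustrative):

import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class ReinterpretDemo {
    public static void main(String[] args) {
        final long SIZE = 4000;
        MemorySegment view;
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment original = arena.allocate(SIZE, 4);
            // Global-scope view with a constant size the JIT compiler can see:
            view = MemorySegment.ofAddress(original.address()).reinterpret(SIZE);
            view.set(ValueLayout.JAVA_INT, 0, 42); // checked against the constant size
        }
        // Caution: the arena is now closed, but 'view' has a global scope, so
        // accessing it here would be a use-after-free, just as with Unsafe.
    }
}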
Is It Worth It?
Removing the safety belt and the roll cage from a racing car would undoubtedly make it go slightly faster, but this is almost never done and is forbidden by the rules of all major racing classes. We should think about software development in the same way: we should almost never use Unsafe to squeeze out the last few percent of performance while sacrificing safety.
So, where does this leave the whole FFM vs. Unsafe comparison? Unfortunately, it’s a bit hard to do a straight comparison here because, in a way, we’re comparing apples with oranges. Unsafe (by virtue of being… unsafe) just accesses memory and does not perform any additional checks. FFM, by contrast, is an API designed around safe memory access - and this safety has a cost, especially in the stray access case.
While the OpenJDK community will keep working towards improving the performance of FFM as much as possible, it is unrealistic to expect that the stray access case will be on par with Unsafe. That said, in realistic use cases, this has rarely been an issue. In real code, off-heap memory access typically comes in two different flavors. There are cases where a stray access is surrounded by a lot of other code. And then there are cases, like Apache Lucene, where the same segment is accessed in loops over and over (sometimes even using the Vector API). Optimizing the first case is not too interesting - in such cases, the performance of a single memory access is often irrelevant. Optimizing the second case, on the other hand, is very important - and the benchmarks above show that, as you keep looping over the same segment, FFM quickly reaches parity with Unsafe (we would, of course, love to reduce the “break-even point” over time).
There are, of course, pathological cases where the access pattern is not predictable and cannot be speculated upon (think of an off-heap binary search, as sketched below). In such cases, the additional cost of the checks may begin to accumulate. For these situations, tricks like the one shown above (using reinterpret) might be very useful to get back to a performance profile that is closer to Unsafe. But you should reach for those tricks sparingly - it is likely that, in most cases, no such trick is needed, either because the performance of memory access is not critical enough or because access occurs in a loop that the JIT compiler can already optimize well.
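As an illustration of such an unpredictable pattern, consider a binary search over a segment of sorted int values (a hypothetical helper, not from the original post). Every probe lands at a data-dependent offset, so each access pays the full cost of the checks:

// Binary search over a MemorySegment holding sorted JAVA_INT values;
// returns the element index, or -1 if the key is not present.
static long binarySearch(MemorySegment segment, int key) {
    long low = 0;
    long high = segment.byteSize() / Integer.BYTES - 1;
    while (low <= high) {
        long mid = (low + high) >>> 1; // data-dependent probe: not predictable
        int value = segment.getAtIndex(ValueLayout.JAVA_INT, mid);
        if (value < key) {
            low = mid + 1;
        } else if (value > key) {
            high = mid - 1;
        } else {
            return mid;
        }
    }
    return -1;
}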
The ability to auto-vectorize access under non-predictable access patterns (like binary searches) might more than recoup the cost of using the FFM API over Unsafe. Therefore, such algorithms should be carefully crafted and evaluated so that they are both faster and safer than more naïve algorithms using Unsafe. There are schemes that significantly improve performance while still using safe access, for example, “branch-less binary search algorithms”. But more on that in another post.
What’s the Next Step?
If you use Unsafe for memory access and are running on a pre-JDK 22 release, download JDK 25 and see how the safety of your current applications can be improved by migrating to the FFM API.
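As a starting point, here is a hedged before/after sketch of what such a migration might look like for a simple off-heap write (the Unsafe side is shown in comments; names are illustrative):

import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class MigrationDemo {
    public static void main(String[] args) {
        // Before, with Unsafe (no checks, manual deallocation):
        //   long address = unsafe.allocateMemory(1024);
        //   unsafe.putLong(address, 42L);
        //   unsafe.freeMemory(address);

        // After, with the FFM API (checked access, deterministic deallocation):
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment segment = arena.allocate(1024, 8);
            segment.set(ValueLayout.JAVA_LONG, 0, 42L);
        } // the memory is freed when the arena is closed
    }
}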