Experimenting with JFR Mirror Events

This writeup describes a number of observations made during an experiment to determine the feasibility of using JFR mirror events in core parts of the JDK, in particular the Networking APIs. Mirror events are an internal mechanism used by the JDK that does not depend on the jdk.jfr module; currently, the only mirror events are defined in java.base. Since mirror events are only available within the JDK, much of what is discussed here is directly relevant to developers working on the JDK (rather than application or library developers), but certain concerns and techniques may be of interest to a wider audience.

When implementing events within the JDK, and specifically within the base module, there are currently a number of approaches: 1) a native implementation in the JVM, 2) bytecode instrumentation at run time, or 3) a mirror event. The first, a native implementation in the JVM, is most appropriate for low-level operations of the JVM itself, e.g. GC statistics. For monitoring the JDK Core libraries, the latter two approaches, bytecode instrumentation or a mirror event, are more appropriate. But which approach is better, and what are the trade-offs between them? By taking one particular event as an example, this writeup offers some insights and analysis to help answer that question.

Currently (in JDK 17), there are JFR events that allow monitoring of socket read and write operations, namely jdk.SocketReadEvent and jdk.SocketWriteEvent. These events are capable of monitoring the I/O of both java.net.Socket and NIO SocketChannel. Looking at the source code of the socket or channel implementations, there is no reference to JFR or to these events; the code that generates them is injected through bytecode instrumentation (of the socket and channel implementation classes) at run time, when JFR is enabled, as sketched below. We'll first generate performance data when recording the existing socket write event, then when it is reimplemented as a mirror event, and finally we'll look at areas for potential improvement.
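
To make the instrumentation technique concrete, here is an abbreviated, illustrative sketch of the shape of the JDK 17 instrumentor class, jdk.jfr.internal.instrument.SocketChannelImplInstrumentor. Details are simplified, so treat it as a sketch of the mechanism rather than the exact source:

// Abbreviated, illustrative sketch, not the exact JDK source. At run time
// JFR grafts this method body onto sun.nio.ch.SocketChannelImpl.write;
// the apparent self-calls below end up invoking the original method.
@JIInstrumentationTarget("sun.nio.ch.SocketChannelImpl")
final class SocketChannelImplInstrumentor {

  @JIInstrumentationMethod
  public int write(ByteBuffer buf) throws IOException {
    SocketWriteEvent event = SocketWriteEvent.EVENT.get();  // thread-local, reused
    if (!event.isEnabled()) {
      return write(buf);            // original method, no JFR work
    }
    int bytesWritten = 0;
    try {
      event.begin();
      bytesWritten = write(buf);    // original method
    } finally {
      event.end();
      if (event.shouldCommit()) {
        // populate host, address, port, and bytesWritten, then commit
        // and reset the reused thread-local event instance
        event.bytesWritten = bytesWritten < 0 ? 0 : bytesWritten;
        event.commit();
        event.reset();
      }
    }
    return bytesWritten;
  }
}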

Before digging into the specifics of the socket write event, we should outline the higher-level, more general, considerations relating to the design and implementation of events.

General Considerations

When designing an event (or group of events), amongst the considerations are: 1) the information that we want to expose, 2) the cost of gathering and recording that information, and 3) what type of event to use to convey that information.

  • How often will an event be raised?
  • Is the event timed or untimed?
  • Is the event periodic or per operation?

We should always keep use cases at the fore of our design: what questions are we trying to help developers answer, what insights into the system do we want to offer, and so on.

There are multiple concerns when it comes to analyzing the impact of events on the code in question:

  • Should the event be written as a mirror event, directly into the source code implementation?
    • Maintainability: straightforward to maintain
    • Readability: more eyes on the code, but does it add clutter?
    • Performance: what is the performance impact when JFR is not enabled?
    • Testability: co-located with the regression tests
  • Should the event use the instrumentation technique, whereby the bytecode is instrumented with the event at run time?
    • Maintainability: harder to maintain, more fragile
    • Readability: less clutter in the operational code
    • Performance: no performance impact when JFR is not enabled
    • Testability: requires good test coverage to ensure no breakages

There are multiple concerns when it comes to performance:

  • Startup performance - how long does it take JFR to initialize and start up?
    • Do more Java events mean a longer startup time?
  • Run time performance when JFR is not enabled
  • Run time performance when JFR is enabled, but the particular event is not enabled
  • Run time performance when JFR is enabled, and the particular event is enabled

The remainder of this writeup will focus on the latter concerns, relating to run time performance.

Analyzing Socket Write Event

For events in the JDK Core libraries, one of the significant implementation questions is whether to use bytecode instrumentation or a mirror event. We will use a JMH benchmark to help answer this question. The benchmark exercises a single method of SocketChannel, write(ByteBuffer), which can be taken as representative of other I/O-performing methods.
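
The actual benchmark is a modified, two-threaded version of JMH's SocketChannelReadWrite, linked in the results section below. As a rough illustration of its shape only, here is a simplified, single-threaded variant that ping-pongs a single byte over a loopback connection; the class and member names are hypothetical:

import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Thread)
public class SocketWriteBench {
  ServerSocketChannel listener;
  SocketChannel client, server;
  ByteBuffer wb, rb;

  @Setup
  public void setup() throws IOException {
    // loopback connection; a single byte keeps time spent in I/O to a minimum
    listener = ServerSocketChannel.open()
            .bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
    client = SocketChannel.open(listener.getLocalAddress());
    server = listener.accept();
    wb = ByteBuffer.allocate(1);
    rb = ByteBuffer.allocate(1);
  }

  @Benchmark
  public void test() throws IOException {
    wb.clear();
    client.write(wb);   // the operation under measurement
    rb.clear();
    server.read(rb);    // drain the byte so the next write cannot block
  }

  @TearDown
  public void teardown() throws IOException {
    client.close();
    server.close();
    listener.close();
  }
}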

When JFR is enabled, an invocation of the write method will record a SocketWriteEvent. Here is the declaration of this event:

public final class SocketWriteEvent ... {
  @Label("Remote Host")
  public String host;

  @Label("Remote Address")
  public String address;

  @Label("Remote Port")
  public int port;

  @Label("Bytes Written")
  @Description("Number of bytes written to the socket")
  @DataAmount
  public long bytesWritten;
}

The event captures the name, address, and port number of the remote host, along with the number of bytes written. (There are additional captured values not shown here, such as the start time and duration.)

This event is implemented using the bytecode instrumentation technique, which transforms the write method of the SocketChannelImpl class at run time. With a simple benchmark that exchanges a single byte of data with a remote endpoint, we can measure the impact that recording this event has on the overall performance of the write operation. We will then rewrite the code so that SocketWriteEvent is implemented using a mirror event, and rerun the benchmark, comparing before and after for the following cases:

  1. JFR disabled : JFR-disabled
  2. JFR enabled, SocketWrite event disabled : JFR-enabled-socketWrite-no
  3. JFR enabled, SocketWrite event enabled : JFR-enabled-socketWrite-yes

There is a fourth case which will not be considered as part of this writeup: the case where JFR is enabled and the particular event is enabled, but the threshold is not reached. When monitoring socket writes on non-blocking channels, it is often the case that you want a threshold of 0, i.e. to log all events. We'll return to this fourth case in a separate writeup, as it is more involved.

The benchmark reports the average execution time of a single benchmark invocation; lower scores are better. Times were measured on Linux x86_64, Intel Xeon-8167M 2.00GHz, and the runs use the -prof gc profiler (the results elide some output lines to improve readability). The benchmark is a modified version of SocketChannelReadWrite, which can be found here.

Baseline - JDK 17 EA

The socket write event is implemented using bytecode instrumentation.

Benchmark                                                                  (setupJFR)  Mode  Cnt    Score     Error   Units
SocketChannelReadWrite.test                                              JFR-disabled  avgt   50    2.234 ±   0.009   us/op
SocketChannelReadWrite.test:·gc.alloc.rate                               JFR-disabled  avgt   50   ≈ 10⁻⁴            MB/sec
SocketChannelReadWrite.test:·gc.alloc.rate.norm                          JFR-disabled  avgt   50    0.001 ±   0.001    B/op
SocketChannelReadWrite.test                                JFR-enabled-socketWrite-no  avgt   50    2.223 ±   0.006   us/op
SocketChannelReadWrite.test:·gc.alloc.rate                 JFR-enabled-socketWrite-no  avgt   50    0.007 ±   0.004  MB/sec
SocketChannelReadWrite.test:·gc.alloc.rate.norm            JFR-enabled-socketWrite-no  avgt   50    0.024 ±   0.013    B/op
SocketChannelReadWrite.test                               JFR-enabled-socketWrite-yes  avgt   50    3.058 ±   0.046   us/op
SocketChannelReadWrite.test:·gc.alloc.rate                JFR-enabled-socketWrite-yes  avgt   50   68.974 ±   3.317  MB/sec
SocketChannelReadWrite.test:·gc.alloc.rate.norm           JFR-enabled-socketWrite-yes  avgt   50  332.938 ±  20.804    B/op
SocketChannelReadWrite.test:·gc.churn.G1_Eden_Space.norm  JFR-enabled-socketWrite-yes  avgt   50  105.106 ± 286.331    B/op

We can see that the average times per operation for JFR-disabled and JFR-enabled-socketWrite-no are very similar, and that there is little or no allocation; this is largely thanks to the optimizing HotSpot JIT compiler. For JFR-enabled-socketWrite-yes, the per operation time increases (which is expected) and allocation occurs at a normalized rate of 332 bytes per operation (most of which is churn).

Note that the benchmark is not really representative of a real-world application, since it writes just a single byte over and over (approximately 4 million times per iteration on my machine). So the per operation cost when the event is enabled should not be taken as a general indication of the overhead of this event, since that cost is largely amortized when writing larger amounts of data, where the majority of time is spent doing I/O. For our analysis we want to reduce the time spent in I/O so that we can focus on the event overhead, hence the use of a single byte.

Using a Mirror Event

In this run the socket write event is reimplemented using a mirror event in the base module (see here for the code changes). The code that determines the actual data captured by the event is largely the same as that of the bytecode instrumented version in the previous run, but in this implementation a mirror event object is allocated for each event.
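
A hypothetical sketch of what the mirror-event write path looks like (implWrite and the overall shape are illustrative, not the actual patch); SocketWriteEvent here extends jdk.internal.event.Event, which java.base can use without depending on the jdk.jfr module:

// Hypothetical sketch (implWrite is an illustrative name). Unlike the
// instrumented version's reused thread-local, a mirror event object is
// allocated on every write, even just to check whether it is enabled.
int write(ByteBuffer buf) throws IOException {
  SocketWriteEvent event = new SocketWriteEvent();  // per-operation allocation
  if (!event.isEnabled()) {
    return implWrite(buf);        // fast path when the event is disabled
  }
  int n;
  try {
    event.begin();
    n = implWrite(buf);
  } finally {
    event.end();
  }
  if (event.shouldCommit()) {
    // populate host, address, and port; the code that determines these
    // values is discussed in the next section
    event.bytesWritten = n < 0 ? 0 : n;
    event.commit();
  }
  return n;
}

The benchmark results for this version: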

Benchmark                                                                      (setupJFR)  Mode  Cnt    Score     Error   Units
SocketChannelReadWrite.test                                                  JFR-disabled  avgt   50    2.220 ±   0.014   us/op
SocketChannelReadWrite.test:·gc.alloc.rate                                   JFR-disabled  avgt   50   ≈ 10⁻⁴            MB/sec
SocketChannelReadWrite.test:·gc.alloc.rate.norm                              JFR-disabled  avgt   50    0.001 ±   0.001    B/op
SocketChannelReadWrite.test                                    JFR-enabled-socketWrite-no  avgt   50    2.220 ±   0.010   us/op
SocketChannelReadWrite.test:·gc.alloc.rate                     JFR-enabled-socketWrite-no  avgt   50    0.007 ±   0.004  MB/sec
SocketChannelReadWrite.test:·gc.alloc.rate.norm                JFR-enabled-socketWrite-no  avgt   50    0.025 ±   0.013    B/op
SocketChannelReadWrite.test                                   JFR-enabled-socketWrite-yes  avgt   50    3.123 ±   0.040   us/op
SocketChannelReadWrite.test:·gc.alloc.rate                    JFR-enabled-socketWrite-yes  avgt   50   81.809 ±   5.572  MB/sec
SocketChannelReadWrite.test:·gc.alloc.rate.norm               JFR-enabled-socketWrite-yes  avgt   50  401.742 ±  25.480    B/op
SocketChannelReadWrite.test:·gc.churn.G1_Eden_Space.norm      JFR-enabled-socketWrite-yes  avgt   50  155.162 ± 380.211    B/op

We can see that the average time and normalized allocation rate per operation for JFR-disabled and JFR-enabled-socketWrite-no are very similar to those of the previous baseline run. For JFR-enabled-socketWrite-yes, allocation occurs at a normalized rate of 401 bytes per operation, an increase of approximately 20% over the baseline run.

Tweaking a little

To determine the actual data captured by the event, specifically the remote host name and address, a small body of code needs to be run. The code retrieves the host name and address using the public API and string manipulation; not the most efficient approach, and possibly the cause of much of our allocation? (This code is reasonable when constrained to using only the public API.) Here it is:

  String hostString  = isa.getAddress().toString();
  int delimiterIndex = hostString.lastIndexOf('/');
  event.host = hostString.substring(0, delimiterIndex);
  event.address = hostString.substring(delimiterIndex + 1);

In the mirror event version, the code resides in the base module alongside the channel implementation, so we can allow ourselves the freedom to look for a better alternative (not using the public API). One option that comes immediately to mind is to extend the Net shared secret to return the hostname string from an InetAddress. The address can be retrieved from InetAddress::getHostAddress(). The following code is functionally equivalent:

  InetAddress address = isa.getAddress();
  var h = JNINA.hostnameValue(address);
  event.host = h == null ? "" : h;
  event.address = address.getHostAddress();
  event.port = isa.getPort();
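
JNINA above presumably abbreviates a java.net InetAddress shared-secret accessor. A minimal sketch of the assumed extension, with the accessor name taken from the usage above (illustrative, not the actual patch):

// Sketch of the assumed shared-secret extension (illustrative).
// The interface lives in jdk.internal.access; its implementation is
// registered by java.net.InetAddress, which can expose its private
// hostname field directly, with no lookup or string manipulation.
public interface JavaNetInetAddressAccess {
  // ... existing accessors elided ...

  // returns the hostname if one is already recorded, or null
  String hostnameValue(InetAddress ia);
}

// retrieved once by the channel implementation in java.base:
private static final JavaNetInetAddressAccess JNINA =
        SharedSecrets.getJavaNetInetAddressAccess();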

Let’s run the benchmark again with this:

Benchmark                                                                  (setupJFR)  Mode  Cnt    Score    Error   Units
...
SocketChannelReadWrite.test                               JFR-enabled-socketWrite-yes  avgt   50    2.869 ±  0.022   us/op
SocketChannelReadWrite.test:·gc.alloc.rate                JFR-enabled-socketWrite-yes  avgt   50   40.773 ±  0.319  MB/sec
SocketChannelReadWrite.test:·gc.alloc.rate.norm           JFR-enabled-socketWrite-yes  avgt   50  184.139 ±  0.029    B/op
SocketChannelReadWrite.test:·gc.churn.G1_Eden_Space.norm  JFR-enabled-socketWrite-yes  avgt   50   55.492 ± 60.860    B/op

We can see that the per operation time has improved, and allocation has been reduced. Compared to the baseline, the per operation time is only marginally better, but allocation has been reduced by approximately 45% (184 versus 332 bytes per operation).

But we can go further. What if we cache the host name and address for a connected channel?
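
A minimal sketch of the idea (field and method names are illustrative, not the actual patch): compute the strings once, when the channel connects, so that the per-write event code only reads fields.

// Hypothetical sketch: cache the event values once per connection.
private String cachedHostName;      // "" if no hostname is recorded
private String cachedHostAddress;
private int cachedPort;

private void cacheRemoteEventValues(InetSocketAddress isa) {
  InetAddress remote = isa.getAddress();
  String h = JNINA.hostnameValue(remote);
  cachedHostName = (h == null) ? "" : h;
  cachedHostAddress = remote.getHostAddress();
  cachedPort = isa.getPort();
}

// on the write path, populating the event becomes simple field reads:
event.host = cachedHostName;
event.address = cachedHostAddress;
event.port = cachedPort;

Rerunning the benchmark with the cached values: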

Benchmark                                                                      (setupJFR)  Mode  Cnt   Score    Error   Units
SocketChannelReadWrite.test                                   JFR-enabled-socketWrite-yes  avgt   50   2.718 ±  0.042   us/op
SocketChannelReadWrite.test:·gc.alloc.rate                    JFR-enabled-socketWrite-yes  avgt   50  11.274 ±  0.180  MB/sec
SocketChannelReadWrite.test:·gc.alloc.rate.norm               JFR-enabled-socketWrite-yes  avgt   50  48.221 ±  0.133    B/op
SocketChannelReadWrite.test:·gc.churn.G1_Eden_Space.norm      JFR-enabled-socketWrite-yes  avgt   50  50.237 ± 50.490    B/op

We can see that both the per operation time and the allocation overhead have been significantly reduced: a 10% time reduction per operation, and an 85% reduction in allocation, over the baseline. This would seem to indicate that much of the allocation overhead relates to the code that determines the event data values, rather than to the event (and its infrastructure) itself.

Finally, so as to “compare apples with apples”, here are the results of running a bytecode instrumented socket write event, accessing the cached host name and address fields:

SocketChannelReadWrite.test                      JFR-enabled-socketWrite-yes  avgt   50   2.630 ±  0.015   us/op
SocketChannelReadWrite.test:·gc.alloc.rate       JFR-enabled-socketWrite-yes  avgt   50   0.030 ±  0.007  MB/sec
SocketChannelReadWrite.test:·gc.alloc.rate.norm  JFR-enabled-socketWrite-yes  avgt   50   0.123 ±  0.029    B/op
SocketChannelReadWrite.test:·gc.count            JFR-enabled-socketWrite-yes  avgt   50     ≈ 0           counts

We can see that the per operation time is a little less than in the prior mirror event run, and that allocation has been almost completely eliminated. This is probably as good as it gets, but as we have seen, it's possible to replace the existing bytecode instrumented socket write event with a mirror event that offers better performance and allocation characteristics.

Conclusion

The design of mirror events requires creating an event object each time an event is raised, but in and of itself that should not be the primary concern when designing and implementing an event. Rather, the total performance and allocation impact should be considered as a whole. What is clear from this particular experiment is that how the actual data captured by the event is determined is likely to impact performance more than which mechanism is used to implement the event. Additionally, using a mirror event allows an implementation more freedom in how this determination is coded (more and easier access to private internals).