JEP 400 and the Default Charset

TL;DR: Starting with JDK 18, UTF-8 is the default charset across all platforms. Make sure to test your applications, especially if you run them on Windows.


Close-up of ancient characters (photo by Raphael Schaller)

Have you ever wondered about the “default charset”? Here’s what the Charset.defaultCharset javadoc says:

The default charset is determined during virtual-machine startup and typically depends upon the locale and charset of the underlying operating system.
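You can check what your own environment reports by querying it directly. For instance, in jshell (the output below assumes a macOS or Linux host, or any JDK 18+ runtime):

jshell> java.nio.charset.Charset.defaultCharset()
$1 ==> UTF-8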

The phrase “depends upon the locale and charset of the underlying operating system” sounds a little too vague. Why is that? When Java was launched more than 25 years ago, there was no such thing as a default charset. At that time, the Java Language Specification’s adoption of Unicode as the basis of the java.lang.Character class was a brilliant choice. Fast-forward to today: UTF-8 dominates almost everywhere, especially on the web, where more than 95% of the content is encoded in UTF-8 (cf. Usage of character encodings broken down by ranking).

The UTF-8 Wikipedia page confirms that growth over the years.


Newer programming languages (e.g., Go, Rust) adopted UTF-8 as their default text encoding. In Java, the fact that Charset.defaultCharset() returns an arbitrary charset depending on the underlying OS and user environment has often been pointed out as technical debt placed on users’ shoulders. New developers should not have to deal with that historical debt.

Looking at it from a different perspective, i.e. “Where is the default charset used?”, the most typical use is probably the implicit decoder of the java.io.InputStreamReader class. Take a look at java.io.FileReader, a subclass of InputStreamReader. Suppose a Japanese text file encoded in UTF-8 is read by a FileReader instance created without specifying an explicit charset:

java.io.FileReader("test.txt") "こんにちは" (macOS) java.io.FileReader("test.txt") "ã?“ã‚“ã?«ã?¡ã? ̄" (Windows (en-US))

Here, the problem is apparent. On macOS, the default encoding used by the underlying operating system is UTF-8, so the file content is read (decoded) correctly. On the other hand, if you read the same text file on Windows (US), the content is garbled. This is because the FileReader object reads the text content with the code page 1252 encoding, which is the default encoding used on Windows with the system locale English (United States). Even on the same operating system, the result may differ depending on the user’s settings. If the user of that Windows host changes the system locale to Japanese (Japan), they would get:

java.io.FileReader("test.txt") "縺薙s縺ォ縺。縺ッ" (Windows (ja-JP))

Something’s got to give!


Making UTF-8 the Default Charset

To address this long-standing problem, JEP 400 changes the default charset to UTF-8 in JDK 18. This in fact aligns with the existing newBufferedReader/Writer methods of the java.nio.file.Files class, where UTF-8 is the default when no explicit charset is specified.

jshell> Files.newBufferedReader(Path.of("test.txt")).readLine()
$1 ==> "こんにちは"

The above example shows that a UTF-8 encoded text file can be read correctly with the java.nio.file.Files methods (here run on JDK 17), regardless of the host and/or user’s settings.

By making UTF-8 the default charset, the JDK I/O APIs will now always work in the same, predictable manner, with no need to pay attention to the host and/or user’s environment! Applications that needed consistent behavior used to specify the unsupported file.encoding system property. This is no longer necessary!

jshell> new BufferedReader(new FileReader("test.txt")).readLine()
$2 ==> "こんにちは"

The above example demonstrates that, in JDK 18, the FileReader class now works consistently with the newer Files methods, regardless of the host and/or user’s settings.

There’s one consideration that needs to be addressed: System.out/err, which are directly connected to the underlying stdout/stderr and therefore follow the host and/or user’s environment. If we changed that encoding to UTF-8, any output to System.out/err would immediately be affected and could be garbled in some environments (e.g., Windows). For that reason, the encoding used for those streams remains intact; it is equivalent to java.io.Console.charset(), which was introduced in JDK 17.
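If you want to see which encodings are in play on a given host, a small sketch like the following prints both (assuming a JDK 17+ runtime; System.console() can be null when output is redirected):

var console = System.console();                                           // may be null if stdout is redirected
if (console != null) {
    System.out.println("System.out/err encoding: " + console.charset()); // follows the host/user environment
}
System.out.println("Default charset: " + Charset.defaultCharset());      // UTF-8 as of JDK 18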


Compatibility & Mitigation Strategies

Changing the default charset to UTF-8 is the right thing to do (and was long overdue), but it does introduce some compatibility issues, especially for applications that are only deployed on Windows. We understand that some users expect the prior behavior, where the default charset depended on the host and user’s environment. In order for those applications to keep working consistently, we have provided two mitigations:

1. Source Code Recompilation

If you have the ability to recompile the source code, change the affected code to explicitly specify the charset. For example, in the code above, replace the no-charset constructor with one that takes an explicit charset, e.g., new java.io.FileReader("test.txt", StandardCharsets.UTF_8). By doing this, the behavior will be uniform across environments. If you do NOT know the charset but still want the prior behavior, use the native.encoding system property introduced in JDK 17. For example, on Windows with the English (United States) system default locale:

jshell> System.getProperty("native.encoding")
$3 ==> "Cp1252"

Thus you need to specify Cp1252 to the FileReader constructor. The modification would look like this:

String encoding = System.getProperty("native.encoding"); // Populated on Java 17 and later; null on older releases
// Fall back to the (environment-dependent) default charset on older releases
Charset cs = (encoding != null) ? Charset.forName(encoding) : Charset.defaultCharset();
var reader = new FileReader("file.txt", cs);

Speaking of compiling, the javac command also depends on the default charset. Thus you need to know which encoding the source files were saved in, which may or may not be UTF-8, and specify it with javac’s -encoding option.
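For instance, if the source files happen to be saved in code page 1252 (an assumption for illustration, with Main.java as a placeholder file name), the compilation would look like:

javac -encoding Cp1252 Main.java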

2. No Recompilation

In JDK 18, file.encoding has become a supported system property (i.e., it is described in the javadoc and officially supported). The value of that system property must be either UTF-8 or COMPAT; any other value leaves the behavior undefined. If the application is launched with the -Dfile.encoding=COMPAT command-line option, the default charset is determined the way it used to be in prior JDK releases, which preserves compatibility.
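For example, launching an application on JDK 18 with the pre-JDK 18 behavior restored would look like this (myapp.jar is a placeholder for your application):

java -Dfile.encoding=COMPAT -jar myapp.jar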


Preparing for JEP 400 - Call to Action

Since JEP 400 is a somewhat disruptive enhancement, we urge you to test your applications in your existing environment. The exact effect of this JEP can easily be reproduced with previously released JDKs back to JDK 8 by using the file.encoding system property. So try running your application with the -Dfile.encoding=UTF-8 command-line option and see how it behaves. Our expectation is that there won’t be any issues on macOS and Linux, as their default encoding is already UTF-8. On Windows, especially for East Asian locales such as Chinese/Japanese/Korean, some incompatible behavior can be anticipated. If that’s the case, please try the mitigation strategies explained above.
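For instance, on JDK 8 through 17, the following command line (myapp.jar again being a placeholder) runs your application with UTF-8 as the default charset, approximating the JDK 18 behavior ahead of time:

java -Dfile.encoding=UTF-8 -jar myapp.jar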

And of course, you can also try out JEP 400 with a JDK 18 Early Access build (JEP 400 has been integrated in build 13) which can be downloaded at https://jdk.java.net/18/.


Wrap-up

We were wondering about the reception of JEP 400, as it is a long overdue but disruptive enhancement. When the JEP was promoted to the “Candidate” state, we received lots of external feedback, and it turned out that most of it was very positive! This reinforces the direction taken for this enhancement. We are sure that, in the long run, developers will simply take it for granted as UTF-8 by default becomes the norm.