1. Overview
When dealing with Strings in Java, we sometimes need to encode them into a specific charset.
This tutorial is a practical guide showing different ways to encode a String to the UTF-8 charset.
For a more technical deep-dive, see our Guide to Character Encoding.
2. Defining the Problem
To showcase the Java encoding, we'll work with the German String "Entwickeln Sie mit Vergnügen":
String germanString = "Entwickeln Sie mit Vergnügen";
byte[] germanBytes = germanString.getBytes();
String asciiEncodedString = new String(germanBytes, StandardCharsets.US_ASCII);
assertNotEquals(asciiEncodedString, germanString);
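Note that getBytes() without arguments uses the platform's default charset. Decoding the resulting bytes as US_ASCII gives us a value that typically prints as "Entwickeln Sie mit Vergn?gen", because US_ASCII can't represent the non-ASCII ü character: every byte outside the ASCII range is decoded as the Unicode replacement character (U+FFFD), which many consoles display as "?".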
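To make the mismatch visible without relying on the console or the platform default charset, here is a minimal sketch that encodes explicitly as UTF-8 and then decodes as US_ASCII. The class and method names are hypothetical and exist only for illustration:

```java
import java.nio.charset.StandardCharsets;

class AsciiDecodeDemo {

    // Encodes the text as UTF-8, then (deliberately wrongly) decodes the bytes as US-ASCII.
    static String decodeUtf8BytesAsAscii(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // \u00FC is the letter ü; the escape keeps this example independent of the source file encoding.
        String decoded = decodeUtf8BytesAsAscii("Vergn\u00FCgen");

        // ü takes two bytes in UTF-8, and each of them becomes one replacement character.
        long replaced = decoded.chars().filter(c -> c == '\uFFFD').count();
        System.out.println(replaced);         // 2
        System.out.println(decoded.length()); // 10
    }
}
```

The decoded String is one character longer than the original because the two UTF-8 bytes of ü are each replaced separately.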
By contrast, when the String contains only ASCII characters, decoding its bytes as US_ASCII gives us back the same String:
String englishString = "Develop with pleasure";
byte[] englishBytes = englishString.getBytes();
String asciiEncodedEnglishString = new String(englishBytes, StandardCharsets.US_ASCII);
assertEquals(asciiEncodedEnglishString, englishString);
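The reason the English text survives is that ASCII characters are encoded to the same single bytes in both US_ASCII and UTF-8. The following sketch checks this directly; the class and method names are hypothetical:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class AsciiSubsetDemo {

    // Returns true when the text encodes to identical bytes in US-ASCII and UTF-8.
    static boolean sameBytesInAsciiAndUtf8(String text) {
        byte[] asciiBytes = text.getBytes(StandardCharsets.US_ASCII);
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(asciiBytes, utf8Bytes);
    }

    public static void main(String[] args) {
        System.out.println(sameBytesInAsciiAndUtf8("Develop with pleasure")); // true
        // \u00FC is ü: one '?' byte in US-ASCII, two bytes in UTF-8.
        System.out.println(sameBytesInAsciiAndUtf8("Vergn\u00FCgen"));        // false
    }
}
```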
Let's see what happens when we use the UTF-8 encoding.
3. Encoding With Core Java
Let's start with the core library.
Strings are immutable in Java, and a String doesn't carry a byte encoding of its own: a charset only comes into play when we convert between a String and bytes. To achieve what we want, we need to copy the bytes of the String using the desired charset and then create a new String from them.
First, we get the String bytes, and then we create a new one using the retrieved bytes and the desired charset:
String rawString = "Entwickeln Sie mit Vergnügen";
byte[] bytes = rawString.getBytes(StandardCharsets.UTF_8);
String utf8EncodedString = new String(bytes, StandardCharsets.UTF_8);
assertEquals(rawString, utf8EncodedString);
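The round trip only works when the same charset is used in both directions. As a rough sketch (the class and the roundTrip helper are hypothetical), we can compare the byte counts per charset and see what happens when the charsets don't match:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RoundTripDemo {

    // Encodes the text with one charset and decodes the bytes with another.
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        return new String(text.getBytes(encodeWith), decodeWith);
    }

    public static void main(String[] args) {
        String raw = "Vergn\u00FCgen"; // \u00FC is ü

        // ü needs two bytes in UTF-8 but only one in ISO-8859-1.
        System.out.println(raw.getBytes(StandardCharsets.UTF_8).length);      // 10
        System.out.println(raw.getBytes(StandardCharsets.ISO_8859_1).length); // 9

        // Matching charsets restore the original text; mismatched ones do not.
        System.out.println(raw.equals(
            roundTrip(raw, StandardCharsets.UTF_8, StandardCharsets.UTF_8)));      // true
        System.out.println(raw.equals(
            roundTrip(raw, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1))); // false
    }
}
```

Passing the charset explicitly, as in this section, also avoids depending on the platform default that the no-argument getBytes() uses.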
4. Encoding With Java 7 StandardCharsets
Alternatively, we can use the encode() and decode() methods of the Charset constants defined in the StandardCharsets class, which was introduced in Java 7.
First, we'll encode the String into a ByteBuffer, and second, we'll decode it back into a String:
String rawString = "Entwickeln Sie mit Vergnügen";
ByteBuffer buffer = StandardCharsets.UTF_8.encode(rawString);
String utf8EncodedString = StandardCharsets.UTF_8.decode(buffer).toString();
assertEquals(rawString, utf8EncodedString);
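If we need a plain byte[] rather than a ByteBuffer, one possible approach is to copy out only the bytes between the buffer's position and limit, since the backing array returned by buffer.array() may contain unused trailing capacity. This is a sketch with hypothetical class and method names:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ByteBufferDemo {

    // Encodes the text as UTF-8 and copies only the encoded bytes out of the buffer.
    static byte[] toUtf8Bytes(String text) {
        ByteBuffer buffer = StandardCharsets.UTF_8.encode(text);
        byte[] bytes = new byte[buffer.remaining()];
        buffer.get(bytes);
        return bytes;
    }

    public static void main(String[] args) {
        String raw = "Vergn\u00FCgen"; // \u00FC is ü
        byte[] fromBuffer = toUtf8Bytes(raw);
        byte[] fromGetBytes = raw.getBytes(StandardCharsets.UTF_8);

        // Both approaches produce the same UTF-8 bytes.
        System.out.println(Arrays.equals(fromBuffer, fromGetBytes)); // true
    }
}
```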
5. Encoding With Commons-Codec
Besides core Java, we can use Apache Commons Codec to achieve the same results.
Apache Commons Codec is a handy package containing simple encoders and decoders for various formats.
First, let's start with the project configuration.
When using Maven, we have to add the commons-codec dependency to our pom.xml:
<dependency>
    <groupId>commons-codec</groupId>
    <artifactId>commons-codec</artifactId>
    <version>1.14</version>
</dependency>
In our case, the most interesting class is StringUtils from the org.apache.commons.codec.binary package, which provides methods to encode Strings.
Using this class, getting a UTF-8 encoded String is pretty straightforward:
String rawString = "Entwickeln Sie mit Vergnügen";
byte[] bytes = StringUtils.getBytesUtf8(rawString);
String utf8EncodedString = StringUtils.newStringUtf8(bytes);
assertEquals(rawString, utf8EncodedString);
6. Conclusion
Encoding a String into UTF-8 isn't difficult, but it's not that intuitive. This article presents three ways of doing it, using either core Java or Apache Commons Codec.
