A granular look at what the JVM is quietly doing with your strings at the native level, when that work genuinely saves you memory, and when it simply burns CPU for nothing.

1. Wait — It’s Already Running?

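If you are running Java 8u20 or any later version with the G1 garbage collector, the JVM ships a subsystem called String Deduplication that may already be active in your production services, even if you never turned it on yourself. In stock OpenJDK builds the feature is disabled by default and is switched on with -XX:+UseStringDeduplication, but that flag is easy to inherit from a base image, a shared launch script, or JAVA_TOOL_OPTIONS. Either it was enabled for you or it is one flag away from being so.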

According to JEP 192, which introduced this feature, the motivation was simple: measurements across many Java applications showed that String objects make up roughly 25% of the live heap, and about half of those strings are exact duplicates. The JVM team decided it was worth doing something about that automatically.

String Deduplication was introduced in Java 8 Update 20 (August 2014), initially for the G1 garbage collector only. Shenandoah also supports it, and JDK 18 extended it to the Serial, Parallel, and ZGC collectors. The old CMS collector never supported it.

To enable it explicitly — or to check whether it is already active — you can use these flags:

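# Enable String Deduplication with G1 and print statistics (JDK 8 only)
java -XX:+UseG1GC -XX:+UseStringDeduplication -XX:+PrintStringDeduplicationStatistics -jar yourapp.jar

# JDK 9 and later: the statistics flag was replaced by unified logging.
# The tag set differs by release (gc+stringdedup up to JDK 16, stringdedup* from JDK 17);
# run `java -Xlog:help` to confirm the tags your build accepts.
java -XX:+UseG1GC -XX:+UseStringDeduplication -Xlog:gc+stringdedup=debug -jar yourapp.jar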

# Verify it is running on an already-started process (requires JDK tools)
jcmd <pid> VM.flags | grep StringDedup

With the statistics option enabled, the JVM logs deduplication statistics after each deduplication pass. Now, before we dive into whether those statistics will make you happy or worried, let us understand exactly what is happening under the hood.
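The same check can also be made from inside a running application. The following is a minimal sketch, assuming a HotSpot-based JVM that exposes com.sun.management.HotSpotDiagnosticMXBean; the class name DedupFlagCheck is made up for illustration.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class DedupFlagCheck {

    // Returns the current value of a HotSpot VM option as a string, e.g. "true" or "false".
    // Throws IllegalArgumentException if the running JVM does not know the option.
    static String flagValue(String name) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        return bean.getVMOption(name).getValue();
    }

    public static void main(String[] args) {
        System.out.println("UseStringDeduplication=" + flagValue("UseStringDeduplication"));
    }
}
```

Logging this value once at startup makes it obvious in every environment whether the flag was inherited from somewhere you did not expect.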

2. What Actually Happens at the Native Level

A Java String object has two parts: the object itself (a thin shell with metadata) and an underlying char[] or byte[] array that holds the actual characters. As of Java 9 and the Compact Strings change (JEP 254), every String is backed by a byte[], and strings containing only Latin-1 characters use one byte per character, which halves their raw memory footprint versus the old UTF-16 char[] storage. String Deduplication takes a different angle entirely.

Rather than compressing characters, it looks for separate string objects that happen to hold identical character sequences and then makes them all share a single backing array. The string objects themselves remain distinct — your == comparisons are unaffected — but their internal value fields are silently rewired to point at the same heap array. Only one copy of those bytes survives. The rest get garbage collected.

This is not the same as String.intern(). Interning makes the String objects themselves identical (same reference). Deduplication keeps distinct String objects while merging only their internal byte[] storage. Your object identity is preserved; only the backing memory is shared.
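The identity side of that distinction is easy to demonstrate. The sketch below uses only standard java.lang.String behaviour; the class name IdentityVsIntern is illustrative. Note that deduplication itself cannot be observed this way, because the shared backing array is not visible through the public String API.

```java
public class IdentityVsIntern {

    // Two separately constructed strings with equal content are distinct objects.
    // Deduplication never changes this; it only touches the private backing array.
    static boolean distinctObjects() {
        String a = new String("EUR");
        String b = new String("EUR");
        return a != b && a.equals(b);
    }

    // Interning maps equal content to one canonical String object.
    static boolean internedAreSame() {
        String a = new String("EUR");
        String b = new String("EUR");
        return a.intern() == b.intern();
    }

    public static void main(String[] args) {
        System.out.println("distinct objects, equal content: " + distinctObjects()); // true
        System.out.println("interned references identical:   " + internedAreSame()); // true
    }
}
```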

The GC-Integrated Pipeline

The deduplication process is tightly coupled to the G1 collection cycle, and understanding that coupling is key to understanding the performance cost. Here is how it flows, step by step:

#	Step	Where it runs	Cost type
1	String objects that reach the deduplication age threshold during a GC (-XX:StringDeduplicationAgeThreshold, default 3) are added to a deduplication queue	GC thread (inline)	Tiny allocation overhead
2	A background deduplication thread drains the queue concurrently	Background (off GC thread)	CPU contention with app
3	Thread computes a hash of each string’s `byte[]`	Background thread	Memory bandwidth
4	Hash is looked up in an internal string dedup table	Background thread	Hash table overhead
5	On match, content is compared byte-by-byte to confirm equality	Background thread	Memory read, potential cache miss
6	The `value` field of the duplicate is updated to point at the canonical array	Background thread	Write barrier, minimal
7	Old duplicate array is now unreachable, freed on next GC cycle	GC thread	Standard collection

Notice that steps 2 through 6 run concurrently — they do not stop your application. However, “concurrent” does not mean “free.” The background thread still shares CPU cores and memory bandwidth with your application threads. On a constrained container with, say, 2 vCPUs, that cost is very real. More on that in the trade-off section below.
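Steps 3 through 6 of the table can be modelled in a few lines of plain Java. This is a toy sketch, not the HotSpot implementation (which is native code inside the JVM): FakeString, DedupPipelineSketch, and the bucketed HashMap are all hypothetical stand-ins chosen to make the hash, lookup, compare, and re-point sequence visible.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DedupPipelineSketch {

    // Stand-in for java.lang.String: a distinct object with a re-pointable backing array.
    static final class FakeString {
        byte[] value;
        FakeString(String s) { value = s.getBytes(StandardCharsets.ISO_8859_1); }
    }

    // hash -> arrays already seen with that hash (a bucket, to tolerate hash collisions)
    private final Map<Integer, List<byte[]>> table = new HashMap<>();
    long bytesSaved = 0;

    void deduplicate(FakeString s) {
        int hash = Arrays.hashCode(s.value);                        // step 3: hash the array
        List<byte[]> bucket =
                table.computeIfAbsent(hash, h -> new ArrayList<>()); // step 4: table lookup
        for (byte[] canonical : bucket) {
            if (canonical == s.value) {
                return;                                             // already shares this array
            }
            if (Arrays.equals(canonical, s.value)) {                // step 5: byte-by-byte compare
                bytesSaved += s.value.length;
                s.value = canonical;                                // step 6: re-point the field
                return;
            }
        }
        bucket.add(s.value);                                        // first sighting: new entry
    }

    public static void main(String[] args) {
        DedupPipelineSketch dedup = new DedupPipelineSketch();
        FakeString a = new FakeString("EUR");
        FakeString b = new FakeString("EUR");
        dedup.deduplicate(a);
        dedup.deduplicate(b);
        System.out.println("shared array: " + (a.value == b.value)); // true
        System.out.println("bytes saved:  " + dedup.bytesSaved);     // 3
    }
}
```

Even in this toy version every candidate pays for a hash and a table probe, and only a match pays anything back. That asymmetry is what the trade-off section is about.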

3. When It Genuinely Helps: The Microservice Case

String Deduplication shines brightest in a specific type of workload. If you are running microservices that repeatedly deserialize structured data — think JSON APIs, Kafka consumers, gRPC services — you are almost certainly generating thousands of string objects per second that are textually identical.

Consider a user-profile service that reads from Kafka. Every message might contain fields like "country": "DE", "currency": "EUR", "role": "viewer". With ten thousand messages per second, by the time those strings have survived enough young collections to become deduplication candidates, they exist as thousands of separate byte[] arrays, each spelling out the same handful of characters. Deduplication collapses each distinct value to a single array.

Figure: Heap Memory: Before vs After String Deduplication. Simulated microservice processing 10,000 JSON messages/sec with repeated enum-like fields. Values in MB over a 60-second observation window.

The savings in this category of workload are not trivial. JEP 192 estimates an expected heap reduction of around 10% across the applications it measured, and string-heavy services can see considerably more. Furthermore, because the live heap stays smaller, collections have less work to do and tend to run less often, a secondary benefit that compounds the first.

Typical good fits: JSON/XML parsing services · Kafka consumers · REST APIs with repeated domain values · Database result-set processors · Any service where the same string values recur across many objects.
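Before reaching for JVM flags, you can get a rough feel for how repetitive your data is by sampling deserialized field values and counting the redundant copies. The sketch below is an assumption-laden estimate: the class DuplicateEstimator and its methods are invented for illustration, and it counts characters rather than exact heap bytes, ignoring object headers and array alignment.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DuplicateEstimator {

    // Counts characters held by redundant copies: for a value seen n times,
    // (n - 1) copies are redundant and each holds value.length() characters.
    static long redundantChars(List<String> sample) {
        Map<String, Integer> counts = new HashMap<>();
        for (String s : sample) {
            counts.merge(s, 1, Integer::sum);
        }
        long redundant = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            redundant += (long) (e.getValue() - 1) * e.getKey().length();
        }
        return redundant;
    }

    // Total characters across the whole sample, duplicates included.
    static long totalChars(List<String> sample) {
        long total = 0;
        for (String s : sample) {
            total += s.length();
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("DE", "EUR", "viewer", "DE", "EUR", "DE");
        System.out.println("redundant chars: " + redundantChars(sample)); // 7
        System.out.println("total chars:     " + totalChars(sample));     // 18
    }
}
```

A high ratio of redundant to total characters on long-lived values suggests deduplication has something to work with. A ratio near zero points to the unique-string scenarios described in the next section.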

4. When It Is a CPU Trade-Off Not Worth Making

Here is the part that most articles skip. String Deduplication is not universally beneficial, and blindly leaving it on is not always the right call. There are at least three scenarios where the cost quietly outweighs the benefit.

Scenario 1 — Short-lived strings that never survive GC

The deduplication queue only receives strings that have survived enough collections to reach the age threshold (three by default) or that are promoted to the old generation. If most of your strings are request-scoped and die in the young generation, which is the ideal situation for GC performance, they will never be deduplicated at all. The hashing and table-lookup cost is zero, but so are the savings. Deduplication neither helps nor hurts here; it is simply neutral.

Scenario 2 — Unique-content strings (UUIDs, timestamps, log lines)

Generating a UUID per request? Building a timestamp string every second? Each one is unique, so the dedup table will record a hash, find no match, and store the entry, only to drop it again once the string dies and the table is cleaned up. The net result is CPU and memory bandwidth spent on hash computation and table writes that produce zero savings.

Scenario 3 — CPU-constrained environments

This is the most dangerous scenario in today's containerised world. If your pod or VM has a limited CPU quota, say 0.5 to 1 vCPU, the background deduplication thread is competing directly with your request-serving threads. You may observe elevated 99th-percentile latencies that are almost impossible to attribute without profiling, because the culprit is not your code.

Figure: CPU Overhead of String Deduplication Across Workload Types. Approximate CPU overhead added by the deduplication background thread as % of total CPU budget. Lower is better.

Workload type	String repetition	Dedup memory saving	CPU cost (1–2 vCPU)	Verdict
Kafka consumer (domain enums)	Very high	15–30%	Low–Medium	Enable
REST API (JSON with common fields)	High	8–20%	Low	Enable
Computation service (UUIDs / hashes)	Very low	<1%	Medium	Disable
Log aggregator (unique log lines)	Very low	<2%	Medium–High	Disable
Batch processor (mixed data)	Medium	5–12%	Low–Medium	Measure first
Database ORM (repeated column names)	High	10–25%	Low	Enable

5. How to Measure Its Effect with JFR

Opinions about performance are worthless without data. Fortunately, the JVM ships with Java Flight Recorder (JFR), a production-safe, low-overhead profiling mechanism built directly into OpenJDK since Java 11 (and backported to 8u262). Whether a recording contains dedicated String Deduplication events depends on your JDK release, so check which event types were actually recorded before relying on them.

Here is the cleanest way to capture a JFR recording for dedup analysis:

# Start a 120-second recording on a running process
jcmd <pid> JFR.start duration=120s filename=dedup-profile.jfr

# Dump an already-running recording
jcmd <pid> JFR.dump filename=dedup-profile.jfr

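# List the event types the recording actually contains
jfr summary dedup-profile.jfr

# Print matching events in human-readable form (no output means none were recorded)
jfr print --events StringDeduplication dedup-profile.jfr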
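If you prefer to inspect the recording programmatically, the standard jdk.jfr.consumer API can list which event types it contains. The following is a sketch under assumptions: the class JfrEventCounter is illustrative, and the "dedup" substring filter is only a heuristic, since the exact event names depend on the JDK that produced the file.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordingFile;

public class JfrEventCounter {

    // Streams through the recording and counts events per event-type name.
    static Map<String, Long> countByType(Path file) throws IOException {
        Map<String, Long> counts = new TreeMap<>();
        try (RecordingFile recording = new RecordingFile(file)) {
            while (recording.hasMoreEvents()) {
                RecordedEvent event = recording.readEvent();
                counts.merge(event.getEventType().getName(), 1L, Long::sum);
            }
        }
        return counts;
    }

    // Heuristic name filter; the real event names are not assumed here.
    static boolean looksDedupRelated(String eventTypeName) {
        return eventTypeName.toLowerCase(Locale.ROOT).contains("dedup");
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java JfrEventCounter <recording.jfr>");
            return;
        }
        countByType(Paths.get(args[0])).forEach((name, count) -> {
            String marker = looksDedupRelated(name) ? "  <-- dedup-related?" : "";
            System.out.println(count + "\t" + name + marker);
        });
    }
}
```

If no dedup-related event type appears in the output, fall back to the deduplication log and the GC events for your measurements.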

Once you have the .jfr file, open it in JDK Mission Control (JMC), the official GUI for JFR analysis, and look for the String Deduplication data; page and metric names vary between JMC and JDK versions. The numbers that matter most are:

JFR metric	What it tells you	Healthy range
`Last Deduplication Time`	CPU time the background thread spent on one dedup pass	<5 ms per pass
`Deduplicated Bytes`	Total bytes freed by merging duplicate arrays	Should grow steadily if it’s worth running
`Dedup Table Size`	Number of unique strings currently tracked	Should stabilise; unbounded growth is a red flag
`New Table Entries`	How many new unique strings were seen in this pass	High value with low savings = unique-string workload

If Deduplicated Bytes is large and growing while Last Deduplication Time stays under a few milliseconds, String Deduplication is earning its keep. On the other hand, if the table keeps adding new entries without accumulating much freed memory, you are in the unique-string scenario — and you should disable it with -XX:-UseStringDeduplication.

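You can also get a quick signal from the deduplication log: add -XX:+PrintStringDeduplicationStatistics on JDK 8, or enable the string-deduplication tags of unified logging (-Xlog) on JDK 9 and later. Compare the Inspected and Deduplicated counts to see at a glance how much of the work is paying off.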

A Realistic Diagnostic Workflow

Rather than guessing whether deduplication is helping, follow this three-step process in your staging environment before touching production:

Step	Action	Tool	Decision signal
1	Capture heap composition	`jmap -histo:live <pid>`	What % of heap is `[B` (byte arrays; `[C` char arrays on JDK 8)?
2	Run JFR recording with dedup events	`jcmd` + JMC	Is `Deduplicated Bytes` significant?
3	Compare GC pause times	JFR GC events or `-Xlog:gc*`	Are pauses shorter with dedup on?

6. One More Surprising Detail: the Dedup Table Is Not Free Either

Here is something that catches people off guard. The internal hash table that String Deduplication uses to track known string arrays is allocated by the JVM in native memory, outside the Java heap. It does not show up in a heap histogram, but it still counts against your process footprint and your container's memory limit. In workloads with very high string cardinality (lots of unique strings), the dedup table can grow to a meaningful size while saving almost nothing.

The JVM resizes and cleans the table as entries die, but it does so lazily. In practice, if your dedup table's New Table Entries metric keeps climbing without stabilising, you have found a workload where the feature is actively working against you. That is the signal to disable it: not as a pessimisation, but as a correction.

7. ZGC and Shenandoah Users: Check Your JDK Version

If you have already migrated to ZGC or Shenandoah, both of which target very low pause times, whether String Deduplication applies to you depends on your JDK release. Shenandoah supports -XX:+UseStringDeduplication, and ZGC gained support in JDK 18 together with the Serial and Parallel collectors. On JDK 17 and earlier, ZGC does not support it.

For users on a collector or JDK version without dedup support who want similar benefits, the closest alternative is String.intern() used deliberately on high-repetition domain values, or application-level caching (an interning map) on the specific fields you know are repeated. That approach is more surgical, and more transparent in a profiler.
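An application-level interning map of the kind mentioned above can be sketched as follows. FieldInterner is a hypothetical helper, not a library class, and it assumes the fields you pass through it have low cardinality, because the pool is unbounded and never evicts.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class FieldInterner {

    // Maps each distinct value to the first String instance seen with that content.
    private final ConcurrentMap<String, String> pool = new ConcurrentHashMap<>();

    // Returns the canonical instance for the given content; null passes through unchanged.
    public String canonical(String s) {
        if (s == null) {
            return null;
        }
        String existing = pool.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    // Number of distinct values currently held; watch this to catch unexpected cardinality.
    public int size() {
        return pool.size();
    }

    public static void main(String[] args) {
        FieldInterner roles = new FieldInterner();
        String first = roles.canonical(new String("viewer"));
        String second = roles.canonical(new String("viewer"));
        System.out.println("same instance: " + (first == second)); // true
        System.out.println("pool size:     " + roles.size());      // 1
    }
}
```

Unlike deduplication, this makes the String objects themselves shared, so the duplicate object shells become garbage as well as their arrays. The cost is that you must choose the fields by hand and keep unique values such as UUIDs out of the pool.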

8. What We Have Learned

String Deduplication is a JVM feature introduced for G1 in Java 8u20, and since extended to other collectors, that silently merges the internal byte[] arrays of identical string objects, freeing memory without changing object identity.
It runs on a background thread integrated with the G1 GC cycle — it is concurrent but not free, and it carries a real CPU cost that matters on containers with limited vCPUs.
It delivers its best results on microservices that repeatedly deserialize structured data with recurring field values: JSON APIs, Kafka consumers, ORM-heavy services. It is actively harmful in workloads dominated by high-cardinality, unique strings like UUIDs or log lines.
Java Flight Recorder gives you the exact data you need — Deduplicated Bytes, Last Deduplication Time, and Dedup Table Size — to make the enable/disable decision with evidence rather than intuition.
ZGC and Shenandoah users should check their JDK version: Shenandoah supports deduplication, and ZGC has supported it since JDK 18.

URL: https://www.javacodegeeks.com/2026/05/string-deduplication-is-on-by-default-in-g1-and-most-developers-dont-know-what-it-does.html

⇱ String Deduplication Is On By Default in G1 — And Most Developers Don't Know What It Does - Java Code Geeks

Eleftheria Drosopoulou
