VOOZH about

URL: https://dev.to/dev-shreyas/i-built-a-stable-sorting-algorithm-that-beats-javas-dual-pivot-quicksort-fck

⇱ I Built a Stable Sorting Algorithm That Beats Java's Dual-Pivot Quicksort and TimSort - DEV Community


A few days ago I finished benchmarking something I've been building - a cache-aware, stable, histogram-based sorting algorithm I'm calling BusSort. The results surprised even me.

At 100 million elements, it runs ~2x faster than Java's Dual-Pivot Quicksort on random data — while being stable. Dual-Pivot QS is not. And the generic variant beats TimSort by up to 8.9x on duplicate-heavy data.


The Problem With Quicksort at Scale

Quicksort-based algorithms partition elements with random writes across the entire array. At large scales this causes cache thrashing - elements are being written to memory locations all over the place, constantly missing L1 and L2 cache.

The larger the array, the worse it gets.


The Core Idea

Instead of scattering elements globally, BusSort processes data in L1 cache-sized chunks - 4096 integers (~16KB).

For each chunk, it does 4 passes:

  • PASS 1 - Scan left-to-right, compute bucket for each element, build a local histogram
  • PASS 2 - Compute local prefix sums (bucket positions within the chunk)
  • PASS 3 - Scatter into a local grouped buffer - because this buffer is L1-sized, all random writes stay in cache
  • PASS 4 - Copy each bucket's portion to its correct global position

With 128-way splitting, recursion depth stays at just ~4 levels even for 100M elements.

Base case: Insertion Sort for ≤ 1024 elements.

On the benchmark machine (i5-1135G7, 48KB L1 data cache):

4096 × 3 × 4 bytes = 49,152 bytes ≈ 48KB

The three working arrays fit exactly in L1. Not a coincidence.


Benchmark Results

int[] vs Dual-Pivot Quicksort

Tested against Arrays.sort(int[]) - Java's Dual-Pivot Quicksort.

n = 100,000,000 | Java 18 | i3-1115G4 @ 3.00GHz
Methodology: JMH 1.37 — 5 warmup iterations, 10 measurement iterations, 3 JVM forks (30 data points per benchmark)

Input Type BusSort (ms) ± Error DPQ (ms) ± Error Ratio
Random 3864 ±130 8344 ±268 2.2x
Duplicates 890 ±165 2266 ±31 2.5x
Few Duplicates 2232 ±273 3382 ±62 1.5x
Clustered 1527 ±35 2394 ±44 1.6x
Sorted 38 ±0.7 33 ±1.3 0.9x ❌
Reverse 191 ±2.9 89 ±2.2 0.5x ❌
Nearly Sorted 2726 ±190 2303 ±33 0.8x ❌
All Same 53 ±9.9 35 ±0.5 0.6x ❌

4/8 input types faster. Stable. Zero comparison overhead.

The ± Error columns confirm these gaps are statistically significant — not noise. Full benchmark code is in the benchmarks/ folder if you want to reproduce.


T[] vs TimSort (Generic)

Tested against Arrays.sort(T[]) - Java's TimSort.

n = 40,000,000 | Java 17 | i5-1135G7 @ 2.40GHz
Methodology: JMH 1.37 — 5 warmup iterations, 10 measurement iterations, 3 JVM forks (30 data points per benchmark)

Input Type BusSort (ms) ± Error TimSort (ms) ± Error Ratio
Random 7357 ±162 21691 ±277 2.9x
Duplicates 619 ±8 5510 ±55 8.9x
Few Duplicates 3680 ±90 7412 ±114 2.0x
Clustered 1804 ±97 5599 ±103 3.1x
All Same 34 ±0.6 34 ±0.8 ~1x ✅
Sorted 63 ±0.2 61 ±0.3 ~1x
Nearly Sorted 2541 ±103 1835 ±160 0.7x ❌
Reverse 652 ±11 313 ±2 0.5x ❌

5/8 input types faster. Naturally stable. 8.9x on duplicates.

Full benchmark code is in the generics/benchmarks/ folder if you want to reproduce.


Why It's Stable

BusSort preserves relative order of equal elements naturally — no extra bookkeeping needed:

  • Chunks are processed left-to-right
  • Within each chunk, elements are scattered left-to-right
  • globalNext[b] advances per chunk - earlier chunks always land before later chunks

This makes BusSort one of very few algorithms that is both faster than Dual-Pivot Quicksort and TimSort and stable.

For int[], equal integers are identical by value — stability is technically unobservable. However, the stable reverse path is intentionally preserved for consistency with the stability guarantee and to serve as a reference for future ports.


Trade-offs

  • Requires O(n) auxiliary space — standard for stable sorts
  • Generic variant always allocates O(n) auxiliary space — TimSort is adaptive on structured inputs
  • Slower on already-sorted, reverse-sorted, and nearly-sorted input — TimSort's run detection is unbeatable on these cases

What's Next

  • Dynamic stack sizing
  • Auto-tune BUS_SIZE based on runtime L1 cache size
  • Parallel variant
  • Formal write-up / paper

GitHub

Full implementation, benchmarks, and generic variant published.

👉 github.com/dev-shreyaspatil/BusSort


Curious what the Java community thinks — especially around the cache-chunking approach and whether there are similar published algorithms I should be comparing against.