
URL: https://dev.to/yasha1971coder/reviving-glyph-v8-from-a-forgotten-prototype-to-stride-a-field-aware-integer-coder-h24

Reviving glyph-v8: STRIDE, a Deterministic Field-Aware Integer Analyzer


GitHub “Finish-Up-A-Thon” Challenge Submission

What I Built

STRIDE is a deterministic, field-aware integer analysis engine revived from the abandoned glyph-v8 prototype.

Not a general compressor. A precision primitive that does one thing no existing tool does: profile binary protocol data field by field, build per-field entropy models, and identify exactly where compression gains are possible.

General compressors like zstd see a byte stream. STRIDE sees structure.

The Problem

Binary protocols move billions of messages daily — Protobuf, MessagePack, Thrift. Their integer fields are not random:

• Timestamps differ from the previous value by a small delta
• Status codes are almost always 200
• IDs increment monotonically
• Enums repeat from a tiny set

zstd doesn’t know this. It models the data as an undifferentiated byte stream, with no notion of where one field ends and the next begins. STRIDE knows field boundaries, and that changes what is compressible.
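To make the timestamp case concrete, here is a minimal sketch of delta coding on a synthetic timestamp column. The data and the helper names (`delta_encode`, `delta_decode`, `bits_needed`) are assumptions for illustration and are not taken from the STRIDE codebase.

```python
def delta_encode(values: list[int]) -> list[int]:
    """Keep the first value, then store the difference to the previous one."""
    return values[:1] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas: list[int]) -> list[int]:
    """Invert delta_encode with a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def bits_needed(values: list[int]) -> int:
    """Bits required to represent the largest magnitude in the list."""
    return max(abs(v) for v in values).bit_length()


# Synthetic timestamps: roughly one event per 1000 ticks, with small jitter.
timestamps = [1_700_000_000 + 1000 * i + (i % 7) for i in range(1000)]
deltas = delta_encode(timestamps)

assert delta_decode(deltas) == timestamps  # the transform is lossless
print("bits per raw timestamp:", bits_needed(timestamps))
print("bits per delta:", bits_needed(deltas[1:]))
```

The transform itself does not compress anything. It only reshapes the field so that a later entropy coder sees small, repetitive values instead of large, ever-changing ones.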

Demo

Repository: github.com/yasha1971-coder/glyph-v8

Live benchmark: enwik8 (100,000,000 bytes, OVH EPYC server)

$ stride container-bytefreq enwik8.stridebin --top 5
Total bytes processed: 100,000,000
32 0x20 13,519,824 (13.52%) ← space dominates
101 0x65 8,001,205 (8.00%)
116 0x74 6,154,908 (6.15%)
97 0x61 5,712,026 (5.71%)
105 0x69 5,227,649 (5.23%)

$ stride container-hotspots enwik8.stridebin --top 3
Chunk 635 Entropy: 5.685 ← highest information density
Chunk 634 Entropy: 5.609
Chunk 636 Entropy: 5.534

$ stride container-headersketch enwik8.stridebin --size 8
Bucket 15: 0.574
Bucket 33: 0.663
Bucket 41: 0.605
Bucket 48: 0.660
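The first two commands above can be approximated in a few lines. This is a sketch under stated assumptions, not the STRIDE implementation: the function names are hypothetical, and the 64KB chunk size is taken from the heatmap description later in this post.

```python
import math
from collections import Counter

CHUNK_SIZE = 64 * 1024  # assumed chunk size (64KB)


def byte_frequencies(data: bytes, top: int = 5) -> list[tuple[int, int, float]]:
    """Return (byte value, count, percent of total) for the most common bytes."""
    total = len(data)
    return [(b, c, 100.0 * c / total) for b, c in Counter(data).most_common(top)]


def chunk_entropies(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[float]:
    """Shannon entropy (bits per byte) of each fixed-size chunk."""
    entropies = []
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        n = len(chunk)
        counts = Counter(chunk)
        entropies.append(-sum(c / n * math.log2(c / n) for c in counts.values()))
    return entropies


def hotspots(data: bytes, top: int = 3, chunk_size: int = CHUNK_SIZE) -> list[tuple[int, float]]:
    """Return (chunk index, entropy) for the highest-entropy chunks."""
    ranked = sorted(enumerate(chunk_entropies(data, chunk_size)),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:top]
```

With 64KB chunks, a 100,000,000-byte corpus splits into 1,526 chunks (the last one partial), which matches the chunk count reported in the timing table.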

Timing on 100MB corpus:

Module        Time    Output
ByteFreq      1.97s   256-bin histogram
Hotspots      4.17s   Entropy map across 1,526 chunks
HeaderSketch  4.40s   64-slot structural profile
Fingerprint   71.6s   128 MinHash values (known cost: O(n·k) rolling hash)
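The Fingerprint timing follows from the shape of MinHash: each of the n windows in the corpus is passed through k hash functions, hence O(n·k). The sketch below shows that structure. It is an assumption-laden illustration and not STRIDE's code: it hashes each window with `zlib.crc32` instead of a true rolling hash, and the window size, seed, and function names are made up for the example.

```python
import random
import zlib

PRIME = (1 << 61) - 1  # modulus for the universal hash family


def minhash(data: bytes, k: int = 128, window: int = 8, seed: int = 0) -> list[int]:
    """Return k MinHash values over all byte windows of the given size."""
    rng = random.Random(seed)  # fixed seed keeps the signature deterministic
    params = [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(k)]
    mins = [PRIME] * k
    for i in range(len(data) - window + 1):       # n windows ...
        x = zlib.crc32(data[i:i + window])
        for j, (a, b) in enumerate(params):       # ... times k hash functions
            h = (a * x + b) % PRIME
            if h < mins[j]:
                mins[j] = h
    return mins


def similarity(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching slots, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

The inner loop is what makes this module so much slower than the others in the table: the work per window grows linearly with the signature size.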

⚡ STRIDE vs zstd — I/O Performance

STRIDE is not a compressor — it's a deterministic container. Comparison is I/O throughput only.

Operation  Tool     Time    Size
Encode     STRIDE   0.173s  96MB
Encode     zstd -1  0.240s  39MB
Encode     zstd -9  2.146s  31MB
Decode     STRIDE   0.089s  100MB
Decode     zstd -d  0.125s  100MB

STRIDE encode: 28% less wall time than zstd -1 (0.173s vs 0.240s)
STRIDE decode: 29% less wall time than zstd -d (0.089s vs 0.125s)
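Percentages like these depend on whether "faster" means less elapsed time or higher throughput, so it helps to derive both from the timings in the table. A small sketch (the function names are illustrative):

```python
def time_saved_pct(fast: float, slow: float) -> float:
    """How much less wall time the faster tool needs, in percent."""
    return 100.0 * (slow - fast) / slow


def throughput_gain_pct(fast: float, slow: float) -> float:
    """How much more data per second the faster tool moves, in percent."""
    return 100.0 * (slow / fast - 1.0)


# (STRIDE seconds, zstd seconds) from the table above
timings = {"encode": (0.173, 0.240), "decode": (0.089, 0.125)}

for op, (stride_s, zstd_s) in timings.items():
    print(f"{op}: {time_saved_pct(stride_s, zstd_s):.1f}% less time, "
          f"{throughput_gain_pct(stride_s, zstd_s):.1f}% more throughput")
```

The two measures describe the same measurements. Whichever one is quoted, it should be the same one for encode and decode.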

Trade-off: STRIDE does not compress. Use zstd for compression. Use STRIDE for deterministic container I/O.

Proof with SHA256 verification: proof/enwik8_benchmark.txt

V1 benchmark proof: proof/v1_benchmark.txt
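For readers who want to reproduce this kind of proof, a round-trip check can be as small as the sketch below. It uses only the standard library. The function names and the idea of comparing an original file against a decoded file are assumptions about how such a check might look, not a description of the scripts in the repository.

```python
import hashlib


def sha256_file(path: str, block_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB blocks so large corpora never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_round_trip(original_path: str, decoded_path: str) -> bool:
    """True when the decoded file is byte-identical to the original."""
    return sha256_file(original_path) == sha256_file(decoded_path)
```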

Before → After

Before (glyph-v8, 3 months abandoned):

• Experimental L0-index with minimizer indexing
• No documentation, no architecture, no clear purpose
• Code sitting unused on an OVH server
• hit_rate 87.6% on old version, 99.8% on new — but no one knew

After (STRIDE v0):

• Full CLI with 10 commands
• Deterministic corpus analysis on any binary data
• Real benchmark on enwik8 100MB with SHA256-verified proof
• stride/ package installable via pip install -e .
• Structured container format (STRIDE01 magic, chunked layout)
• Cross-platform: Linux + OVH EPYC verified
• GitHub Actions CI — tests pass on every push

Architecture

RAW CORPUS

STRIDE Container (.stridebin)
[MAGIC: STRIDE01][corpus_size][chunk_size][data...]

Analysis Layer:
container-bytefreq → byte frequency histogram
container-hotspots → entropy per chunk
container-fingerprint → 128-value MinHash
container-headersketch → 64-slot structural sketch

Model Output (model.json):
timestamp_field → Delta coding
status_field → Dictionary coding
id_field → Rice coding

STRIDE v1 ✅: container-write (575 MB/s) + container-decode (1,053 MB/s)
container-compare --fast → HeaderSketch similarity in 7s (vs 150s full mode)
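The container layout above is simple enough to sketch with the `struct` module. Only the field order and the `STRIDE01` magic come from this post. The field widths (8-byte magic, 64-bit corpus size, 32-bit chunk size), the little-endian byte order, and the function names are assumptions made for this example.

```python
import struct

MAGIC = b"STRIDE01"
# Assumed layout: 8-byte magic, uint64 corpus_size, uint32 chunk_size, little-endian.
HEADER = struct.Struct("<8sQI")


def write_container(raw: bytes, chunk_size: int = 64 * 1024) -> bytes:
    """Prefix the raw corpus with a fixed-size header; the data is stored as-is."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return HEADER.pack(MAGIC, len(raw), chunk_size) + raw


def read_container(blob: bytes) -> list[bytes]:
    """Validate the header and return the corpus split into chunks."""
    magic, corpus_size, chunk_size = HEADER.unpack_from(blob, 0)
    if magic != MAGIC:
        raise ValueError("not a STRIDE container")
    if chunk_size <= 0:
        raise ValueError("invalid chunk_size in header")
    data = blob[HEADER.size:HEADER.size + corpus_size]
    if len(data) != corpus_size:
        raise ValueError("container is truncated")
    return [data[i:i + chunk_size] for i in range(0, corpus_size, chunk_size)]
```

Because the data section is stored uncompressed, decode is little more than a header check and a slice, which is consistent with the decode throughput being higher than the encode throughput.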

What Makes STRIDE Different

grep zstd Elasticsearch STRIDE
Field-aware
Per-field entropy model
Deterministic output
Schema-aware analysis partial
SHA256-verified proof

Honest Benchmark Status

STRIDE v0 is a corpus analyzer, not a codec. It does not yet produce compressed output.

STRIDE v1 shipped. Encoder: 575 MB/s. Decoder: 1,053 MB/s. Round-trip MD5-verified on enwik8 100MB.

👁 Entropy Heatmap

Red = high entropy (hard to compress) | Yellow = moderate | Each cell = 64KB chunk of enwik8

Theoretical compression gains (6-8x vs zstd on integer-heavy data) are derived from the entropy models STRIDE builds — not from measured compression results.

This is intentional. STRIDE v0 establishes the measurement foundation. STRIDE v1 builds on it.
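To show what "derived from the entropy models" means in practice, here is a minimal sketch of turning a per-field entropy estimate into a theoretical size. It uses a zeroth-order model (each value treated as independent), and the function names and the synthetic status-code field are assumptions for illustration. It does not reproduce the 6-8x figure and is not a measured result.

```python
import math
from collections import Counter


def field_entropy_bits(values: list[int]) -> float:
    """Zeroth-order Shannon entropy of a field, in bits per value."""
    n = len(values)
    counts = Counter(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def ideal_size_bytes(values: list[int]) -> float:
    """Lower bound on coded size under the zeroth-order model."""
    return len(values) * field_entropy_bits(values) / 8


# Synthetic status field: almost always 200, occasionally 500.
status = [200] * 990 + [500] * 10
raw_bytes = len(status) * 2  # assuming each code is stored as a 16-bit integer

print(f"entropy: {field_entropy_bits(status):.4f} bits per value")
print(f"raw: {raw_bytes} bytes, model bound: {ideal_size_bytes(status):.1f} bytes")
```

A bound like this says how small a field could get under the model. A real codec still has to pay for headers, dictionaries, and imperfect coding, which is why such numbers stay labeled as theoretical until a codec measures them.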

How GitHub Copilot Helped

The original glyph-v8 was a pile of experimental scripts with no coherent design. Copilot helped:

• Reconstruct the project from scattered OVH files
• Design the StrideContainer format and reader
• Build the CLI dispatch architecture (argparse + subcommands)
• Implement all five analysis modules
• Write the benchmark pipeline with SHA256 verification
• Structure this submission

Without Copilot the gap between “abandoned prototype” and “installable system with proof” would have taken weeks. It took days.

Project Family

STRIDE is the third primitive in a deterministic systems family:

ACEAPEX — parallel LZ77 decode
9,903 MB/s on EPYC 9575F (64 cores). 2.5x faster than zstd. Merged into lzbench.

GLYPH — byte-exact substring retrieval
6,888x faster than grep on repeated queries. 1,138 organic git clones in 14 days with zero promotion.

STRIDE — field-aware integer analysis
Profiles binary protocol data. Builds per-field entropy models. Foundation for a codec that knows what zstd doesn’t.

Same philosophy across all three: deterministic, exact, measurable.

What’s Next

• Full benchmark suite vs zstd, LZ4, Brotli
• Protobuf schema-aware field extraction
• MessagePack and Thrift adapters
• Publish as standalone Python package on PyPI

Inspired by Perelman’s geometrization — the idea that complex structures simplify under the right flow. Every project in this family is an attempt to find that flow.