
URL: https://dev.to/ahmet_gedik778845/compact-video-metadata-serialization-with-protobuf-across-php-services-19lk

Compact Video Metadata Serialization With Protobuf Across PHP Services


Every two hours our ingest worker pulls a few thousand trending videos from regional sources, normalizes them, and hands the result to three downstream consumers: a ranking service that scores virality, a Go cache warmer that pre-warms the Cloudflare edge cache, and a Python analytics job that builds the GDPR-safe aggregates we show on the public site. For a long time the glue between those services was JSON. It worked, until it didn't. A single video record (title, channel, region tags, view/like deltas, a dozen scoring features, and a thumbnail manifest) serialized to roughly 2.4 KB of JSON. Multiply that by 40,000 records per cycle, fan it out to three consumers, and we were moving ~280 MB of mostly-redundant text every two hours just to keep services in sync. On the kind of budget hosting that runs ViralVidVault, that is not free, and the parsing overhead in PHP was showing up in our worker timings.

This is the story of how we moved that inter-service traffic to Protocol Buffers, what the schema looks like, how we encode it from PHP 8.4 and read it from both Python and Go, and the concrete savings we measured. No hand-waving — real schemas, real runnable code, real numbers.

Where JSON actually hurt us

It is worth being precise about the pain, because protobuf is not a free win and you should only pay its complexity cost where it earns its keep. Our problem had three distinct edges.

The first was raw size. JSON repeats every field name in every record. A field called viral_score_24h costs 15 bytes of key text on the wire (18 with its quotes and colon) for every single video, forever, even though it carries an 8-byte float. Across 40,000 records and ~35 fields, key strings alone accounted for more than half the payload.
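To make the key overhead concrete, here is a minimal sketch that measures it on a single record. The field names follow the schema used later in this article, but the record itself and its values are made up for illustration, so the exact percentage is not a measurement from our pipeline.

```python
import json

# Hypothetical record: names match the article's schema, values are invented.
record = {
    "video_id": "abc123XYZ",
    "title": "Example trending clip",
    "channel_id": "UC0000000000",
    "view_delta": 15234,
    "like_delta": 812,
    "viral_score_24h": 0.8731,
    "viral_score_7d": 0.6412,
    "is_eu_origin": True,
}


def key_overhead(rec: dict) -> tuple[int, int]:
    """Return (bytes spent on keys, total bytes) for compact JSON."""
    encoded = json.dumps(rec, separators=(",", ":")).encode("utf-8")
    # Each key costs its UTF-8 length plus two quotes and one colon.
    key_bytes = sum(len(k.encode("utf-8")) + 3 for k in rec)
    return key_bytes, len(encoded)


key_bytes, total = key_overhead(record)
print(f"{key_bytes} of {total} bytes are keys ({key_bytes / total:.0%})")
```

Even on this small, flat record the keys take more than half of the bytes, and that cost repeats for every record in the batch.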

The second was parse cost. PHP's json_decode on a 90 MB blob is not instant, and we were doing it three times per cycle across consumers. The ranking service only needs eight of the fields, but JSON forces you to parse the whole document before you can ignore the rest.

The third, and the one that finally pushed us, was schema drift. JSON has no contract. When someone added a region_eu boolean to the ingest output, the Python job silently kept working while the Go cache warmer started logging key-not-found warnings nobody read for a week. We needed a single source of truth that all three languages compile against.

Protobuf addresses all three: field numbers replace field names on the wire, the binary format is cheaper to decode than text and lets a reader skip any field it does not know using only its wire type and length, and the .proto file is the contract that generates code in every language.

Defining the schema once

Everything starts with a .proto file. This is the contract checked into the repo that all services generate code from. We use proto3. Notice the deliberate field numbering and the use of packed repeated fields and enums to keep things tight.

syntax = "proto3";

package viralvidvault.metadata.v1;

// One trending video as it moves between ingest, ranking, and edge.
message VideoMeta {
  string video_id = 1; // source platform id, not our PK
  string title = 2;
  string channel_id = 3;
  uint32 channel_subs = 4;

  // Counters are deltas since last cycle, not absolutes.
  int64 view_delta = 5;
  int64 like_delta = 6;
  int32 comment_delta = 7;

  double viral_score_24h = 8;
  double viral_score_7d = 9;

  Region region = 10;
  repeated string tags = 11;

  // Unix seconds. fixed64 is always 8 bytes; a varint would take 5 for
  // present-day values, so this trades 3 bytes for a fixed width.
  fixed64 fetched_at = 12;

  // Scoring features the ranking service consumes; packed by default.
  repeated float features = 13;

  bool is_eu_origin = 14;
}

enum Region {
  REGION_UNSPECIFIED = 0;
  REGION_US = 1;
  REGION_GB = 2;
  REGION_DE = 3;
  REGION_FR = 4;
  REGION_PL = 5;
  REGION_NL = 6;
  REGION_SE = 7;
}

message VideoBatch {
  repeated VideoMeta videos = 1;
  uint32 cycle_id = 2;
}

A few choices here matter. Field numbers 1–15 use a single byte for their tag, so the fields you send most often should live in that range; that's why every field here, including the hot scoring fields, is numbered below 16. Region is an enum, which costs one varint byte for its value instead of a region string. fetched_at is fixed64, and that choice deserves a caveat: a Unix timestamp near 1.7 billion fits in a 5-byte varint, so fixed64 costs 3 bytes more per record, not fewer (fixed64 only wins for values that routinely exceed 2^56). What it buys is a constant, predictable width; if size is the only concern, uint64 is the smaller choice. And features is a repeated float, which proto3 packs into a single length-delimited run with no per-element tag overhead.
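The byte counts behind these choices are easy to check by hand. The sketch below is a hand-rolled encoder for the relevant pieces of the protobuf wire format (varints, tags, packed floats), written only for illustration; it is not the protobuf library, and the timestamp and feature values are arbitrary.

```python
import struct


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_tag(field_number: int, wire_type: int) -> bytes:
    """A tag is the varint of (field_number << 3) | wire_type."""
    return encode_varint((field_number << 3) | wire_type)


VARINT, LEN = 0, 2  # wire types used below

# Tags for field numbers 1-15 fit in one byte; 16 and up need two.
tag_15 = len(encode_tag(15, VARINT))
tag_16 = len(encode_tag(16, VARINT))

# A Unix timestamp in seconds (arbitrary value) as varint vs fixed64.
ts = 1_700_000_000
ts_varint = len(encode_varint(ts))
ts_fixed64 = len(struct.pack("<Q", ts))

# Packed repeated float: one tag, one length prefix, then 4 bytes per element.
features = [0.25, 0.5, 0.75]
payload = struct.pack(f"<{len(features)}f", *features)
packed = encode_tag(13, LEN) + encode_varint(len(payload)) + payload

print(f"tag bytes: field 15 = {tag_15}, field 16 = {tag_16}")
print(f"timestamp bytes: varint = {ts_varint}, fixed64 = {ts_fixed64}")
print(f"packed features: {len(packed)} bytes for {len(features)} floats")
```

The timestamp line is the one worth remembering: for seconds-since-epoch values the varint is the shorter encoding, so fixed64 should be chosen for its constant width rather than for size.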

Encoding from PHP 8.4

Our ingest worker is PHP, so it produces the batch. The official google/protobuf package gives you a pure-PHP runtime, but for hot paths you want the C extension (pecl install protobuf) — it is an order of magnitude faster at serialization. The generated classes are identical either way; the extension just swaps the implementation underneath.

After running protoc --php_out=gen/ video_meta.proto, the generated classes live under the package namespace. Here is the worker building and serializing a batch:

<?php
declare(strict_types=1);

// Composer autoloader; the path and the PSR-4 mapping for gen/ are assumptions.
require __DIR__ . '/vendor/autoload.php';

use Viralvidvault\Metadata\V1\VideoBatch;
use Viralvidvault\Metadata\V1\VideoMeta;
use Viralvidvault\Metadata\V1\Region;

// Map an enum name such as "REGION_DE" to its number. The generated
// Region::value() throws on unknown names, so `??` alone is no fallback.
function regionFromName(string $name): int
{
    try {
        return Region::value($name);
    } catch (\UnexpectedValueException) {
        return Region::REGION_UNSPECIFIED;
    }
}

function buildBatch(array $rows, int $cycleId): string
{
    $batch = new VideoBatch();
    $batch->setCycleId($cycleId);

    $videos = [];
    foreach ($rows as $r) {
        $v = new VideoMeta();
        $v->setVideoId($r['video_id']);
        $v->setTitle($r['title']);
        $v->setChannelId($r['channel_id']);
        $v->setChannelSubs((int) $r['channel_subs']);
        $v->setViewDelta((int) $r['view_delta']);
        $v->setLikeDelta((int) $r['like_delta']);
        $v->setCommentDelta((int) $r['comment_delta']);
        $v->setViralScore24h((float) $r['score_24h']);
        $v->setViralScore7d((float) $r['score_7d']);
        $v->setRegion(regionFromName((string) $r['region']));
        $v->setTags($r['tags']); // array<string>
        $v->setFetchedAt((int) $r['fetched_at']); // Unix seconds
        $v->setFeatures($r['features']); // array<float>
        $v->setIsEuOrigin((bool) $r['is_eu']);
        $videos[] = $v;
    }
    $batch->setVideos($videos);

    // Single binary blob, ready for the queue or HTTP body.
    return $batch->serializeToString();
}

// Round-trip sanity check before we ship it anywhere.
// $rows comes from the normalization step, which is not shown here.
$blob = buildBatch($rows, cycleId: 4821);
$decoded = new VideoBatch();
$decoded->mergeFromString($blob);
fprintf(STDERR, "encoded %d videos into %d bytes\n",
    count($decoded->getVideos()), strlen($blob));

serializeToString() returns the raw binary. We gzip it once at the transport layer (LiteSpeed already has zlib active) and drop it onto the queue. The decode side calls mergeFromString(), which is the symmetric operation. One subtlety worth flagging: proto3 does not distinguish an unset field from a zero value for scalars. A view_delta of 0 and a missing view_delta look identical on the wire. For counters that is fine because zero is a legitimate value, but if you ever need true optionality, mark the field optional in the .proto so the generated code gives you a hasViewDelta() method.
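The zero-versus-unset subtlety is easier to see in bytes. The following sketch models how a serializer treats one non-negative integer field with and without the optional keyword. It is a hand-written illustration of the wire behaviour, not the protobuf library, and encode_int_field is a hypothetical helper name.

```python
from typing import Optional


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def encode_int_field(field_number: int, value: Optional[int], *, optional: bool) -> bytes:
    """Model one proto3 integer field; None stands for "never set".

    optional=False: implicit presence, so a zero value is never written.
    optional=True: explicit presence, so any set value is written, even zero.
    """
    if value is None:
        return b""
    if value == 0 and not optional:
        return b""
    tag = encode_varint((field_number << 3) | 0)  # wire type 0 = varint
    return tag + encode_varint(value)


# view_delta is field 5 in the schema above.
implicit_zero = encode_int_field(5, 0, optional=False)
implicit_unset = encode_int_field(5, None, optional=False)
explicit_zero = encode_int_field(5, 0, optional=True)
explicit_unset = encode_int_field(5, None, optional=True)

print("implicit: zero and unset identical?", implicit_zero == implicit_unset)
print("optional: zero =", explicit_zero.hex() or "(empty)",
      "| unset =", explicit_unset.hex() or "(empty)")
```

With implicit presence both cases produce no bytes at all, which is why the decoder cannot tell them apart. With optional, the zero is written explicitly, at the cost of two bytes per record for this field.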

Reading the same bytes from Python

The analytics job is Python. It generates from the exact same .proto (protoc --python_out=gen/ video_meta.proto), so the two services cannot disagree about what field 13 means. The job only cares about regions, scores, and EU origin for the GDPR-safe aggregates. ParseFromString still decodes the whole batch, but the code below touches only those typed attributes and never needs to know the names of the other fields.

import gzip
from collections import defaultdict

from gen.video_meta_pb2 import VideoBatch, Region


def summarize(blob: bytes) -> dict:
    batch = VideoBatch()
    batch.ParseFromString(gzip.decompress(blob))

    by_region = defaultdict(lambda: {"count": 0, "score_sum": 0.0})
    eu_count = 0

    for v in batch.videos:
        region_name = Region.Name(v.region)
        bucket = by_region[region_name]
        bucket["count"] += 1
        bucket["score_sum"] += v.viral_score_24h
        if v.is_eu_origin:
            eu_count += 1

    # Only aggregate numbers leave this function: no per-user data,
    # no raw identifiers in the analytics store. GDPR stays happy.
    return {
        "cycle_id": batch.cycle_id,
        "eu_share": round(eu_count / max(len(batch.videos), 1), 4),
        "regions": {
            name: {
                "count": b["count"],
                "avg_score": round(b["score_sum"] / b["count"], 3),
            }
            for name, b in by_region.items()
        },
    }


if __name__ == "__main__":
    import sys
    with open(sys.argv[1], "rb") as f:
        print(summarize(f.read()))

The Python side never enumerates field names as strings; it walks typed attributes the generated module gave it. If the schema adds a field tomorrow, this code keeps working untouched — it simply doesn't read the new field. That forward-compatibility is the whole point of the field-number contract.
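The mechanism behind that forward compatibility is that every field on the wire announces its own number and wire type, so a reader can step over anything it does not recognize. Here is a minimal sketch of such a reader, written by hand for illustration; the real generated code does this for you, and the extra field 15 in the sample message is hypothetical.

```python
def read_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Decode a varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def read_known_fields(buf: bytes, wanted: set[int]) -> dict[int, object]:
    """Walk a serialized message, keep wanted field numbers, skip the rest."""
    out: dict[int, object] = {}
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:    # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 1:  # fixed 64-bit
            value, pos = buf[pos:pos + 8], pos + 8
        elif wire_type == 2:  # length-delimited (strings, packed, messages)
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        elif wire_type == 5:  # fixed 32-bit
            value, pos = buf[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        if field_number in wanted:
            out[field_number] = value
    return out


# Hand-built message: region = 3 (field 10), is_eu_origin = true (field 14),
# plus a hypothetical future string field 15 this reader has never heard of.
message = bytes([0x50, 0x03, 0x70, 0x01, 0x7A, 0x03]) + b"new"
known = read_known_fields(message, wanted={10, 14})
print(known)
```

The reader never needs a name or a type definition for field 15. The wire type says "length-delimited" and the length says how far to jump, which is all an old consumer needs to stay in sync with a newer producer.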

The Go consumer on the edge path

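Our cache warmer is Go because it runs as a small binary close to the edge and we want predictable latency and memory. It reads the batch, picks the top viral videos per region, and pre-warms the Cloudflare cache for those watch pages. Go's protobuf generation (protoc --go_out=gen/ video_meta.proto) produces idiomatic structs. The Go generator also needs an import path for the file, supplied through a go_package option in the .proto or an M flag on the command line; the import path in the code below assumes one is configured.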

package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"os"
	"sort"

	"google.golang.org/protobuf/proto"
	pb "viralvidvault/gen/metadata/v1"
)

// topPerRegion gunzips and decodes a batch, then keeps the n videos with the
// highest 24h viral score for each region.
func topPerRegion(blob []byte, n int) (map[pb.Region][]*pb.VideoMeta, error) {
	gz, err := gzip.NewReader(bytes.NewReader(blob))
	if err != nil {
		return nil, fmt.Errorf("gzip: %w", err)
	}
	defer gz.Close()

	raw, err := io.ReadAll(gz)
	if err != nil {
		return nil, fmt.Errorf("read: %w", err)
	}

	var batch pb.VideoBatch
	if err := proto.Unmarshal(raw, &batch); err != nil {
		return nil, fmt.Errorf("unmarshal: %w", err)
	}

	grouped := make(map[pb.Region][]*pb.VideoMeta)
	for _, v := range batch.Videos {
		grouped[v.Region] = append(grouped[v.Region], v)
	}

	for region, vids := range grouped {
		// Highest score first, then truncate to the top n.
		sort.Slice(vids, func(i, j int) bool {
			return vids[i].ViralScore_24H > vids[j].ViralScore_24H
		})
		if len(vids) > n {
			grouped[region] = vids[:n]
		}
	}
	return grouped, nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: warmer <batch.pb.gz>")
		os.Exit(2)
	}
	blob, err := os.ReadFile(os.Args[1])
	if err != nil {
		panic(err)
	}
	top, err := topPerRegion(blob, 20)
	if err != nil {
		panic(err)
	}
	for region, vids := range top {
		fmt.Printf("%-18s %d videos to warm\n", region.String(), len(vids))
	}
}

Three languages, one schema, no manual struct definitions, no field-name typos that slip through unnoticed. Adding a field breaks nobody. When a field that a consumer reads is renamed or removed, that consumer fails as soon as it regenerates: at compile time in Go, and at the first test run that touches the missing accessor in PHP and Python. That is the failure mode you want, loud and early rather than silent and in production.

Storing the blobs in SQLite

ViralVidVault runs on SQLite in WAL mode, and we keep the last few cycles of raw batches around for debugging and replay. There is no reason to explode a protobuf back into 35 columns just to store it — we keep the blob as a BLOB and only index the handful of columns we actually query on. SQLite handles binary blobs natively and WAL mode keeps the writes from blocking the warm reads.

<?php
declare(strict_types=1);

function storeBatch(\PDO $db, int $cycleId, string $blob): void
{
    // Schema: CREATE TABLE batch_archive (
    //   cycle_id   INTEGER PRIMARY KEY,
    //   fetched_at INTEGER NOT NULL,
    //   payload    BLOB NOT NULL,
    //   size_bytes INTEGER NOT NULL
    // );
    $stmt = $db->prepare(
        'INSERT OR REPLACE INTO batch_archive '
        . '(cycle_id, fetched_at, payload, size_bytes) '
        . 'VALUES (:cid, :ts, :blob, :sz)'
    );
    $stmt->bindValue(':cid', $cycleId, \PDO::PARAM_INT);
    $stmt->bindValue(':ts', time(), \PDO::PARAM_INT);
    $stmt->bindValue(':blob', $blob, \PDO::PARAM_LOB);
    $stmt->bindValue(':sz', strlen($blob), \PDO::PARAM_INT);
    $stmt->execute();
}

function loadBatch(\PDO $db, int $cycleId): ?\Viralvidvault\Metadata\V1\VideoBatch
{
    $stmt = $db->prepare('SELECT payload FROM batch_archive WHERE cycle_id = :cid');
    $stmt->bindValue(':cid', $cycleId, \PDO::PARAM_INT);
    $stmt->execute();
    $blob = $stmt->fetchColumn();
    if ($blob === false) {
        return null;
    }
    $batch = new \Viralvidvault\Metadata\V1\VideoBatch();
    $batch->mergeFromString($blob);
    return $batch;
}

The key detail is PDO::PARAM_LOB. Bind protobuf bytes as a LOB, not a string, so PDO doesn't try to apply any text handling to the bytes. Reading back, fetchColumn() returns the raw bytes and mergeFromString() reconstructs the object. A week of cycle archives that used to be a multi-gigabyte pile of JSON now sits comfortably in a SQLite file small enough to copy around for local debugging.
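Since the archive is what we copy around for local debugging, it is handy to be able to open it from other languages too. Below is a rough Python equivalent of the store and load functions using the standard sqlite3 module. It assumes the same batch_archive table; the function names, the temporary file path, and the stand-in payload are all made up for the example.

```python
import os
import sqlite3
import tempfile
from typing import Optional


def open_archive(path: str) -> sqlite3.Connection:
    """Open (or create) the archive database in WAL mode."""
    db = sqlite3.connect(path)
    db.execute("PRAGMA journal_mode=WAL")
    db.execute(
        "CREATE TABLE IF NOT EXISTS batch_archive ("
        " cycle_id INTEGER PRIMARY KEY,"
        " fetched_at INTEGER NOT NULL,"
        " payload BLOB NOT NULL,"
        " size_bytes INTEGER NOT NULL)"
    )
    return db


def store_batch(db: sqlite3.Connection, cycle_id: int, fetched_at: int, blob: bytes) -> None:
    """Insert or overwrite one cycle; a bytes parameter is bound as a BLOB."""
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT OR REPLACE INTO batch_archive"
            " (cycle_id, fetched_at, payload, size_bytes) VALUES (?, ?, ?, ?)",
            (cycle_id, fetched_at, blob, len(blob)),
        )


def load_batch(db: sqlite3.Connection, cycle_id: int) -> Optional[bytes]:
    """Return the raw payload for a cycle, or None if it is not archived."""
    row = db.execute(
        "SELECT payload FROM batch_archive WHERE cycle_id = ?", (cycle_id,)
    ).fetchone()
    return None if row is None else bytes(row[0])


with tempfile.TemporaryDirectory() as tmp:
    db = open_archive(os.path.join(tmp, "archive.db"))
    blob = bytes(range(256))  # stand-in for a serialized VideoBatch
    store_batch(db, 4821, 1_700_000_000, blob)
    restored = load_batch(db, 4821)
    missing = load_batch(db, 1)
    db.close()

print("round trip intact:", restored == blob, "| unknown cycle:", missing)
```

The stand-in payload deliberately contains every byte value from 0 to 255, including NUL, to show that nothing in the round trip treats the data as text. In real use the returned bytes would go straight into the generated class's ParseFromString.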

What actually changed in production

Numbers, because that is the only thing that justifies the migration effort. Measured over a full week of two-hour cycles on the production ingest worker:

  • Payload size: average per-record wire size dropped from ~2.4 KB of JSON to ~720 bytes of gzipped protobuf — a 70% reduction. Uncompressed protobuf alone was ~1.1 KB; gzip on top of the already-compact binary still helps because titles and tags are text.
  • Encode time in PHP: building and serializing a 40,000-record batch went from ~310 ms with json_encode to ~95 ms with the protobuf C extension.
  • Decode in the Go warmer: full unmarshal of the batch dropped from ~140 ms (JSON) to ~38 ms (protobuf), and peak memory roughly halved because there are no intermediate map allocations.
  • Bandwidth per cycle: from ~280 MB to ~84 MB across all three fan-out consumers.
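The headline percentages follow from the per-record and per-batch figures above with a few lines of arithmetic. This sketch uses only the numbers already quoted and decimal megabytes; the small gap between its output and the rounded bandwidth figures presumably comes from rounding or the unit convention, which is an assumption on my part.

```python
RECORDS_PER_CYCLE = 40_000
CONSUMERS = 3


def cycle_megabytes(bytes_per_record: float) -> float:
    """Total fan-out traffic per cycle in decimal megabytes."""
    return bytes_per_record * RECORDS_PER_CYCLE * CONSUMERS / 1_000_000


json_mb = cycle_megabytes(2_400)   # ~2.4 KB of JSON per record
proto_mb = cycle_megabytes(720)    # ~720 bytes of gzipped protobuf per record
size_reduction = 1 - 720 / 2_400

encode_speedup = 310 / 95          # PHP: json_encode vs protobuf C extension
decode_speedup = 140 / 38          # Go: JSON vs protobuf unmarshal

print(f"bandwidth per cycle: {json_mb:.1f} MB -> {proto_mb:.1f} MB")
print(f"size reduction: {size_reduction:.0%}")
print(f"encode speedup: {encode_speedup:.1f}x, decode speedup: {decode_speedup:.1f}x")
```

Worked through this way, the timings come out at a bit over 3x for both encode and decode.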

The size win is nice, but the contract win is what I'd actually sell to a teammate. Schema drift bugs went to zero because a missing or renamed field now breaks the build or the test run of every consumer that reads it, not the runtime of one.

A note on GDPR and field-level discipline

Because everything is European-facing, one habit matters: the schema is also where you enforce data minimization. The VideoMeta message contains zero personal data — no IP addresses, no user identifiers, no watch-session tokens. Those live in a completely separate, short-retention pipeline that never touches this batch. Keeping the inter-service contract free of personal data by construction means the analytics job physically cannot leak it, regardless of what the code does. A schema is a good place to make the privacy boundary explicit rather than relying on every developer to remember it.
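One way to keep that boundary from eroding over time is a small check in CI that scans the .proto for field names that look like personal data. The sketch below is a hypothetical lint, not something from our pipeline: the denylist, the regular expression, and the function name are all illustrative, and the regex only understands simple one-line field declarations (no map fields or multi-line options).

```python
import re

# Hypothetical denylist; a real project would tune this to its own data model.
FORBIDDEN = ("ip", "ip_address", "user_id", "email", "session_token")

# Matches simple field declarations such as "repeated string tags = 11;".
FIELD_RE = re.compile(
    r"^\s*(?:repeated\s+|optional\s+)?[\w.]+\s+(\w+)\s*=\s*\d+\s*;",
    re.MULTILINE,
)


def personal_data_fields(proto_source: str) -> list[str]:
    """Return declared field names that match the denylist."""
    flagged = []
    for name in FIELD_RE.findall(proto_source):
        for bad in FORBIDDEN:
            if name == bad or name.startswith(bad + "_") or name.endswith("_" + bad):
                flagged.append(name)
                break
    return flagged


clean_schema = """
syntax = "proto3";
message VideoMeta {
  string video_id = 1; // source platform id
  string channel_id = 3;
  bool is_eu_origin = 14;
}
"""

leaky_schema = clean_schema.replace(
    "bool is_eu_origin = 14;",
    "bool is_eu_origin = 14;\n  string user_id = 15;\n  string client_ip = 16;",
)

print(personal_data_fields(clean_schema))
print(personal_data_fields(leaky_schema))
```

A name-based check like this is only a tripwire, since it cannot tell what a field actually contains, but it turns "someone added user_id to the batch" into a failed build that a reviewer has to consciously override.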

When not to bother

Protobuf is not the answer for everything. If your payloads are small, human-debuggability matters more than bytes, or the data crosses a boundary you don't control (a public API where consumers expect JSON), stick with JSON. The cost of protobuf is real: an extra build step, generated code to manage, and binary blobs you can't curl | jq to inspect. We pay that cost specifically on high-volume internal service-to-service traffic where the size and contract benefits dominate. Our public endpoints still speak JSON, because that's what the browser and third-party tools want.

Conclusion

The migration took about two days: write the .proto, generate for three languages, swap the serialization layer, and run both formats in parallel for a cycle to confirm field-for-field equivalence of the decoded data. The result is a 70% smaller wire footprint, encode and decode that are each more than 3x faster (310 ms to 95 ms in PHP, 140 ms to 38 ms in Go), and, most valuable, a single typed contract that turns schema drift into a loud build or test failure instead of a silent production bug. If you're moving structured records between services in more than one language and JSON parse time or bandwidth is showing up in your traces, protobuf is worth the two days. Start with one hot path, measure before and after, and only expand once the numbers justify it.