
URL: https://dev.to/ddinhcchi/why-every-computer-vision-team-ends-up-rewriting-the-same-video-clip-pipeline-1pb1

Why Every Computer Vision Team Ends Up Rewriting the Same Video Clip Pipeline


Shipping Evidence Clips for Computer Vision Events

If you've shipped a computer vision system to production, you know the moment.

The detector fires. The alert fires. And then someone on ops opens the alert, sees:

{"event_id":"violation_001","timestamp":1716530001.2}

and replies:

"OK, where's the video?"

That's the gap this post is about.

The Actual Problem

What ops wants is a short MP4 — typically 10–30 seconds — with the bounding box drawn on top of the relevant footage, so they can open it in QuickTime or VLC and forward it to whoever needs to see it.

Not a JSON sidecar.

Not a frame extraction.

A short video with the box visibly on the suspect.

Building this turns out to be a chain of small problems, each of which is "fine, I'll just do it myself":

  1. Open the source video (RTSP feed or saved file).
  2. Seek to the event window.
  3. Decode the frames.
  4. Look up the detection for each frame.
  5. Draw the box and label cleanly.
  6. Encode the result to MP4.
  7. Handle the case where the event spans two files because your NVR cuts recordings at the hour boundary.
  8. Handle the case where your operator wants ten events from a half-day recording without waiting half a day.
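The off-by-one usually hides in step 2, where a window in seconds becomes a range of frame indices. Here is a minimal sketch of that conversion, assuming a constant frame rate and a half-open range (the frame at exactly the end time is excluded). The function name `event_frame_range` is hypothetical and is not part of any library mentioned in this post.

```python
import math


def event_frame_range(start_s: float, end_s: float, fps: float, total_frames: int) -> range:
    """Map an event window in seconds to a half-open range of frame indices.

    Assumes constant frame rate; variable-frame-rate sources need real timestamps.
    """
    if end_s <= start_s:
        raise ValueError("end_s must be greater than start_s")
    # First frame whose display interval overlaps the window start.
    first = max(0, math.floor(start_s * fps))
    # Half-open upper bound, clamped to the length of the file.
    stop = min(total_frames, math.ceil(end_s * fps))
    return range(first, stop)


frames = event_frame_range(12.5, 22.0, fps=30.0, total_frames=9000)
print(frames.start, frames.stop, len(frames))
```

Choosing one convention (half-open here) and applying it to seeking, detection lookup, and encoding alike is what keeps the clip length equal to `(end - start) * fps`.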

Every CV team I've worked with hand-rolls this pipeline once, ships the off-by-one to production, then writes it again on the next project.

Three months later, someone breaks the FFmpeg subprocess invocation and nobody notices for two weeks because the smoke test only checks:

"Did a file get written?"

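A smoke test can check more than file existence. The sketch below counts decoded video frames with `ffprobe` and compares the count against what the event window should produce. It assumes `ffprobe` is on the PATH. The helper names (`ffprobe_command`, `check_probe`, `smoke_test`) are made up for illustration.

```python
import json
import subprocess


def ffprobe_command(path: str) -> list[str]:
    """Build an ffprobe call that decodes the first video stream and counts frames."""
    return [
        "ffprobe", "-v", "error",
        "-select_streams", "v:0",
        "-count_frames",
        "-show_entries", "stream=codec_name,nb_read_frames",
        "-of", "json",
        path,
    ]


def check_probe(probe_json: str, expected_frames: int, tolerance: int = 1) -> int:
    """Validate ffprobe JSON output; return the frame count or raise AssertionError."""
    streams = json.loads(probe_json).get("streams", [])
    if not streams:
        raise AssertionError("output has no video stream")
    # ffprobe reports numeric fields as strings in JSON output.
    frames = int(streams[0]["nb_read_frames"])
    if abs(frames - expected_frames) > tolerance:
        raise AssertionError(f"expected ~{expected_frames} frames, got {frames}")
    return frames


def smoke_test(path: str, expected_frames: int) -> int:
    """Run ffprobe on a rendered clip and check its frame count."""
    result = subprocess.run(
        ffprobe_command(path), capture_output=True, text=True, check=True
    )
    return check_probe(result.stdout, expected_frames)
```

A broken FFmpeg invocation that still writes a zero-frame or truncated file would fail this check, while a "file exists" test would pass.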

What's Actually Available

Three things come close.

Supervision

supervision is an excellent drawing-utilities library (39k+ GitHub stars).

But its VideoSink is essentially cv2.VideoWriter with mp4v hard-coded.

It has:

  • No event-window trimming
  • No codec selection
  • No concept of events spanning multiple files

It's a tool for:

"Annotate every frame and write the whole thing back out."


DeepStream Smart Record

DeepStream Smart Record is NVIDIA's official solution.

It works.

In C.

The official Python bindings (pyds) still don't expose Smart Record functionality. NVIDIA staff have confirmed this on their forums, and that situation remained unchanged through DeepStream 7.1.

There are community forks that provide custom wheels as a workaround.

Smart Record itself also has open reports involving multi-stream crashes on DS 7.1.

If you're already inside a DeepStream pipeline, it can be a good option.

If you're not, you may end up learning DeepStream just to produce an MP4 clip.


KeyClipWriter

KeyClipWriter from PyImageSearch is the ring-buffer pattern everyone copies.

It's a tutorial, not a maintained library.

It's detection-agnostic, so you wire up all overlay logic yourself.
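For readers who haven't seen it, the ring-buffer pattern can be sketched in a few lines with `collections.deque`. This is not the tutorial's code; it is an illustrative reimplementation. The class name `RingClipper` and its parameters are assumptions, and plain integers stand in for decoded frames.

```python
from collections import deque


class RingClipper:
    """Keep the last `pre_frames` frames; on trigger, collect them plus `post_frames` more."""

    def __init__(self, pre_frames: int, post_frames: int):
        self.pre = deque(maxlen=pre_frames)  # oldest frames fall off automatically
        self.post_frames = post_frames
        self.active = None       # frames of the clip being collected, if any
        self.remaining = 0       # post-event frames still to collect
        self.finished = []       # completed clips, each a list of frames

    def trigger(self) -> None:
        """Start a clip from the buffered frames, or extend the one in progress."""
        if self.active is None:
            self.active = list(self.pre)
        self.remaining = self.post_frames

    def push(self, frame) -> None:
        """Feed one decoded frame, in order."""
        if self.active is not None:
            self.active.append(frame)
            self.remaining -= 1
            if self.remaining <= 0:
                self.finished.append(self.active)
                self.active = None
        self.pre.append(frame)


clipper = RingClipper(pre_frames=3, post_frames=2)
for i in range(5):
    clipper.push(i)      # frames 0..4 arrive before the detector fires
clipper.trigger()
for i in range(5, 10):
    clipper.push(i)
print(clipper.finished)
```

A real recorder would hand the collected frames to an encoder instead of a list, and that hand-off is where the trim behavior ends up depending on the writer.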

The trim semantics are roughly:

"Whatever OpenCV ends up doing."


So the landscape looks like:

  • A popular drawing library that doesn't ship clips
  • A vendor SDK with limited Python support
  • A roughly ten-year-old tutorial

The gap is real.


A Library Version of That Pattern

I wrote cv-evidence-renderer to be the library version of what every team eventually hand-rolls.

MIT licensed.

Pure Python install.

The simplest usage looks like this:

from cv_evidence_renderer import render_from_jsonl

render_from_jsonl(
    video="incidents/raw_001.mp4",
    detections_jsonl="incidents/raw_001.detections.jsonl",
    event_start=12.5,   # seconds from the start of the file
    event_end=22.0,
    output="evidence/event_001.mp4",
)

Events That Span NVR File Boundaries

Hour-segmented NVR recordings are common.

If an event starts near the end of one file and continues into the next, you can render it as a single continuous clip:

from cv_evidence_renderer import render_clip, ClipSource

render_clip(
    sources=[
        # Last 30 seconds of the 22:00 file (an hour-long file ends at 3600 s).
        ClipSource(
            video="cam_22-00.mp4",
            detections="d_22.jsonl",
            from_seconds=3570,
            to_seconds=3600,
        ),
        # First 90 seconds of the 23:00 file.
        ClipSource(
            video="cam_23-00.mp4",
            detections="d_23.jsonl",
            from_seconds=0,
            to_seconds=90,
        ),
    ],
    output="evidence_cross_file.mp4",
)

The output is one continuous MP4.

Each detection JSONL remains keyed to its own local file timeline:

  • Frame 0 = first frame of that file
  • No global concatenated timeline required
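Because each source uses its own local timeline, the caller has to turn a wall-clock event window into per-file ranges. A minimal sketch of that split follows, assuming fixed-length segments that start exactly on the boundary. The function `split_across_segments` is hypothetical and not part of the library.

```python
def split_across_segments(start_s: float, end_s: float, segment_seconds: float = 3600.0):
    """Split a window on a global timeline into (segment_index, from_s, to_s) pieces.

    `from_s` and `to_s` are local to each segment, so 0 is that file's first frame.
    Assumes every segment is exactly `segment_seconds` long with no gaps.
    """
    if end_s <= start_s:
        raise ValueError("end_s must be greater than start_s")
    pieces = []
    cursor = start_s
    while cursor < end_s:
        index = int(cursor // segment_seconds)
        seg_start = index * segment_seconds
        piece_end = min(end_s, seg_start + segment_seconds)
        pieces.append((index, cursor - seg_start, piece_end - seg_start))
        cursor = piece_end
    return pieces


# An event from 22:59:30 to 23:01:30, in seconds since midnight.
event_start = 22 * 3600 + 3570
event_end = 23 * 3600 + 90
print(split_across_segments(event_start, event_end))
```

Each tuple maps onto one source entry: the segment index picks the file, and the two offsets become its local from/to values. Real NVRs often drift or drop a few frames at the cut, so production code would read actual file start times rather than assume exact hour boundaries.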

All sources must share:

  • Width
  • Height
  • FPS (within 1%)

Resampling is intentionally out of scope.
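A pre-flight check for these constraints is cheap to write on the caller's side. The sketch below is an illustration of the rule as stated above, not the library's internal validation. `VideoInfo` and `check_compatible` are made-up names, and the width, height, and FPS values would come from whatever probing tool you already use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoInfo:
    width: int
    height: int
    fps: float


def check_compatible(infos: list[VideoInfo], fps_tolerance: float = 0.01) -> None:
    """Raise ValueError unless all sources match the first in size and FPS (within 1%)."""
    if not infos:
        raise ValueError("at least one source is required")
    first = infos[0]
    for position, info in enumerate(infos[1:], start=1):
        if (info.width, info.height) != (first.width, first.height):
            raise ValueError(
                f"source {position} is {info.width}x{info.height}, "
                f"expected {first.width}x{first.height}"
            )
        # Relative tolerance, measured against the first source's frame rate.
        if abs(info.fps - first.fps) > fps_tolerance * first.fps:
            raise ValueError(
                f"source {position} runs at {info.fps} fps, expected ~{first.fps}"
            )


# 29.97 vs 30.0 fps is a 0.1% difference, so this pair passes.
check_compatible([VideoInfo(1920, 1080, 30.0), VideoInfo(1920, 1080, 29.97)])
```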


Batch Rendering Shared Sources

Things get more interesting when you have many events from one long recording.

The naïve approach:

for event in events:
    render_clip(...)

Each render starts decoding from the beginning again.

That's a lot of duplicated work.

So the library includes a batch API:

from cv_evidence_renderer import Clip, ClipSource, render_clips

# events is a list of (start_seconds, end_seconds) pairs
render_clips(
    clips=[
        Clip(
            sources=[
                ClipSource(
                    video="day.mp4",
                    detections="day.jsonl",
                    from_seconds=s,
                    to_seconds=e,
                )
            ],
            output=f"evidence/event_{i:03d}.mp4",
            max_duration_seconds=15,
            duration_strategy="timelapse",
        )
        for i, (s, e) in enumerate(events)
    ],
)

When multiple clips reference the same source file:

  • The file is opened once
  • Frames are decoded once
  • Each decoded frame is dispatched to all interested clip encoders
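The decode-once idea can be illustrated without any video code. In the sketch below, a generator stands in for the decoder and plain lists stand in for per-clip encoders. Every name (`fake_decoder`, `dispatch_frames`) is hypothetical, and this shows the pattern only, not the library's implementation.

```python
from itertools import islice


def fake_decoder(total_frames: int, counter: dict):
    """Stand-in for a video decoder; counts how many frames were actually decoded."""
    for index in range(total_frames):
        counter["decoded"] += 1
        yield f"frame-{index}"


def dispatch_frames(frames, clip_ranges: dict) -> dict:
    """Walk the source once and hand each frame to every clip whose range contains it.

    `clip_ranges` maps a clip name to a half-open (first, stop) frame range.
    """
    outputs = {name: [] for name in clip_ranges}
    # No clip needs anything past the largest stop index, so stop decoding there.
    last_needed = max(stop for _, stop in clip_ranges.values())
    for index, frame in enumerate(islice(frames, last_needed)):
        for name, (first, stop) in clip_ranges.items():
            if first <= index < stop:
                outputs[name].append(frame)
    return outputs


counter = {"decoded": 0}
clips = {"event_000": (2, 5), "event_001": (4, 8)}
outputs = dispatch_frames(fake_decoder(100, counter), clips)
print(counter["decoded"], {name: len(frames) for name, frames in outputs.items()})
```

The overlapping frame (index 4) lands in both clips but is decoded only once, which is the saving the batch path is after.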

Each clip can still have:

  • Different overlays
  • Different frame strides
  • Different duration strategies
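The post doesn't spell out how a duration cap translates into a frame stride. One plausible reading of a timelapse strategy is "keep every Nth frame so the clip fits under the cap". The sketch below works through that arithmetic. It is an assumption about the behavior, not the library's documented algorithm, and `timelapse_stride` is a made-up helper.

```python
import math


def timelapse_stride(event_seconds: float, max_duration_seconds: float) -> int:
    """Smallest integer stride that brings the clip at or under the cap.

    A stride of 1 keeps every frame; a stride of 4 keeps every fourth frame.
    """
    if event_seconds <= 0 or max_duration_seconds <= 0:
        raise ValueError("durations must be positive")
    if event_seconds <= max_duration_seconds:
        return 1
    return math.ceil(event_seconds / max_duration_seconds)


fps = 30
event_seconds = 60
stride = timelapse_stride(event_seconds, max_duration_seconds=15)
kept = len(range(0, event_seconds * fps, stride))
print(stride, kept, kept / fps)
```

Under this reading, a 60-second event with a 15-second cap plays back at 4x speed, and events already shorter than the cap are left untouched.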

Unique-source clips fall back to the standard per-clip path.


Where This Fits in the Pipeline

The scope is intentionally small.

The library does not:

  • Perform detection
  • Perform tracking
  • Handle alerting
  • Handle live streaming

Bring your own detector:

  • YOLO
  • Detectron2
  • Anything that can produce bounding boxes

Bring your own tracker.

The library does one thing:

Take video + detections and produce the evidence MP4 your ops team actually wanted.


YOLO Integration Example

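from ultralytics import YOLO
from cv_evidence_renderer.adapters import from_yolo_results

model = YOLO("yolov8n.pt")

# stream=True yields results frame by frame instead of holding
# every frame's results in memory for the whole video.
results = model("incidents/raw_001.mp4", stream=True)

# One detection record per frame, indexed from the first frame of the file.
frame_detections = [
    from_yolo_results(r, frame_idx=i)
    for i, r in enumerate(results)
]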

Benchmark

Measured end-to-end:

Decode → Overlay → Encode

Hardware:

  • Apple M4 CPU
  • libx264 encoder

Resolution   Render time (5 s @ 30 fps)   Throughput
480p         0.53 s                       282 fps
720p         0.89 s                       168 fps
1080p        1.70 s                       88 fps
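The throughput column is the clip's frame count divided by render time. The short script below redoes that arithmetic from the render times in the table and adds a real-time factor against a 30 fps source. Results can differ from the table by about one fps because the listed times are rounded to two decimals.

```python
CLIP_FRAMES = 5 * 30  # a 5-second clip at 30 fps
SOURCE_FPS = 30


def throughput(frames: int, render_seconds: float) -> float:
    """Frames rendered per second of wall-clock time."""
    return frames / render_seconds


render_times = {"480p": 0.53, "720p": 0.89, "1080p": 1.70}
for resolution, seconds in render_times.items():
    fps = throughput(CLIP_FRAMES, seconds)
    print(f"{resolution}: {fps:.0f} fps, {fps / SOURCE_FPS:.1f}x real time")
```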

Benchmark command:

python scripts/benchmark.py

NVENC support is planned for v0.2.


What It Doesn't Do (Yet)

NVENC GPU Encoding

Designed and stubbed, but not wired up yet.

For many offline workflows, CPU rendering is already faster than real time through 1080p.


Live RTSP Recording

The EvidenceRecorder API exists but currently raises:

NotImplementedError

Ring buffers and keyframe-aware trigger logic are planned for v0.2.


Custom Zones, Lines, and Overlay Plugins

Planned for v0.3.

The plugin API needs real-world feedback before being frozen.


Installation

pip install cv-evidence-renderer

Optional Supervision adapters:

pip install "cv-evidence-renderer[supervision]"

Links

MIT licensed.

CI on:

  • Linux
  • macOS

Across:

  • Python 3.10
  • Python 3.11
  • Python 3.12

Feedback is welcome — open an issue or reach out through the repository.