Shipping Evidence Clips for Computer Vision Events
If you've shipped a computer vision system to production, you know the moment.
The detector fires. The alert fires. And then someone on ops opens the alert, sees:
{"event_id":"violation_001","timestamp":1716530001.2}
and replies:
"OK, where's the video?"
That's the gap this post is about.
The Actual Problem
What ops wants is a short MP4 — typically 10–30 seconds — with the bounding box drawn on top of the relevant footage, so they can open it in QuickTime or VLC and forward it to whoever needs to see it.
Not a JSON sidecar.
Not a frame extraction.
A short video with the box visibly on the suspect.
Building this turns out to be a chain of small problems, each of which is "fine, I'll just do it myself":
- Open the source video (RTSP feed or saved file).
- Seek to the event window.
- Decode the frames.
- Look up the detection for each frame.
- Draw the box and label cleanly.
- Encode the result to MP4.
- Handle the case where the event spans two files because your NVR cuts recordings at the hour boundary.
- Handle the case where your operator wants ten events from a half-day recording without waiting half a day.
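The seek step in the list above is where the off-by-one usually creeps in. As a minimal sketch (the function name and the half-open convention are assumptions for illustration, not part of any library), an event window in seconds can be mapped to frame indices like this:

```python
import math


def event_frame_range(event_start: float, event_end: float, fps: float) -> range:
    """Return the half-open range of frame indices covering [event_start, event_end).

    Frame i is assumed to cover the interval [i / fps, (i + 1) / fps).
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    if event_end <= event_start:
        raise ValueError("event_end must be after event_start")
    # Round before floor/ceil so float noise such as 3.0000000000000004
    # does not pull in an extra frame.
    first = math.floor(round(event_start * fps, 6))
    last_exclusive = math.ceil(round(event_end * fps, 6))
    return range(first, last_exclusive)


frames = event_frame_range(12.5, 22.0, fps=30.0)
print(frames.start, frames.stop, len(frames))
```

Treating the window as half-open means two back-to-back windows never share a frame, which is one common source of the duplicated or missing frame at a boundary.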
Every CV team I've worked with hand-rolls this pipeline once, ships the off-by-one to production, then writes it again on the next project.
Three months later, someone breaks the FFmpeg subprocess invocation and nobody notices for two weeks because the smoke test only checks:
"Did a file get written?"
What's Actually Available
Three things come close.
Supervision
supervision is an excellent drawing-utilities library (39k+ GitHub stars).
But its VideoSink is essentially a thin wrapper around cv2.VideoWriter, with mp4v as the default codec.
It has:
- No event-window trimming
- No codec selection beyond an OpenCV FourCC string
- No concept of events spanning multiple files
It's a tool for:
"Annotate every frame and write the whole thing back out."
DeepStream Smart Record
DeepStream Smart Record is NVIDIA's official solution.
It works.
In C.
The official Python bindings (pyds) still don't expose Smart Record functionality. NVIDIA staff have confirmed this on their forums, and that situation remained unchanged through DeepStream 7.1.
There are community forks that provide custom wheels as a workaround.
Smart Record itself also has open reports involving multi-stream crashes on DS 7.1.
If you're already inside a DeepStream pipeline, it can be a good option.
If you're not, you may end up learning DeepStream just to produce an MP4 clip.
KeyClipWriter
KeyClipWriter from PyImageSearch is the ring-buffer pattern everyone copies.
It's a tutorial, not a maintained library.
It's detection-agnostic, so you wire up all overlay logic yourself.
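For reference, the ring-buffer pattern itself fits in a few lines. The sketch below is not the tutorial's code: the class name, the `pre_frames` and `post_frames` parameters, and the use of integers as stand-in frames are all assumptions made for illustration.

```python
from collections import deque
from typing import Any, Deque, List, Optional


class PreEventBuffer:
    """Keep the last `pre_frames` frames; on a trigger, emit them plus `post_frames` more."""

    def __init__(self, pre_frames: int, post_frames: int) -> None:
        self._ring: Deque[Any] = deque(maxlen=pre_frames)
        self._post_frames = post_frames
        self._remaining = 0
        self._clip: Optional[List[Any]] = None
        self.finished_clips: List[List[Any]] = []

    def push(self, frame: Any, triggered: bool = False) -> None:
        if self._clip is None:
            if not triggered:
                # Idle: remember the frame, dropping the oldest one if full.
                self._ring.append(frame)
                return
            # Trigger: start a clip from the buffered history plus this frame.
            self._clip = list(self._ring) + [frame]
            self._ring.clear()
            self._remaining = self._post_frames
        else:
            self._clip.append(frame)
            if triggered:
                # A new trigger while recording extends the clip.
                self._remaining = self._post_frames
            else:
                self._remaining -= 1
        if self._remaining <= 0:
            self.finished_clips.append(self._clip)
            self._clip = None


buffer = PreEventBuffer(pre_frames=3, post_frames=2)
for index in range(10):
    buffer.push(index, triggered=(index == 5))
print(buffer.finished_clips)
```

In a real pipeline the integers would be decoded images and a finished clip would go to an encoder rather than a list.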
KeyClipWriter's trim semantics are roughly:
"Whatever OpenCV ends up doing."
So the landscape looks like:
- A popular drawing library that doesn't ship clips
- A vendor SDK with limited Python support
- A roughly decade-old tutorial
The gap is real.
A Library Version of That Pattern
I wrote cv-evidence-renderer to be the library version of what every team eventually hand-rolls.
MIT licensed.
Pure Python install.
The simplest usage looks like this:
    from cv_evidence_renderer import render_from_jsonl

    render_from_jsonl(
        video="incidents/raw_001.mp4",
        detections_jsonl="incidents/raw_001.detections.jsonl",
        event_start=12.5,
        event_end=22.0,
        output="evidence/event_001.mp4",
    )
Events That Span NVR File Boundaries
Hour-segmented NVR recordings are common.
If an event starts near the end of one file and continues into the next, you can render it as a single continuous clip:
    from cv_evidence_renderer import render_clip, ClipSource

    render_clip(
        sources=[
            # Last 30 seconds of the hour-long 22:00 file.
            ClipSource(
                video="cam_22-00.mp4",
                detections="d_22.jsonl",
                from_seconds=3570,
                to_seconds=3600,
            ),
            # First 90 seconds of the 23:00 file.
            ClipSource(
                video="cam_23-00.mp4",
                detections="d_23.jsonl",
                from_seconds=0,
                to_seconds=90,
            ),
        ],
        output="evidence_cross_file.mp4",
    )
The output is one continuous MP4.
Each detection JSONL remains keyed to its own local file timeline:
- Frame 0 = first frame of that file
- No global concatenated timeline required
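Because each source uses its own local timeline, something upstream has to turn an event's overall start and end into per-file windows. A minimal sketch, assuming fixed-length segments and times measured in seconds from the start of the first file (the `Segment` type and `split_across_files` helper are hypothetical, not part of the library):

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    file_index: int      # 0 for the first file, 1 for the next, and so on
    from_seconds: float  # local to that file
    to_seconds: float    # local to that file


def split_across_files(
    event_start: float, event_end: float, file_seconds: float = 3600.0
) -> List[Segment]:
    """Split [event_start, event_end) into one local window per recording file."""
    if event_start < 0 or event_end <= event_start:
        raise ValueError("expected 0 <= event_start < event_end")
    segments: List[Segment] = []
    index = int(event_start // file_seconds)
    cursor = event_start
    while cursor < event_end:
        file_begin = index * file_seconds
        # Stop at the event end or at the end of this file, whichever is first.
        stop = min(event_end, file_begin + file_seconds)
        segments.append(Segment(index, cursor - file_begin, stop - file_begin))
        cursor = stop
        index += 1
    return segments


# An event that starts 30 s before the hour boundary and ends 90 s after it.
for segment in split_across_files(3570.0, 3690.0):
    print(segment)
```

Each resulting segment maps onto one source entry, with `file_index` resolved to whatever file naming the NVR uses.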
All sources must share:
- Width
- Height
- FPS (within 1%)
Resampling is intentionally out of scope.
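It can be worth checking these constraints before kicking off a render. The sketch below is one way to do that up front; `SourceInfo` and `check_compatible` are hypothetical names, and reading "within 1%" as a relative difference against the first source is an assumption rather than the library's documented rule.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SourceInfo:
    width: int
    height: int
    fps: float


def check_compatible(sources: Sequence[SourceInfo], fps_tolerance: float = 0.01) -> None:
    """Raise ValueError if any source differs from the first in size or FPS."""
    if not sources:
        raise ValueError("at least one source is required")
    reference = sources[0]
    for position, source in enumerate(sources[1:], start=1):
        if (source.width, source.height) != (reference.width, reference.height):
            raise ValueError(f"source {position}: resolution differs from source 0")
        # Assumed meaning of "within 1%": relative to the first source's FPS.
        if abs(source.fps - reference.fps) > fps_tolerance * reference.fps:
            raise ValueError(f"source {position}: FPS differs by more than {fps_tolerance:.0%}")


# 30.0 vs 29.97 fps is inside the tolerance, so this passes silently.
check_compatible([SourceInfo(1920, 1080, 30.0), SourceInfo(1920, 1080, 29.97)])
```

The width, height, and FPS values would come from whatever probe step the pipeline already has, such as the capture properties of the opened video.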
Batch Rendering Shared Sources
Things get more interesting when you have many events from one long recording.
The naïve approach:
    for event in events:
        render_clip(...)
Each render starts decoding from the beginning again.
That's a lot of duplicated work.
So the library includes a batch API:
    from cv_evidence_renderer import Clip, ClipSource, render_clips

    render_clips(
        clips=[
            Clip(
                sources=[
                    ClipSource(
                        video="day.mp4",
                        detections="day.jsonl",
                        from_seconds=s,
                        to_seconds=e,
                    )
                ],
                output=f"evidence/event_{i:03d}.mp4",
                max_duration_seconds=15,
                duration_strategy="timelapse",
            )
            for i, (s, e) in enumerate(events)
        ],
    )
When multiple clips reference the same source file:
- The file is opened once
- Frames are decoded once
- Each decoded frame is dispatched to all interested clip encoders
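The decode-once idea can be illustrated without any video at all. This is a toy sketch, not the library's implementation: `decode_frames` is a stand-in decoder that yields strings, and each clip is represented by a `range` of the frame indices it wants.

```python
from typing import Iterator, List, Tuple


def decode_frames(frame_count: int) -> Iterator[Tuple[int, str]]:
    """Stand-in decoder: yields (frame_index, frame) exactly once per frame."""
    for index in range(frame_count):
        yield index, f"frame-{index}"


def render_batch(windows: List[range]) -> Tuple[List[List[str]], int]:
    """Decode the shared source once and hand each frame to every clip that wants it."""
    if not windows:
        return [], 0
    outputs: List[List[str]] = [[] for _ in windows]
    decoded = 0
    # No clip needs anything past the largest window end, so stop there.
    last_needed = max(window.stop for window in windows)
    for index, frame in decode_frames(last_needed):
        decoded += 1
        for clip_id, window in enumerate(windows):
            # Membership in a range is O(1) and respects the range's step.
            if index in window:
                outputs[clip_id].append(frame)
    return outputs, decoded


# Two overlapping clips plus one that keeps only every fourth frame.
windows = [range(0, 5), range(3, 8), range(2, 10, 4)]
outputs, decoded = render_batch(windows)
naive = sum(window.stop for window in windows)
print(f"decoded {decoded} frames in one pass; {naive} if each clip decoded from frame 0")
```

Using a `range` with a step per clip is one simple way to let clips that share a decode pass still sample frames at different rates.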
Each clip can still have:
- Different overlays
- Different frame strides
- Different duration strategies
Unique-source clips fall back to the standard per-clip path.
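The post does not spell out how the "timelapse" duration strategy chooses frames. One plausible reading, shown here purely as an assumption, is that it keeps every Nth frame so a long event fits under `max_duration_seconds` when played back at the source frame rate. The helper name below is hypothetical.

```python
import math


def timelapse_stride(event_seconds: float, max_duration_seconds: float) -> int:
    """Smallest whole-frame stride that brings the output under the duration cap."""
    if event_seconds <= 0 or max_duration_seconds <= 0:
        raise ValueError("durations must be positive")
    # Keeping every `stride`-th frame shortens playback by a factor of `stride`.
    return max(1, math.ceil(event_seconds / max_duration_seconds))


for event_seconds in (10.0, 100.0, 120.0):
    stride = timelapse_stride(event_seconds, max_duration_seconds=15.0)
    print(f"{event_seconds:.0f} s event: stride {stride}, output {event_seconds / stride:.1f} s")
```

Events already shorter than the cap get a stride of 1, meaning every frame is kept.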
Where This Fits in the Pipeline
The scope is intentionally small.
The library does not:
- Perform detection
- Perform tracking
- Handle alerting
- Handle live streaming
Bring your own detector:
- YOLO
- Detectron2
- Anything that can produce bounding boxes
Bring your own tracker.
The library does one thing:
Take video + detections and produce the evidence MP4 your ops team actually wanted.
YOLO Integration Example
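    from ultralytics import YOLO
    from cv_evidence_renderer.adapters import from_yolo_results

    model = YOLO("yolov8n.pt")

    # stream=True yields results frame by frame instead of holding
    # every frame's results in memory at once.
    results = model("incidents/raw_001.mp4", stream=True)

    # Convert each frame's result, tagging it with its frame index.
    frame_detections = [
        from_yolo_results(r, frame_idx=i)
        for i, r in enumerate(results)
    ]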
Benchmark
Measured end-to-end:
Decode → Overlay → Encode
Hardware:
- Apple M4 CPU
- libx264 encoder
| Resolution | Render Time (5s @ 30fps) | Throughput |
|---|---|---|
| 480p | 0.53 s | 282 fps |
| 720p | 0.89 s | 168 fps |
| 1080p | 1.70 s | 88 fps |
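The throughput column follows directly from the render times: a 5-second clip at 30 fps is 150 frames. A small sketch of that arithmetic, using only the numbers in the table (the helper names are made up for illustration):

```python
CLIP_SECONDS = 5
SOURCE_FPS = 30

# Render times in seconds, copied from the table above.
render_times = {"480p": 0.53, "720p": 0.89, "1080p": 1.70}


def throughput_fps(render_seconds: float) -> float:
    """Frames rendered per second of wall-clock time."""
    return CLIP_SECONDS * SOURCE_FPS / render_seconds


def realtime_factor(render_seconds: float) -> float:
    """How many seconds of footage are rendered per second of wall-clock time."""
    return CLIP_SECONDS / render_seconds


for resolution, seconds in render_times.items():
    print(
        f"{resolution}: {throughput_fps(seconds):.1f} fps, "
        f"{realtime_factor(seconds):.1f}x real time"
    )
```

Recomputed values can differ from the table by about a frame per second, presumably because the table's render times are rounded to two decimals.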
Benchmark command:
python scripts/benchmark.py
NVENC support is planned for v0.2.
What It Doesn't Do (Yet)
NVENC GPU Encoding
Designed and stubbed, but not wired up yet.
For many offline workflows, CPU rendering is already faster than real time through 1080p.
Live RTSP Recording
The EvidenceRecorder API exists but currently raises:
NotImplementedError
Ring buffers and keyframe-aware trigger logic are planned for v0.2.
Custom Zones, Lines, and Overlay Plugins
Planned for v0.3.
The plugin API needs real-world feedback before being frozen.
Installation
pip install cv-evidence-renderer
Optional Supervision adapters:
pip install "cv-evidence-renderer[supervision]"
Links
- GitHub: https://github.com/ddinhcchi/cv-evidence-renderer
- Documentation: https://ddinhcchi.github.io/cv-evidence-renderer/
MIT licensed.
CI on:
- Linux
- macOS
Across:
- Python 3.10
- Python 3.11
- Python 3.12
Feedback is welcome — open an issue or reach out through the repository.
