SwiftVR: Real-Time One-Step Generative Video Restoration
SwiftVR is the first generative video restoration model to reach real-time 1080p streaming on a consumer-grade GPU (≈26 FPS on a single RTX 5090), sustains 31 FPS at QHD (2560×1440) and 14 FPS at 4K (3840×2160) on a single H100, and streams at resolutions where every compared diffusion-based VR baseline runs out of memory.
arXiv
Project Page
GitHub
License
SwiftVR is a streaming one-step generative video restoration (VR) framework presented in SwiftVR: Real-Time One-Step Generative Video Restoration.
Updates
- [2026/06] Released the inference code and pretrained weights.
Highlights
- Mask-free shifted-window self-attention (MFSWA). Each spatial window is pre-gathered into a dense tensor, so every attention call reduces to a single standard scaled dot-product attention (SDPA) call: no attention mask, cyclic shift, or padding ever enters the graph. This gives a 1.62× throughput gain over its full-attention teacher at essentially identical quality, with no dedicated sparse kernel.
- Restoration-aware Autoencoder (ReAE). A lightweight encoder-decoder, jointly fine-tuned with the DiT in pixel space, removes the heavy-3D-VAE / tiled-decoding bottleneck.
- Causal chunk-wise streaming. A minimal causal protocol (no rolling KV cache, no overlapped DiT inference) bounds the temporal axis, confining the residual $\mathcal{O}(N^2)$ cost to the spatial axes.
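The window pre-gathering idea in the MFSWA bullet can be illustrated at the index level. The following is a minimal sketch, not the authors' implementation: the function names, the 8×8 token grid, and the window size of 4 are all hypothetical, and the shifted-window layout is not modeled because this README does not describe it. It only shows that grouping tokens into equal-sized windows yields a rectangular batch (so no mask or padding is needed) and how many query-key pairs are scored compared with full spatial attention.

```python
def gather_windows(height, width, win):
    """Group flat token indices of a height x width grid into win x win windows.

    Returns a rectangular list of shape [num_windows][win * win], so each row
    could be fed to an ordinary attention call with no mask or padding.
    """
    if height % win or width % win:
        raise ValueError("this sketch assumes the grid divides evenly into windows")
    windows = []
    for top in range(0, height, win):
        for left in range(0, width, win):
            windows.append([
                (top + dy) * width + (left + dx)
                for dy in range(win)
                for dx in range(win)
            ])
    return windows


def score_pairs_full(height, width):
    """Query-key pairs scored by full attention over all spatial tokens."""
    n = height * width
    return n * n


def score_pairs_windowed(height, width, win):
    """Query-key pairs scored when attention runs inside each window only."""
    return sum(len(w) ** 2 for w in gather_windows(height, width, win))


windows = gather_windows(8, 8, 4)
print(len(windows), len(windows[0]))   # 4 windows of 16 tokens each
print(score_pairs_full(8, 8))          # 4096
print(score_pairs_windowed(8, 8, 4))   # 1024
```

In a tensor framework the same grouping would typically be a reshape and transpose, with the window axis folded into the batch dimension before the attention call.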
Results
Efficiency at 2560×1440 (single H100, causal streaming, 24 frames)
| Metric | DOVE (tile) | SeedVR2-3B (tile) | FlashVSR-Tiny | SwiftVR (Ours) |
|---|---|---|---|---|
| Avg. Time (s) ↓ | 27.615 | 17.320 | 2.493 | 0.766 |
| FPS ↑ | 0.85 | 1.39 | 9.61 | 31.32 |
| Peak Mem. (GB) ↓ | 59.24 | 35.35 | 34.35 | 38.01 |
At 3840×2160, every compared diffusion-based VR baseline runs out of memory (OOM) on a single H100; SwiftVR sustains 14 FPS.
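To make the table easier to read, the relative speedups can be derived from the average-time row. This is a small arithmetic sketch over the numbers reported above; the dictionary and the helper name `speedup_over` are illustrative and not part of the SwiftVR code base.

```python
# Average time (s) per 24-frame clip at 2560x1440, copied from the table above.
avg_time_s = {
    "DOVE (tile)": 27.615,
    "SeedVR2-3B (tile)": 17.320,
    "FlashVSR-Tiny": 2.493,
    "SwiftVR": 0.766,
}


def speedup_over(baseline, ours="SwiftVR", times=avg_time_s):
    """Ratio of the baseline's average time to ours (higher means we are faster)."""
    return times[baseline] / times[ours]


for name in avg_time_s:
    if name != "SwiftVR":
        print(f"{name}: {speedup_over(name):.2f}x slower than SwiftVR")
```

By this measure SwiftVR is roughly 3.3 times faster than the fastest compared baseline (FlashVSR-Tiny) at this resolution, while its peak memory is slightly higher than that baseline's.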
Qualitative comparison
[Figure: SwiftVR teaser]
Installation
git clone https://github.com/H-oliday/SwiftVR.git
cd SwiftVR
conda create -n swiftvr python=3.10 -y
conda activate swiftvr
# Install PyTorch matching your CUDA toolkit first, e.g. CUDA 12.4:
pip install torch==2.10.0 torchvision==0.25.0 --index-url https://download.pytorch.org/whl/cu124
# Install SwiftVR (editable) and its dependencies:
pip install -e .
Model Zoo
| Model Name | Date | Backbone | Link |
|---|---|---|---|
| SwiftVR | 2026.06 | Wan2.2-TI2V-5B | HuggingFace |
huggingface-cli download H-oliday/SwiftVR --local-dir checkpoints/
Expected checkpoint layout (the directory passed to from_pretrained):
checkpoints/
├── reae.safetensors                # Restoration-aware Autoencoder weights
├── prompt_embedding.safetensors    # precomputed empty-prompt text embedding (key: "prompt_emb")
└── transformer/                    # diffusers-format DiT
    ├── config.json
    └── diffusion_pytorch_model.safetensors
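Before loading the model, it can help to confirm that the download produced the layout above. The following is a minimal sketch using only the standard library; the helper name `missing_checkpoint_files` is hypothetical, and the file list simply mirrors the tree shown above (it assumes `config.json` and `diffusion_pytorch_model.safetensors` sit inside `transformer/`).

```python
from pathlib import Path

# Relative paths taken from the expected layout above.
EXPECTED_FILES = (
    "reae.safetensors",
    "prompt_embedding.safetensors",
    "transformer/config.json",
    "transformer/diffusion_pytorch_model.safetensors",
)


def missing_checkpoint_files(root):
    """Return the expected files that are absent under root (empty list if complete)."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


missing = missing_checkpoint_files("checkpoints")
print("checkpoint layout OK" if not missing else f"missing: {missing}")
```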
Quick Start
Python API
from swiftvr import SwiftVRPipeline
pipe = SwiftVRPipeline.from_pretrained("H-oliday/SwiftVR").to("cuda", dtype="bfloat16")
pipe.restore_video("low_quality.mp4", "restored.mp4", upscale=4)
restore_video also accepts an image folder as input and can write a PNG sequence with png_save=True.
Tunable knobs include:
- clip_len: middle chunk size, multiple of 4
- dit_overlap: overlap for DiT inference
- fps: output video frame rate
- quality: 0-100, mapped to x265 CRF
- queue_size: pipeline queue size
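Two of the knobs above come with stated constraints (clip_len is a multiple of 4, quality lies in 0-100), so a caller may want to check them before starting a long job. This is a hedged sketch: `check_knobs` is a hypothetical helper, the requirement that clip_len be positive is an assumption, and the commented call assumes the knobs are passed to restore_video as keyword arguments, which this README implies but does not spell out.

```python
def check_knobs(clip_len, quality):
    """Raise ValueError if the documented constraints are violated."""
    # Assumption: clip_len must also be positive; the README only says "multiple of 4".
    if clip_len <= 0 or clip_len % 4 != 0:
        raise ValueError(f"clip_len must be a positive multiple of 4, got {clip_len}")
    if not 0 <= quality <= 100:
        raise ValueError(f"quality must be in [0, 100], got {quality}")
    return {"clip_len": clip_len, "quality": quality}


kwargs = check_knobs(clip_len=24, quality=90)
print(kwargs)
# Assumed usage, not verified against the package:
# pipe.restore_video("low_quality.mp4", "restored.mp4", upscale=4, **kwargs)
```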
Streaming (causal, chunk by chunk, no future frames)
Causal, chunk-by-chunk restoration without future frames.
session = pipe.stream(clip_len=24, resolution=(1920, 1080))
# read_chunks and write are not defined in this snippet; supply your own I/O helpers.
for lq_chunk in read_chunks("low_quality.mp4", n=24):  # lq_chunk: [T, H, W, 3] uint8
    hq = session.step(lq_chunk)  # [1, T', 3, H', W'] in [0, 1], or None if buffered
    if hq is not None:
        write(hq)
tail = session.flush()  # flush the final buffered frames
if tail is not None:
    write(tail)
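The streaming loop above relies on a chunking helper that this README does not define. One possible shape for it is sketched below: `iter_chunks` is a hypothetical stand-in that groups any iterable of already-decoded frames into consecutive lists of n frames. Video decoding and stacking each list into a [T, H, W, 3] uint8 array are left to the caller, and whether session.step accepts a final chunk shorter than clip_len is not stated here.

```python
from itertools import islice


def iter_chunks(frames, n):
    """Yield consecutive lists of up to n frames from any iterable of frames.

    The last list may be shorter than n when the frame count is not a multiple of n.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    it = iter(frames)
    while True:
        chunk = list(islice(it, n))
        if not chunk:
            return
        yield chunk


# Integers stand in for decoded frames here.
sizes = [len(chunk) for chunk in iter_chunks(range(50), 24)]
print(sizes)  # [24, 24, 2]
```

Because the helper pulls frames lazily, it never needs the whole video in memory and never reads ahead of the current chunk, which matches the causal protocol described above.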
Command line
python scripts/inference.py \
    --input low_quality.mp4 \
    --output restored.mp4 \
    --checkpoint checkpoints/ \
    --upscale 4 \
    --clip-len 24 \
    --dtype bfloat16
Use --png to write a PNG sequence.
More Visual Results
Full-length restored clips (low-quality input → SwiftVR, played back to back).
Acknowledgements
SwiftVR builds on Wan2.2-TI2V-5B, the lightweight autoencoder TAEHV, and the RealBasicVSR degradation pipeline. We thank the authors of DOVE, SeedVR2, and FlashVSR for releasing strong baselines, and the UltraVideo team for the training corpus.
License
SwiftVR is released under the Apache License 2.0.
Copyright 2026 SwiftVR Authors.
Licensed under the Apache License, Version 2.0. You may obtain a copy of the License at:
https://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, this project is distributed on an "AS IS" BASIS, without warranties or conditions of any kind, either express or implied. See the LICENSE file for the full license text.
Citation
@article{yan2026swiftvr,
title={SwiftVR: Real-Time One-Step Generative Video Restoration},
author={Yan, Jiaqi and Chen, Xiangyu and Zhong, Xinlin and Huang, Haibin and Zhang, Chi and Liu, Jie and Zhou, Jiantao and Li, Xuelong},
journal={arXiv preprint arXiv:2606.09516},
year={2026}
}
Contact
If you have any questions, feel free to reach out:
- Email: kakibluee@gmail.com
