
Hugging Face
GitHub
LinkedIn
irotem98@gmail.com
(+972) 53-432-6592

ControlNet for Diffusion Transformers 🎨

Built a ControlNet-like module for fine-grained text-to-image control, extending ControlNet-XS.
Outperformed Sana’s ControlNet baseline across all metrics.
Injected conditioning with zero-conv layers to preserve pretrained features.
Engineered efficient training with lazy data loading and a reduced memory footprint.
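
The zero-conv idea mentioned above can be illustrated with a small, dependency-free sketch. It is not the project's code: the function names (`conv1x1`, `make_zero_conv`, `inject_control`) and the toy feature values are hypothetical, and a real implementation would use a deep-learning framework's convolution layer. The point it shows is that a 1x1 projection initialised to zero adds nothing to the backbone's features at the start of training, so the pretrained behaviour is preserved until the control branch learns something useful.

```python
def conv1x1(pixels, weight, bias):
    """Apply a 1x1 convolution: an independent linear map on each pixel's channels.

    pixels: list of pixels, each a list of C_in floats
    weight: C_out rows of C_in floats; bias: C_out floats
    """
    return [
        [sum(w * x for w, x in zip(row, pixel)) + b for row, b in zip(weight, bias)]
        for pixel in pixels
    ]


def make_zero_conv(c_in, c_out):
    """Return zero-initialised weights and bias (the 'zero-conv' initialisation)."""
    weight = [[0.0] * c_in for _ in range(c_out)]
    bias = [0.0] * c_out
    return weight, bias


def inject_control(base_features, control_features, weight, bias):
    """Add the projected control signal to the backbone's features."""
    residual = conv1x1(control_features, weight, bias)
    return [
        [b + r for b, r in zip(base_px, res_px)]
        for base_px, res_px in zip(base_features, residual)
    ]


# Toy data (hypothetical): two pixels, 2 backbone channels, 3 control channels.
base = [[0.5, -1.0], [2.0, 0.25]]
control = [[1.0, 3.0, -2.0], [0.0, 4.0, 1.0]]

weight, bias = make_zero_conv(c_in=3, c_out=2)
out = inject_control(base, control, weight, bias)  # equals `base` at initialisation
```

Once training updates `weight` and `bias` away from zero, the same call starts to mix the control signal into the backbone features.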

Developed a VQA pipeline inspired by LLaVA: vision encoder ➜ connector ➜ language model.
Trained in stages: connector first, then LoRA fine-tuning of the LLM.
Benchmarked SigLIP, MobileCLIP, DINOv2, and EfficientSAM as vision encoders for robust visual features.
Added dynamic high-resolution processing via LLaVA-NeXT and the S² wrapper.
Compared Gemma, Qwen, SmolLM, and OpenELM as language models for answer quality.
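
The LoRA stage can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the project's training code: the helper names (`matvec`, `lora_forward`), the tiny matrices, and the `alpha` value are all hypothetical. It shows the standard LoRA formulation, where a frozen weight `W` is augmented by a low-rank product `B A` scaled by `alpha / r`, with `B` initialised to zero so the adapted layer starts out identical to the frozen one.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(x, w_frozen, lora_a, lora_b, alpha):
    """Compute y = W x + (alpha / r) * B (A x).

    w_frozen stays fixed; only lora_a (r x d_in) and lora_b (d_out x r)
    would receive gradient updates during fine-tuning.
    """
    rank = len(lora_a)
    base = matvec(w_frozen, x)
    update = matvec(lora_b, matvec(lora_a, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


# Toy layer (hypothetical values): d_in = 3, d_out = 2, rank r = 1.
w_frozen = [[1.0, 2.0, 0.0], [0.0, -1.0, 3.0]]
lora_a = [[0.1, 0.2, 0.3]]   # typically randomly initialised
lora_b = [[0.0], [0.0]]      # zero-initialised, so the update starts at zero
x = [1.0, 1.0, 1.0]

y_init = lora_forward(x, w_frozen, lora_a, lora_b, alpha=2.0)
```

Because only the two small matrices are trained, the number of trainable parameters grows with the rank `r` rather than with the full size of `W`.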

Frame Tokenizer ➜ Latent Action Model ➜ Dynamics Model pipeline.
EfficientViT + MobileStyleGAN compression for ultra-fast tokenization / decoding.
Replaced Genie’s ST-Transformer with a quantized lightweight MLP.
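
In Genie, the latent action model compresses the change between consecutive frames into one of a small set of discrete actions using vector quantization. Assuming this project keeps that design (the source does not say), the quantization step might look like the sketch below. The function name `quantize_action`, the four-entry codebook, and the latent values are hypothetical and chosen only for illustration.

```python
def quantize_action(latent, codebook):
    """Return (index, code) of the codebook entry nearest to `latent`.

    Nearness is measured by squared Euclidean distance.
    """
    def sq_dist(code):
        return sum((a - b) ** 2 for a, b in zip(latent, code))

    index = min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))
    return index, codebook[index]


# Hypothetical codebook of four 2-D action embeddings.
codebook = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]

# Continuous latent produced by an (assumed) action encoder for a frame pair.
latent = [0.2, 0.9]

action_id, action_code = quantize_action(latent, codebook)
```

The dynamics model would then be conditioned on the discrete `action_id` (or its embedding) together with the current frame tokens to predict the next frame's tokens.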

🏆 First place at Samsung Next MobileXGenAI Hackathon — real-time 30 fps face transformations on mobile (CoreML optimised).
Custom encoders inject facial features at multiple StyleGAN decoder layers for detailed, natural edits.
Combined pixel, perceptual, and adversarial losses for robust and identity-preserving results.
Efficient pipeline (MobileStyleGAN + EfficientFormer + CLIP) enables high-quality transformations fully on-device.
Used both w-latents and F-latents for flexible and realistic facial attribute manipulation.
App is fully edge-compatible: minimal memory footprint, no server-side inference needed.
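
The combination of pixel, perceptual, and adversarial losses can be written as a weighted sum. The sketch below is one common formulation, not necessarily the one used here: the choice of L1 for the pixel term, mean squared error over precomputed feature vectors for the perceptual term, the non-saturating generator loss for the adversarial term, and all three weights are assumptions made for illustration.

```python
import math


def l1_loss(a, b):
    """Mean absolute difference between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def mse_loss(a, b):
    """Mean squared difference between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def generator_adv_loss(disc_logit):
    """Non-saturating generator loss: softplus(-D(G(x)))."""
    return math.log1p(math.exp(-disc_logit))


def total_loss(pred_pixels, target_pixels, pred_feats, target_feats, disc_logit,
               w_pixel=1.0, w_perceptual=0.8, w_adv=0.1):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return (w_pixel * l1_loss(pred_pixels, target_pixels)
            + w_perceptual * mse_loss(pred_feats, target_feats)
            + w_adv * generator_adv_loss(disc_logit))


# Toy inputs (hypothetical): flattened pixels, feature vectors taken from some
# pretrained network, and the discriminator's logit for the generated image.
loss = total_loss(
    pred_pixels=[0.5, 0.7], target_pixels=[0.5, 0.5],
    pred_feats=[1.0, 2.0], target_feats=[1.0, 1.0],
    disc_logit=0.0,
)
```

The pixel and perceptual terms pull the output toward the target image, while the adversarial term rewards outputs the discriminator scores as realistic; the weights trade these goals off against each other.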