
URL: https://huggingface.co/Backup-bdg/Xoron-Dev-MultiMoe

Backup-bdg/Xoron-Dev-MultiMoe · Hugging Face


🚀 Xoron-Dev: State-of-the-Art Multimodal MoE

Training-Stage


Xoron-Dev

✨ Xoron-Dev: The Elite SOTA Omni-Modal Intelligence

Xoron-Dev is an open-source architecture for omni-modal artificial intelligence. Unlike legacy models that attach vision and audio as plugins, Xoron-Dev is designed to perceive images, video, and audio natively and at high fidelity.


🌟 Why Xoron-Dev?

Xoron-Dev aims for a major step forward in multimodal reasoning by pairing a sparse Mixture-of-Experts (MoE) backbone with a dedicated encoder for each sensory modality.

1. 👁️ SOTA Vision (SigLIP-2 & TiTok)

Xoron-Dev uses SigLIP-2 as its only vision encoder, chosen for its strong zero-shot performance and image-text semantic alignment.

  • TiTok 1D VAE: Each image is compressed into a 1D sequence of 256 dense tokens, so Xoron can process high-resolution scenes at a small, fixed token cost.
  • 2D-RoPE: Rotary positional embeddings applied along both image axes preserve spatial relationships regardless of aspect ratio.
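A minimal sketch of 2D rotary position embedding, assuming the common scheme of splitting each vector in half and rotating one half by the patch's row index and the other half by its column index. The card does not say how Xoron-Dev splits dimensions, so the split, the `base` of 10000, and the function names `rope_1d` and `rope_2d` are illustrative assumptions. Because positions are plain integer indices, the same code works for any grid shape, which is what makes the encoding independent of aspect ratio.

```python
import math
from typing import List


def rope_1d(x: List[float], pos: int, base: float = 10000.0) -> List[float]:
    """Rotate consecutive pairs (x[i], x[i+1]) by the angle pos * base**(-i/d)."""
    d = len(x)
    if d % 2:
        raise ValueError("rope_1d needs an even number of dimensions")
    out: List[float] = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # earlier pairs rotate faster
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + 1]
        out.extend([a * c - b * s, a * s + b * c])
    return out


def rope_2d(x: List[float], row: int, col: int, base: float = 10000.0) -> List[float]:
    """Hypothetical 2D-RoPE: the first half encodes the row, the second half the column."""
    if len(x) % 4:
        raise ValueError("rope_2d needs a dimension divisible by 4")
    half = len(x) // 2
    return rope_1d(x[:half], row, base) + rope_1d(x[half:], col, base)
```

Because each pair is rotated rather than shifted, the dot product between a query at (r1, c1) and a key at (r2, c2) depends only on the offsets r2 - r1 and c2 - c1, and vector norms are unchanged.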

2. 🎬 Native Video Intelligence (VidTok)

Our custom VidTok encoder uses 3D volumetric compression, compressing across time as well as height and width, to ingest up to 32 frames of high-definition video natively. Rather than treating a clip as a sequence of independent images, Xoron models motion, causality, and temporal context.
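The benefit of compressing along time as well as space shows up in simple arithmetic. The sketch below assumes hypothetical downsampling factors of 4× in time and 16× in each spatial dimension; the card does not state VidTok's actual factors, and `latent_grid` and `num_positions` are illustrative helpers, not part of any Xoron-Dev API.

```python
from typing import Tuple


def latent_grid(frames: int, height: int, width: int,
                t_factor: int = 4, s_factor: int = 16) -> Tuple[int, int, int]:
    """Shape of the latent grid after 3D (time x height x width) downsampling.

    t_factor and s_factor are hypothetical compression factors.
    """
    for name, size, factor in (("frames", frames, t_factor),
                               ("height", height, s_factor),
                               ("width", width, s_factor)):
        if size % factor:
            raise ValueError(f"{name}={size} is not divisible by {factor}")
    return frames // t_factor, height // s_factor, width // s_factor


def num_positions(grid: Tuple[int, int, int]) -> int:
    """Number of sequence positions the backbone must attend over."""
    t, h, w = grid
    return t * h * w


# 32 frames of 720p video, compressed volumetrically
video = latent_grid(32, 720, 1280)
# Frame-by-frame 2D compression only (no temporal factor), for comparison
per_frame = latent_grid(32, 720, 1280, t_factor=1)
print(video, num_positions(video))
print(per_frame, num_positions(per_frame))
```

Under these assumed factors the 3D grid is (8, 45, 80), or 28,800 positions, against 115,200 for frame-by-frame compression: the temporal factor alone shortens the sequence by 4×.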

3. 🎙️ Raw PCM Audio (Conformer + BigVGAN)

Xoron-Dev processes raw 16 kHz PCM audio directly, with no mel spectrograms and no lossy Fourier transforms.

  • Micro-Latency S2S: True speech-to-speech interaction with under 200 ms latency, for natural, fluid conversation.
  • Zero-Shot Voice Cloning: Clone a voice from a single 5-second sample for high-fidelity, personalized output.
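As a rough illustration of what raw PCM input involves, the sketch below decodes 16-bit PCM bytes into floats and cuts them into fixed-length frames. The signed 16-bit little-endian encoding and the 20 ms frame length are assumptions for the example; the card does not specify how Xoron-Dev's Conformer frontend chunks its input, and `pcm16_to_float` and `frame` are hypothetical helpers.

```python
import struct
from typing import List

SAMPLE_RATE = 16_000  # Hz, the input rate stated for Xoron-Dev


def pcm16_to_float(data: bytes) -> List[float]:
    """Decode signed 16-bit little-endian PCM into floats in [-1.0, 1.0)."""
    if len(data) % 2:
        raise ValueError("16-bit PCM needs an even number of bytes")
    count = len(data) // 2
    return [s / 32768.0 for s in struct.unpack(f"<{count}h", data)]


def frame(samples: List[float], frame_ms: int = 20,
          sample_rate: int = SAMPLE_RATE) -> List[List[float]]:
    """Split samples into non-overlapping frames; a trailing partial frame is dropped."""
    size = sample_rate * frame_ms // 1000  # 320 samples at 16 kHz and 20 ms
    usable = len(samples) - len(samples) % size
    return [samples[i:i + size] for i in range(0, usable, size)]


# A 5-second clip (the voice-cloning sample length) is 80,000 samples of silence here
silence = bytes(2 * 5 * SAMPLE_RATE)
frames = frame(pcm16_to_float(silence))
print(len(frames), len(frames[0]))
```

With these assumptions a 5-second sample becomes 250 frames of 320 samples each.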

🧠 The Brain: Aux-Lossless MoE & 128K Ring Attention

The backbone is a sparse Mixture of Experts (MoE): a router sends each token to a small subset of specialized, hardware-aware expert sub-networks.

๐Ÿ—๏ธ Deep Expert Hierarchy

Unlike standard MoE models, whose experts are uniform, Xoron-Dev uses a Deep Expert system in which experts differ in depth.

  • Expert Pool: 16 experts in total (8 standard + 8 deep).
  • Variable Logical Depth: Deep experts range in internal depth from 2 to 9 layers.
  • Expert Penalty Routing: A soft utilization penalty ($\text{cost} \propto \text{depth}$) discourages the router from choosing deep experts, so deeper computation is invoked only for tokens that need maximum logical precision and inference throughput stays high on simpler tokens.
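One way to read Expert Penalty Routing is as a depth-proportional term subtracted from the router's logits before top-k selection. The sketch below assumes standard experts have depth 1, the eight deep experts have depths 2 through 9, and a hypothetical penalty coefficient of 0.1; the card does not say whether the penalty acts on the logits, as here, or as a training loss, and `route` is an illustrative name.

```python
import math
from typing import List, Sequence, Tuple

# Assumed depths: 8 standard experts (depth 1) + 8 deep experts (depths 2..9)
EXPERT_DEPTHS: List[int] = [1] * 8 + list(range(2, 10))


def route(logits: Sequence[float], depths: Sequence[int] = EXPERT_DEPTHS,
          top_k: int = 2, penalty: float = 0.1) -> List[Tuple[int, float]]:
    """Return (expert_index, gate_weight) pairs for one token.

    Each logit is reduced by penalty * depth, so a deep expert must beat a
    shallow one by a margin before the router selects it.
    """
    if len(logits) != len(depths):
        raise ValueError("need one logit per expert")
    adjusted = [logit - penalty * d for logit, d in zip(logits, depths)]
    chosen = sorted(range(len(adjusted)), key=adjusted.__getitem__, reverse=True)[:top_k]
    # Softmax over the selected experts only, stabilized by subtracting the max
    peak = max(adjusted[i] for i in chosen)
    exps = [math.exp(adjusted[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]
```

For instance, if the raw logits favour the depth-9 expert (index 15, logit 1.5) over standard expert 0 (logit 1.0), `route(logits, top_k=1, penalty=0.0)` picks expert 15, while the default penalty of 0.1 lowers expert 15 to 0.6 and expert 0 to 0.9, tipping the choice to expert 0.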

⚡ Reasoning Acceleration: Fast Ponder

Xoron-Dev includes a dedicated FastPonderBlock for fast, iterative deliberation in latent space before the next token is decoded.

  • Attention-Free Reasoning: Thought loops bypass the $O(N^2)$ self-attention stack (where $N$ is the sequence length), so the Depth-3 reasoning block runs at more than 120 thoughts per second.
  • Dynamic Halting: A learned halt_head monitors the entropy of the latent state. Once the model reaches a decision (entropy below a threshold of 0.2), it exits the ponder loop and returns to token decoding, cutting unnecessary FLOPs by up to 90%.
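The halting rule can be sketched as a loop that keeps refining a latent state until the entropy of the halt head's output drops below 0.2. Here `step_fn` and `halt_head` are hypothetical stand-ins for the FastPonderBlock and the learned halt_head, entropy is measured in nats (the card does not give the unit), and the `max_steps` cap is an added assumption so the loop always terminates.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]


def softmax(logits: Vector) -> Vector:
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Vector) -> float:
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def ponder(state: Vector,
           step_fn: Callable[[Vector], Vector],
           halt_head: Callable[[Vector], Vector],
           threshold: float = 0.2,
           max_steps: int = 16) -> Tuple[Vector, int, float]:
    """Refine `state` until the halt head is confident or max_steps is reached."""
    steps = 0
    h = entropy(halt_head(state))
    while h >= threshold and steps < max_steps:
        state = step_fn(state)  # one latent "thought"; no attention involved
        steps += 1
        h = entropy(halt_head(state))
    return state, steps, h


def sharpen(state: Vector) -> Vector:
    """Toy step function: scaling logits makes the decision more peaked."""
    return [1.5 * x for x in state]


final, steps, h = ponder([2.0, 0.5, 0.1], sharpen, softmax)
print(steps, h)
```

With the toy `sharpen` step, each iteration makes the distribution more peaked, and the loop exits after three steps once entropy falls below the threshold; easy inputs that start confident would exit after zero steps.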

🔘 Long Context (128K)

Using Ring Attention, Xoron-Dev supports a native 128K-token context window, enough to analyze books, long videos, or large codebases.
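Ring Attention splits a long sequence into blocks held by different devices: each device keeps its query block while key/value blocks are passed around the ring, and partial results are merged with a numerically stable online softmax. The sketch below simulates the devices with a plain loop in pure Python (single head, no causal mask), which is enough to check that the blockwise result matches ordinary attention. It illustrates the general technique, not Xoron-Dev's implementation, and the helper names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def _dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def full_attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Reference softmax(q k^T / sqrt(d)) v over the whole sequence."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [_dot(qi, kj) * scale for kj in k]
        peak = max(scores)
        w = [math.exp(s - peak) for s in scores]
        z = sum(w)
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v)) / z for c in range(len(v[0]))])
    return out


def ring_attention(q_blocks: List[Matrix], k_blocks: List[Matrix],
                   v_blocks: List[Matrix]) -> List[Matrix]:
    """Simulate n devices; device i owns q_blocks[i] and starts with k/v block i."""
    n = len(q_blocks)
    d_v = len(v_blocks[0][0])
    scale = 1.0 / math.sqrt(len(q_blocks[0][0]))
    # Per query row: running max, running softmax denominator, running weighted sum
    state = [[(-math.inf, 0.0, [0.0] * d_v) for _ in qb] for qb in q_blocks]
    for step in range(n):
        for dev in range(n):
            src = (dev - step) % n  # k/v block this device holds after `step` hops
            for i, qi in enumerate(q_blocks[dev]):
                m, denom, acc = state[dev][i]
                for kj, vj in zip(k_blocks[src], v_blocks[src]):
                    s = _dot(qi, kj) * scale
                    m_new = max(m, s)
                    corr = math.exp(m - m_new)  # rescale earlier partial sums
                    p = math.exp(s - m_new)
                    denom = denom * corr + p
                    acc = [a * corr + p * x for a, x in zip(acc, vj)]
                    m = m_new
                state[dev][i] = (m, denom, acc)
    return [[[a / denom for a in acc] for _, denom, acc in rows] for rows in state]
```

Each device only ever holds one key/value block at a time, so memory per device scales with the block size rather than the full context length. Production implementations also overlap each hop's communication with computation on the current block.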


🚀 Get Started with Xorfice

The easiest way to run Xoron-Dev is with the xorfice engine, an orchestrator for multimodal deployment.

Installation

pip install xorfice

High-Fidelity Interaction

from xorfice import XoronEngine

# The engine handles loading the weights and applying optimizations automatically
# model_path is the Hugging Face repo ID: Backup-bdg/Xoron-Dev-MultiMoe
engine = XoronEngine(model_path="Backup-bdg/Xoron-Dev-MultiMoe")

# Start an omni-modal conversation with an image and a video as context
response = engine.generate(
    prompt="Who is this person and what are they doing?",
    images="https://example.com/interview.jpg",
    videos="https://example.com/interview.mp4",
)
print(response["text"])

📈 Key Features

| Feature | Xoron-Dev |
|---|---|
| Vision backbone | SigLIP-2 |
| Video compression | VidTok 3D |
| Audio ingestion | Raw 16 kHz PCM |
| Inference efficiency | Sparse MoE (5B parameters) |
| Context window | 128K tokens (Ring Attention) |

🎨 Creative Generation

Through its integration with MobileDiffusion, Xoron-Dev can generate content as well as understand it:

  • Text-to-Video (T2V)
  • Image-to-Video (I2V)
  • Text-to-Image (T2I)
  • Image-to-Image (I2I)
  • Video-to-Video (V2V)

Join the Revolution

Xoron-Dev is more than a model: it is a vision for the future of AI. Build your own multimodal agent today.

Powered by the Xoron-Dev Team

Model size: 5B params · Tensor type: F16 · Format: Safetensors