URL: https://huggingface.co/ICTNLP/stream-omni-8b

ICTNLP/stream-omni-8b · Hugging Face


Stream-Omni: Simultaneous Multimodal Interactions with Large Language-Vision-Speech Model

Shaolei Zhang, Shoutao Guo, Qingkai Fang, Yan Zhou, Yang Feng*

For an introduction to Stream-Omni and usage instructions, see https://github.com/ictnlp/Stream-Omni.

Stream-Omni is an end-to-end language-vision-speech chatbot that simultaneously supports interaction across various modality combinations, with the following features 💡:

  • Omni Interaction: Supports any combination of text, vision, and speech inputs, and generates both text and speech responses.
  • Seamless "see-while-hear" Experience: Outputs intermediate textual results (e.g., ASR transcriptions and model responses) simultaneously during speech interactions, similar to the advanced voice service of GPT-4o.
  • Efficient Training: Requires only a small amount of omni-modal data for training.

🖥 Demo

Stream-Omni can produce intermediate textual results (ASR transcription and text response) during speech interaction, offering users a seamless "see-while-hear" experience.
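
To make the "see-while-hear" idea concrete, here is a minimal sketch of how a client might consume an interleaved stream of ASR fragments, response text, and speech chunks. It does not use Stream-Omni's actual interface: `StreamEvent`, `fake_speech_interaction`, `consume`, and the event labels `"asr"`, `"text"`, and `"speech"` are all hypothetical names invented for illustration, and the events are hard-coded toy data rather than model output.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Tuple


@dataclass
class StreamEvent:
    """One item in a hypothetical interleaved output stream."""
    kind: str     # assumed labels: "asr", "text", or "speech"
    payload: str  # text fragment, or a placeholder name for an audio chunk


def fake_speech_interaction() -> Iterator[StreamEvent]:
    """Stand-in for a model stream: hard-coded toy events, not real output."""
    yield StreamEvent("asr", "What is")
    yield StreamEvent("asr", " in the image?")
    yield StreamEvent("text", "A cat")
    yield StreamEvent("speech", "<audio chunk 0>")
    yield StreamEvent("text", " on a sofa.")
    yield StreamEvent("speech", "<audio chunk 1>")


def consume(events: Iterable[StreamEvent]) -> Tuple[str, str, List[str]]:
    """Route each event as it arrives so text can be shown while audio plays."""
    transcript: List[str] = []
    response: List[str] = []
    audio: List[str] = []
    for event in events:
        if event.kind == "asr":
            transcript.append(event.payload)   # show the user's words so far
        elif event.kind == "text":
            response.append(event.payload)     # show the reply text so far
        elif event.kind == "speech":
            audio.append(event.payload)        # hand the chunk to a player
        else:
            raise ValueError(f"unknown event kind: {event.kind!r}")
    return "".join(transcript), "".join(response), audio


if __name__ == "__main__":
    transcript, response, audio = consume(fake_speech_interaction())
    print("ASR:", transcript)
    print("Response:", response)
    print("Audio chunks:", len(audio))
```

The point of the sketch is only that text fragments become available before the full spoken response has finished; the real input and output formats are documented in the GitHub repository linked above.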

Model size: 13B params (Safetensors)
Tensor type: F16
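
As a rough back-of-the-envelope check, the listed parameter count and tensor type give an estimate of the storage needed for the weights alone. The sketch below assumes exactly 13e9 parameters (the page's figure is rounded) and 2 bytes per F16 value; it ignores activations, caches, and any runtime overhead, and `weight_bytes` is a helper name introduced here for illustration.

```python
def weight_bytes(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate size of the raw weights; F16 stores 2 bytes per value."""
    return num_params * bytes_per_param


total = weight_bytes(13e9)  # assumes the rounded "13B params" figure
print(f"{total / 1e9:.1f} GB")     # 26.0 GB
print(f"{total / 2**30:.1f} GiB")  # 24.2 GiB
```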