Native multimodal output
A single prompt produces matching text, keyframes and video, with consistent characters, style and lighting carrying across formats.
Surfacing in early May 2026 across multiple leaks, Gemini Omni is Google's upcoming unified multimodal model: native generation of text, image, video and synced audio inside a single Gemini-trained system.
Unlike specialised video models such as Veo, Sora 2, Seedance 2.0 or Kling, Gemini Omni keeps language reasoning, image generation, video generation and audio synthesis under one architecture.
No more chaining of specialised models. Text, image, video and audio share the same weights and the same long context.
Ambient sound, score and dialogue are aligned with the picture in the same forward pass — footsteps land on the beat, lips match speech on first export.
Swap an object, change the lighting, adjust a camera move in natural language — no full regeneration, echoing the Nano Banana editing playbook.
Upload an existing clip and redirect it with prompts. Reference images, videos and audio can be combined in a single instruction.
Built-in templates for product ads, Reels, music videos and cinematic shorts lower the floor for first-time users while keeping camera language consistent.
Numbers below are aggregated from Reddit/X leaks and reporting by TestingCatalog, Programming Insider and OfficeChai.
| Dimension | Known signal |
|---|---|
| Model family | Google Gemini — successor branding for the Veo line |
| Model ID | bard_eac_video_generation_omni / v3smm-lora-prod |
| Clip length | 5 / 8 / 10 seconds per generation, chainable in-app |
| Resolution | 480p / 720p / 1080p |
| Aspect ratios | 16:9, 9:16, 1:1 |
| Audio | Natively synthesized, synced in a single pass |
| Inputs | Text / image / video / audio references |
| Access | Staging inside Gemini app, API expected post I/O |
| Quota signal | Reports say two Omni generations burn ~86% of an AI Pro daily quota |
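The leaked limits in the table can be read as a small set of constraints. The sketch below encodes them in Python and checks a request against them. Omni has no public API yet, so `ClipRequest`, its field names and the defaults are hypothetical, and the frame-size helper assumes that "480p / 720p / 1080p" refers to the shorter side of the frame.

```python
from dataclasses import dataclass
from typing import Tuple

# Values copied from the leaked spec table; none are confirmed by Google.
ALLOWED_SECONDS = (5, 8, 10)
ALLOWED_SHORT_SIDES = (480, 720, 1080)  # "480p / 720p / 1080p"
ALLOWED_RATIOS = {"16:9": (16, 9), "9:16": (9, 16), "1:1": (1, 1)}


@dataclass(frozen=True)
class ClipRequest:
    """Hypothetical request shape for illustration; not a real Gemini API object."""
    prompt: str
    seconds: int = 8
    short_side: int = 720
    aspect_ratio: str = "16:9"

    def validate(self) -> None:
        """Raise ValueError if any field falls outside the leaked limits."""
        if self.seconds not in ALLOWED_SECONDS:
            raise ValueError(f"seconds must be one of {ALLOWED_SECONDS}")
        if self.short_side not in ALLOWED_SHORT_SIDES:
            raise ValueError(f"short_side must be one of {ALLOWED_SHORT_SIDES}")
        if self.aspect_ratio not in ALLOWED_RATIOS:
            raise ValueError(f"aspect_ratio must be one of {sorted(ALLOWED_RATIOS)}")

    def frame_size(self) -> Tuple[int, int]:
        """Return (width, height), assuming the 'p' figure is the shorter side."""
        w, h = ALLOWED_RATIOS[self.aspect_ratio]
        scale = self.short_side / min(w, h)
        return round(w * scale), round(h * scale)


request = ClipRequest(prompt="product reveal", seconds=10,
                      short_side=1080, aspect_ratio="9:16")
request.validate()
print(request.frame_size())  # (1080, 1920)
```

A frozen dataclass keeps the request immutable, so a validated request cannot drift out of range afterwards.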
Google's generative stack used to be split across Veo for video, Nano Banana / Imagen for image and Gemini for text. Omni rolls those into a single architecture.
| Stage | Model | Role |
|---|---|---|
| Before | Veo 3.1 | Video + native audio |
| Before | Nano Banana / Imagen | Image generation & editing |
| Before | Gemini 2.5 / 3.x | Reasoning · long context |
| Now · Omni | Gemini Omni | Text · image · video · audio, one model, one prompt |
A unified model with long context and synced audio means teams can write one coherent brief and walk away with a finished cut.
Hero shots, packaging reveals and lifestyle cuts shipped with ambient audio already locked.
Vertical 9:16 clips with on-mic dialogue and beat-synced motion, built for scroll-stopping social.
Reference a track and Omni cuts visuals to the beat, keeping a consistent character across shots.
Chain multiple 10-second Omni clips into multi-shot sequences with continuous lighting and a continuous audio bed.
Loopable 16:9 atmospheric clips for SaaS, fashion and DTC sites — branded and silent-friendly.
Turn a script into a narrated sequence with lip-synced dialogue and matching ambient sound.
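For the chained use cases above, it helps to know how many generations a sequence of a given length would need. The sketch below is planning arithmetic only: it assumes the leaked 5 / 8 / 10 second clip lengths hold and that clips are simply played back to back. `plan_chain` is a hypothetical helper, not part of any Google tool.

```python
from typing import List, Optional

CLIP_LENGTHS = (5, 8, 10)  # leaked per-generation lengths in seconds, unconfirmed


def plan_chain(target_seconds: int) -> List[int]:
    """Pick clips that cover target_seconds with the least overshoot,
    using the fewest clips for that total."""
    if target_seconds <= 0:
        return []
    # Every multiple of 5 is reachable, so a valid total always exists in this range.
    limit = target_seconds + max(CLIP_LENGTHS)
    # best[t] holds the shortest list of clips summing exactly to t, or None.
    best: List[Optional[List[int]]] = [None] * (limit + 1)
    best[0] = []
    for total in range(1, limit + 1):
        for length in CLIP_LENGTHS:
            prev = total - length
            if prev >= 0 and best[prev] is not None:
                candidate = best[prev] + [length]
                if best[total] is None or len(candidate) < len(best[total]):
                    best[total] = candidate
    # Take the smallest reachable total that is at least the target.
    for total in range(target_seconds, limit + 1):
        if best[total] is not None:
            return sorted(best[total], reverse=True)
    return []


print(plan_chain(30))  # [10, 10, 10]
print(plan_chain(23))  # [10, 8, 5]
print(plan_chain(12))  # [8, 5]
```

A 12-second target cannot be hit exactly with these lengths, so the planner settles on 13 seconds and leaves the trim to the edit.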
Aggregated from Artificial Analysis, Looksy AI, Oimi AI and the official keynotes — for orientation, not benchmark scores.
| Model | Maker | Architecture | Native audio | Clip length |
|---|---|---|---|---|
| Gemini Omni | Google | Unified omni (video + image + audio) | Synced in one pass | 5 / 8 / 10s |
| Veo 3.1 | Google | Specialised video model | Yes | ~8s |
| Seedance 2.0 | ByteDance | Specialised multi-modal video | Yes | up to 15s / shot |
| Sora 2 | OpenAI | Specialised video model | Yes | ~20s |
| Kling V3.0 | Kuaishou | Specialised video model | Limited | ~10s |
Ordered by public report date, still evolving.
X user @Thomas16937378 spotted "Start with an idea or try a template. Powered by Omni." inside the Gemini video tab.
TestingCatalog and Chetaslua surfaced the "Meet our new video model" card, the full model ID and the 10-second clip cap.
A "professor solving trig on a chalkboard" clip showcased text coherence and physical fidelity, sparking heavy comparison with Veo 3.1.
Main-stage time at Google I/O 2026 is widely expected for Omni, possibly alongside Flash / Pro tiering, an API, and reshuffled subscription tiers.
Gemini Omni is Google's upcoming unified multimodal model that natively generates text, image, video and synced audio inside one architecture, effectively merging Veo, Imagen and Gemini.
As of mid-May 2026 Omni is still in the leak phase. The widely expected reveal is the Google I/O 2026 main stage on May 19–20.
Metadata suggests Omni inherits engineering from the Veo stack, but it drops the Veo brand and folds video into Gemini's text and image layers.
Yes, audio is native. Ambient sound, score and dialogue are produced in the same pass as the video, which is the whole reason for the 'omni' name.
The leaked model ID points to 5, 8 or 10 seconds per generation, with multi-clip chaining at the client layer.
Quota and tiering are unconfirmed. A Reddit screenshot shows two Omni generations burning ~86% of the AI Pro daily quota, so a higher 'Ultra / Pro Plus' tier is plausible.
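The quota signal can be turned into a rough per-generation cost. The sketch below assumes the reported ~86% figure is accurate and that every generation costs the same share of the quota, neither of which is confirmed. `quota_estimate` is a hypothetical helper written for this calculation.

```python
import math
from typing import Tuple


def quota_estimate(fraction_used: float, generations: int) -> Tuple[float, int, float]:
    """Return (share per generation, whole generations per day, leftover share).

    Assumes a flat cost per generation, which the leak does not confirm.
    """
    per_generation = fraction_used / generations
    # The small epsilon guards against float error when the division is exact.
    max_per_day = math.floor(1 / per_generation + 1e-9)
    leftover = 1 - max_per_day * per_generation
    return per_generation, max_per_day, leftover


per_generation, max_per_day, leftover = quota_estimate(0.86, 2)
print(f"{per_generation:.0%} per generation")  # 43% per generation
print(f"{max_per_day} generations per day")    # 2 generations per day
print(f"{leftover:.0%} left over")             # 14% left over
```

Under these assumptions a third generation would not fit in the same day, which is consistent with the speculation about a higher tier.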
Everything on this page is aggregated from the public sources below. Cross-reading is recommended.
Leak details, UI strings and early demo analysis.
Speculation on architecture and side-by-side with Seedance / Veo.
Full model ID, in-app prompts and community reactions.
Tidy summary of specs, use cases and comparisons.
Family-level multimodality, long context and the agentic direction.