Gemini Omni
Leaking · Google I/O 2026

One model for text, image, video and audio

Surfacing in early May 2026 across multiple leaks, Gemini Omni is Google's upcoming unified multimodal model: native generation of text, image, video and synced audio inside a single Gemini-trained system.

Unified model · Synced audio · In-chat editing

Quick stats

Clip length: 5–10s
Max output: 1080p
Aspect ratios: 16:9 · 9:16 · 1:1
Expected reveal: I/O 2026

Capabilities

The whole pipeline collapses into one model

Unlike specialised video models such as Veo, Sora 2, Seedance 2.0 or Kling, Gemini Omni keeps language reasoning, image generation, video generation and audio synthesis under one architecture.

Native multimodal output

A single prompt produces matching text, keyframes and video, with consistent characters, style and lighting carrying across formats.

One unified Gemini stack

No more chaining of specialised models. Text, image, video and audio share the same weights and the same long context.

Synced native audio

Ambient sound, score and dialogue are aligned with the picture in the same forward pass — footsteps land on the beat, lips match speech on first export.

Direct in-chat editing

Swap an object, change the lighting, adjust a camera move in natural language — no full regeneration, echoing the Nano Banana editing playbook.

Remix and steer

Upload an existing clip and redirect it with prompts. Reference images, videos and audio can be combined in a single instruction.

Templates & styles

Built-in templates for product ads, Reels, music videos and cinematic shorts lower the floor for first-time users while keeping camera language consistent.

Specs

What can be pieced together before the keynote

Numbers below are aggregated from Reddit/X leaks and reporting by TestingCatalog, Programming Insider and OfficeChai.

Dimension | Known signal
Model family | Google Gemini (successor branding for the Veo line)
Model ID | bard_eac_video_generation_omni / v3smm-lora-prod
Clip length | 5 / 8 / 10 seconds per generation, chainable in-app
Resolution | 480p / 720p / 1080p
Aspect ratios | 16:9, 9:16, 1:1
Audio | Natively synthesized, synced in a single pass
Inputs | Text / image / video / audio references
Access | Staging inside Gemini app, API expected post I/O
Quota signal | Reports say two Omni generations burn ~86% of an AI Pro daily quota
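The leaked limits in the table above can be read as a small set of constraints on a single generation. The sketch below encodes them in Python purely for illustration. No public Omni API has been announced, so `OmniClipRequest`, its field names and its defaults are hypothetical, and the allowed values are just the ones reported in the leaks.

```python
from __future__ import annotations

from dataclasses import dataclass

# Values copied from the leak table above; none are confirmed by Google.
LEAKED_CLIP_SECONDS = (5, 8, 10)
LEAKED_RESOLUTIONS = ("480p", "720p", "1080p")
LEAKED_ASPECT_RATIOS = ("16:9", "9:16", "1:1")


@dataclass(frozen=True)
class OmniClipRequest:
    """Hypothetical request shape for illustration; not a real Google API type."""

    prompt: str
    seconds: int = 8
    resolution: str = "720p"
    aspect_ratio: str = "16:9"

    def problems(self) -> list[str]:
        """List every field that falls outside the leaked limits."""
        issues = []
        if self.seconds not in LEAKED_CLIP_SECONDS:
            issues.append(f"seconds={self.seconds} is not one of {LEAKED_CLIP_SECONDS}")
        if self.resolution not in LEAKED_RESOLUTIONS:
            issues.append(f"resolution={self.resolution} is not one of {LEAKED_RESOLUTIONS}")
        if self.aspect_ratio not in LEAKED_ASPECT_RATIOS:
            issues.append(f"aspect_ratio={self.aspect_ratio} is not one of {LEAKED_ASPECT_RATIOS}")
        return issues


reel = OmniClipRequest("sneaker reveal", seconds=10, resolution="1080p", aspect_ratio="9:16")
print(reel.problems())  # []
```

A request for a 12-second or 4K clip would come back with a non-empty list, which is a convenient way to keep storyboards within what the leaks suggest is possible until official limits are published.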

Architecture

Three product lines collapse into one Omni

Google's generative stack used to be split across Veo for video, Nano Banana / Imagen for image and Gemini for text. Omni rolls those into a single architecture.

Before

Veo 3.1

Video + native audio

Nano Banana / Imagen

Image generation & editing

Gemini 2.5 / 3.x

Reasoning · long context

Now · Omni

Gemini Omni

Text · image · video · audio, one model, one prompt

Use cases

From a single brief to publishable content

A unified model with long context and synced audio means teams can write one coherent brief and walk away with a finished cut.

01

Product ads

Hero shots, packaging reveals and lifestyle cuts shipped with ambient audio already locked.

02

Reels & Shorts

Vertical 9:16 clips with on-mic dialogue and beat-synced motion, built for scroll-stopping social.

03

Music videos

Reference a track and Omni cuts visuals to the beat, keeping a consistent character across shots.

04

Cinematic shorts

Chain multiple 10-second Omni clips into multi-shot sequences with continuous lighting and a continuous audio bed.

05

Landing-page hero loops

Loopable 16:9 atmospheric clips for SaaS, fashion and DTC sites — branded and silent-friendly.

06

Explainers & tutorials

Turn a script into a narrated sequence with lip-synced dialogue and matching ambient sound.
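For longer pieces such as cinematic shorts or explainers, the reported 5 / 8 / 10-second clip lengths mean a sequence has to be planned as a chain of generations. The following is a minimal sketch of that arithmetic, assuming the leaked lengths are the only options. `plan_chain` is a hypothetical helper, not part of any Google tool: it uses as few generations as possible, then picks the shortest final clip that still covers the remainder.

```python
from __future__ import annotations

# Clip lengths reported in the leaks; unconfirmed.
LEAKED_CLIP_SECONDS = (5, 8, 10)


def plan_chain(target_seconds: int) -> list[int]:
    """Split a target runtime into leaked clip lengths that cover it."""
    if target_seconds <= 0:
        raise ValueError("target_seconds must be positive")

    longest = max(LEAKED_CLIP_SECONDS)
    plan = []
    remaining = target_seconds

    # Fill with the longest clip while more than one clip is still needed.
    while remaining > longest:
        plan.append(longest)
        remaining -= longest

    # Close with the shortest clip that still covers what is left.
    plan.append(min(s for s in LEAKED_CLIP_SECONDS if s >= remaining))
    return plan


print(plan_chain(23))  # [10, 10, 5]
print(plan_chain(18))  # [10, 8]
```

The plan can overshoot the target by a few seconds (23 seconds becomes 25), so the final cut would still need trimming in an editor.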

Compare

Where Omni sits in the 2026 video stack

Aggregated from Artificial Analysis, Looksy AI, Oimi AI and the official keynotes — for orientation, not benchmark scores.

Model | Maker | Architecture | Native audio | Clip length
Gemini Omni | Google | Unified omni (video + image + audio) | Synced in one pass | 5 / 8 / 10s
Veo 3.1 | Google | Specialised video model | Yes | ~8s
Seedance 2.0 | ByteDance | Specialised multi-modal video | Yes | up to 15s / shot
Sora 2 | OpenAI | Specialised video model | Yes | ~20s
Kling V3.0 | Kuaishou | Specialised video model | Limited | ~10s

Timeline

From the first leak to the I/O 2026 stage

Ordered by public report date, still evolving.

  1. 2026 · 05 · 02

    First "Powered by Omni" string

    X user @Thomas16937378 spotted "Start with an idea or try a template. Powered by Omni." inside the Gemini video tab.

  2. 2026 · 05 · 11

    Full preview card inside Gemini mobile

    TestingCatalog and Chetaslua surfaced the "Meet our new video model" card, the full model ID and the 10-second clip cap.

  3. 2026 · 05 · 12 – 18

    Demos circulate in the wild

    A "professor solving trig on a chalkboard" clip showcased text coherence and physical fidelity, sparking heavy comparison with Veo 3.1.

  4. 2026 · 05 · 19 – 20

    Expected unveil at Google I/O 2026

    Main-stage time is widely expected for Omni, possibly alongside Flash / Pro tiering, an API, and reshuffled subscription tiers.

FAQ

The questions people ask most about Gemini Omni

What exactly is Gemini Omni?

It's Google's upcoming unified multimodal model that natively generates text, image, video and synced audio inside one architecture — effectively merging Veo, Imagen and Gemini.

When will it ship?

As of mid-May 2026 Omni is still in the leak phase. The widely expected reveal is the Google I/O 2026 main stage on May 19–20.

How does it relate to Veo 3.1?

Metadata suggests Omni inherits engineering from the Veo stack, but it drops the Veo brand and folds video into Gemini's text and image layers.

Does it really generate sound?

According to the leaks, yes. Ambient sound, score and dialogue are reportedly produced in the same pass as the video, which fits the 'omni' name.

What is the current clip-length limit?

The leaks point to 5, 8 or 10 seconds per generation, with multi-clip chaining handled in the app at the client layer.

How will pricing work?

Unconfirmed. A Reddit screenshot shows two Omni generations burning ~86% of the AI Pro daily quota, so a higher 'Ultra / Pro Plus' tier is plausible.
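To put the quota screenshot in perspective, here is the back-of-the-envelope arithmetic as a short Python sketch. It assumes every generation costs the same share of the quota and that cost scales linearly, neither of which is confirmed. `generations_per_day` is a hypothetical helper name.

```python
from __future__ import annotations


def generations_per_day(observed_generations: int, observed_quota_pct: float) -> tuple[float, int, float]:
    """Estimate per-generation cost, full generations per day, and leftover quota (in %)."""
    per_generation = observed_quota_pct / observed_generations
    full = int(100 // per_generation)
    leftover = 100 - full * per_generation
    return per_generation, full, leftover


# Reported signal: two generations used about 86% of the AI Pro daily quota.
print(generations_per_day(2, 86))  # (43.0, 2, 14.0)
```

Under those assumptions each generation costs roughly 43% of the daily allowance, so an AI Pro subscriber would get two clips a day with about 14% left over, not enough for a third.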