Gemini Omni

How to Prompt Gemini Omni: A Practical Guide to Multimodal AI Video Prompts

A working prompt framework for Google's leaked Gemini Omni model. Persona, task, format, context — plus camera, audio and reference assets — all in one brief.

Gemini Omni · Prompt Engineering · AI Video · Best Practices

Why prompting Omni is different

Most AI video prompts written in 2024–2025 were designed for specialised, short-context video models. You wrote one sentence, picked a style preset and hit generate. With Gemini Omni, Google’s leaked unified multimodal model, the prompt has to do much more work: a single prompt is expected to steer text, image, video and synchronised audio inside a long context window.

That shifts the prompt from “describe the scene” to “describe the entire deliverable.” This guide is a working framework for getting the most out of Omni once it lands, with techniques borrowed from Google’s official prompting guidance and the leaked Omni preview cards.

The four-part framework: Persona · Task · Format · Context

Google’s broader prompting playbook for the Gemini family recommends four building blocks:

  1. Persona — the expertise you want the model to draw from (“act as a cinematographer”, “as a brand designer”, “as a documentary editor”).
  2. Task — what you want produced (“a 10-second hero shot of the new headphones”, “a 9:16 product reveal”).
  3. Format — the structural constraints (“16:9, 1080p, slow tracking shot, golden-hour lighting”).
  4. Context — the brand, audience and reference material the model should pull from.

For Omni, this maps directly onto a clean brief structure:

You are [PERSONA].
Generate [TASK].
Format: [aspect ratio, duration, resolution, camera language, lighting].
Context: [brand voice, audience, references, audio cues].

A real example:

You are a luxury cinematographer in the vein of Wong Kar-wai. Generate a 10-second hero shot of a matte-black wireless headphone resting on a textured concrete plinth. Format: 16:9, 1080p, slow 35mm tracking shot from camera-left to camera-right, soft golden-hour back-lighting, shallow depth of field. Context: brand is minimalist Scandinavian premium audio. Audio: low atmospheric drone with a single subtle bell strike at 0:07 when the camera passes the brand mark. Reference image: see attached product photo for exact colour and stitching.
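The brief structure above can be kept as a small reusable template. The following is a minimal sketch in Python; the `Brief` class and its field names are hypothetical helpers for assembling prompt text, not part of any Omni or Gemini API.

```python
from dataclasses import dataclass


@dataclass
class Brief:
    """Hypothetical container for the four-part brief (not an Omni API type)."""
    persona: str
    task: str
    fmt: str
    context: str

    def render(self) -> str:
        # Refuse to render if any part is blank, since each part
        # carries a different kind of instruction.
        for name, value in vars(self).items():
            if not value.strip():
                raise ValueError(f"missing brief part: {name}")
        return "\n".join([
            f"You are {self.persona}.",
            f"Generate {self.task}.",
            f"Format: {self.fmt}.",
            f"Context: {self.context}.",
        ])


brief = Brief(
    persona="a luxury cinematographer",
    task="a 10-second hero shot of a matte-black wireless headphone",
    fmt="16:9, 1080p, slow tracking shot, golden-hour back-lighting",
    context="minimalist Scandinavian premium audio brand",
)
prompt = brief.render()
print(prompt)
```

Keeping the four parts as separate fields makes it easy to vary one of them (for example the format) between iterations while the rest of the brief stays word-for-word identical.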

The three Cs: Concise, Clear, Consistent

Google’s own prompting reference guide stresses three principles that translate cleanly to Omni:

  • Concise. Long does not equal good. Strip filler words. Keep one main subject and one main action per prompt.
  • Clear. Avoid ambiguous descriptors like “make it better” or “more cinematic.” Replace with concrete instructions: “increase depth of field”, “warmer colour temperature”, “slower camera move at 0.5x speed.”
  • Consistent. Use the same vocabulary for the same concepts across iterations. If you call it a “tracking shot” once, do not switch to “dolly move” later, because the model may treat those as different signals.
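The three checks can be roughed out as a small prompt linter. This is a sketch only: the phrase list, the synonym group and the word budget are illustrative assumptions drawn from the examples above, and `lint_prompt` is a hypothetical helper rather than anything Google ships.

```python
import re

# Illustrative lists only; extend them with your own house vocabulary.
VAGUE_PHRASES = ["make it better", "more cinematic"]
SYNONYM_GROUPS = [{"tracking shot", "dolly move"}]
MAX_WORDS = 120  # arbitrary budget for a single-shot prompt (assumption)


def lint_prompt(prompt: str, max_words: int = MAX_WORDS) -> list[str]:
    """Return warnings for prompts that break the three Cs."""
    text = prompt.lower()
    warnings = []

    # Concise: flag prompts that exceed the word budget.
    word_count = len(re.findall(r"\S+", text))
    if word_count > max_words:
        warnings.append(f"concise: {word_count} words exceeds {max_words}")

    # Clear: flag known vague descriptors.
    for phrase in VAGUE_PHRASES:
        if phrase in text:
            warnings.append(f"clear: vague phrase '{phrase}'")

    # Consistent: flag two names for the same concept in one prompt.
    for group in SYNONYM_GROUPS:
        used = sorted(term for term in group if term in text)
        if len(used) > 1:
            warnings.append("consistent: mixed terms " + ", ".join(used))

    return warnings


print(lint_prompt("Make it better with a tracking shot, then a dolly move."))
```

Running the same linter over every iteration of a brief, not just the first, is what catches vocabulary drift between versions.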

Lean into long-context, layered prompts

Unlike short-context video models, Omni is expected to inherit Gemini’s long context window. That means you can, and should, write layered, descriptive prompts. A productive brief covers:

  • Subject: who or what is in frame, including identity-locking references.
  • Mood: emotional register and pacing.
  • Camera: lens, movement, framing changes within the clip.
  • Lighting: source, direction, colour temperature, contrast.
  • Dialogue: any spoken lines, with lip-sync timing if relevant.
  • Sound design: ambient bed, music genre, key sound cues with timecodes.
  • Brand or stylistic context: references to existing work or visual language.

You are essentially writing a single-page treatment, not a sentence. Omni’s long context is built for this.
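One way to keep a treatment complete is to hold the layers in a fixed order and report the ones left blank. The sketch below assumes the seven layer names listed above; `render_treatment` is a hypothetical helper that only builds text and does not call any model.

```python
# Layer names follow the checklist above; the order is a convention, not a requirement.
LAYERS = [
    "Subject",
    "Mood",
    "Camera",
    "Lighting",
    "Dialogue",
    "Sound design",
    "Brand context",
]


def render_treatment(layers: dict[str, str]) -> tuple[str, list[str]]:
    """Return the treatment text plus the names of any layers left empty."""
    lines = []
    missing = []
    for name in LAYERS:
        value = layers.get(name, "").strip()
        if value:
            lines.append(f"{name}: {value}")
        else:
            missing.append(name)
    return "\n".join(lines), missing


text, missing = render_treatment({
    "Subject": "matte-black headphones on a concrete plinth",
    "Camera": "slow tracking shot",
    "Dialogue": "  ",  # whitespace only, so it counts as missing
})
print(text)
print("Missing layers:", missing)
```

A missing layer is not always a mistake (a product shot may have no dialogue), so the function reports gaps instead of raising an error.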

Use reference assets aggressively

The leaked Omni feature list explicitly highlights reference inputs: images, video clips and audio tracks can all be combined in a single instruction. Concrete uses:

  • Character lock: attach a reference image of the protagonist to keep them consistent across multiple omni-clips.
  • Style lock: attach a frame from an existing piece of work to anchor colour grade and composition.
  • Motion lock: attach a short reference video to mimic a camera move or character action.
  • Beat lock: attach a music track and ask Omni to cut visuals to the beat (especially useful for Reels and music videos).

Reference assets carry detail that is hard to put into words: exact colours, proportions, textures and motion. A 30-word prompt with three reference images will often outperform a 300-word prompt with no references.
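Since Omni's actual attachment interface is not public, the sketch below only covers the text side: it pairs each attached file with one of the four lock roles above and writes the sentence that tells the model how to use it. The `ReferenceAsset` class, the extension sets and the instruction wording are all assumptions made for illustration.

```python
from dataclasses import dataclass
from pathlib import Path

# Assumed media types per lock role; adjust to whatever the real model accepts.
ROLE_MEDIA = {
    "character": {".png", ".jpg", ".jpeg"},
    "style": {".png", ".jpg", ".jpeg"},
    "motion": {".mp4", ".mov"},
    "beat": {".mp3", ".wav"},
}
ROLE_INSTRUCTION = {
    "character": "keep the protagonist's identity consistent with",
    "style": "match the colour grade and composition of",
    "motion": "mimic the camera move or action in",
    "beat": "cut visuals to the beat of",
}


@dataclass
class ReferenceAsset:
    role: str  # one of the keys in ROLE_MEDIA
    path: str  # local path of the file you plan to attach


def describe_references(assets: list[ReferenceAsset]) -> str:
    """Build one instruction line per attached reference asset."""
    lines = []
    for index, asset in enumerate(assets, start=1):
        if asset.role not in ROLE_MEDIA:
            raise ValueError(f"unknown role: {asset.role}")
        file = Path(asset.path)
        if file.suffix.lower() not in ROLE_MEDIA[asset.role]:
            raise ValueError(f"{file.name} is not a valid {asset.role} reference")
        lines.append(
            f"Reference {index} ({asset.role} lock): "
            f"{ROLE_INSTRUCTION[asset.role]} {file.name}."
        )
    return "\n".join(lines)


references = describe_references([
    ReferenceAsset("character", "refs/protagonist.png"),
    ReferenceAsset("beat", "refs/track.wav"),
])
print(references)
```

Naming each reference and its purpose in the prompt text avoids leaving the model to guess whether an attached image is meant as a character lock or a style lock.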

Edit in-chat instead of regenerating

The single biggest workflow shift Omni is rumoured to introduce is direct in-chat editing. Rather than regenerating an entire clip when one element is wrong, you can ask:

“Swap the watch on the model’s wrist for a brushed silver chronograph. Keep all other framing, lighting and audio exactly the same.”

“Slow the camera move by 30% and shift the colour temperature 200 K warmer.”

“Remove the bell strike at 0:07 and add a soft ambient swell from 0:08 to 0:10 instead.”

This mirrors how Nano Banana redefined the image editing experience in 2025. The implication for prompt craft is significant: your first prompt no longer needs to be perfect. Generate a strong base, then steer it. If targeted edits work as rumoured, that pattern should also cost less compute than regenerating the whole clip each time.
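The edit requests above share one shape: a single specific change followed by an explicit list of what must stay untouched. A minimal sketch of that shape follows; `edit_instruction` is a hypothetical helper, and whether Omni honours the "keep" clause is something only hands-on testing can confirm.

```python
def edit_instruction(change: str, keep: list[str]) -> str:
    """Combine one specific change with an explicit list of preserved elements."""
    if not change.strip():
        raise ValueError("describe the change you want")
    if not keep:
        raise ValueError("name at least one element to preserve")
    # Join as "a, b and c" to match the phrasing used in the examples.
    if len(keep) == 1:
        preserved = keep[0]
    else:
        preserved = ", ".join(keep[:-1]) + " and " + keep[-1]
    return f"{change.strip().rstrip('.')}. Keep {preserved} exactly the same."


request = edit_instruction(
    "Slow the camera move by 30%",
    ["framing", "lighting", "audio"],
)
print(request)
```

Requiring a non-empty `keep` list is a deliberate constraint: it forces every edit to state what should not move, which is the part most often left out.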

Five Omni-specific prompt patterns to copy

A starter pack of patterns that map well to the model’s strengths:

1. Product hero

Generate a [duration] [aspect-ratio] hero shot of [product], [lighting], [camera move]. Audio: [ambient bed] with [signature sound] at [timecode]. Reference: [attach product photo].

2. Reel / Short with on-mic dialogue

9:16, [duration]. Subject delivers the line “[short copy]” directly to camera in a [setting]. Lip-sync precise. Background ambient: [environment sound]. Match the rhythm of [reference audio].

3. Music video cut

Generate [duration] of [subject] performing [action] to the attached music track. Cut visuals on the beat. Maintain character consistency across the clip. Lighting follows the track’s energy curve.

4. Cinematic short building block

10-second omni-clip: [subject] [action] in [environment]. Continuous [lighting setup]. Hold the audio bed across the cut so this clip can be chained with the previous one (attached).

5. Conversational edit

Take the previous generation and [specific change]. Keep [list of preserved elements] unchanged. Confirm the change took effect on [specific frame or timecode].
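The bracketed slots in these patterns can be filled programmatically, so that no placeholder is sent to the model unfilled. The sketch below assumes the `[slot name]` convention used above; `fill` is a hypothetical helper, not a library function.

```python
import re

# Matches a bracketed slot such as [duration] or [camera move].
PLACEHOLDER = re.compile(r"\[([^\[\]]+)\]")


def fill(template: str, values: dict[str, str]) -> str:
    """Replace every [slot] in the template, failing loudly on unfilled slots."""
    missing = [name for name in PLACEHOLDER.findall(template) if name not in values]
    if missing:
        raise KeyError(f"unfilled placeholders: {missing}")
    return PLACEHOLDER.sub(lambda match: values[match.group(1)], template)


PRODUCT_HERO = (
    "Generate a [duration] [aspect-ratio] hero shot of [product], "
    "[lighting], [camera move]."
)

prompt = fill(PRODUCT_HERO, {
    "duration": "10-second",
    "aspect-ratio": "16:9",
    "product": "matte-black headphones",
    "lighting": "golden-hour back-lighting",
    "camera move": "slow tracking shot",
})
print(prompt)
```

Raising an error on a missing slot is safer than sending a literal `[timecode]` to the model, which it may try to interpret as part of the scene.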

What to test on day one

When you finally get hands-on with Omni, four tests will tell you most of what you need to know:

  1. Text rendering on screen: does writing on a chalkboard or sign stay legible across the full clip?
  2. Lip sync on spoken dialogue: do mouth shapes match the spoken audio within a single generation, without a separate lip-sync pass?
  3. Multi-clip continuity: chain two 10-second omni-clips and check that characters, lighting and audio bed actually persist.
  4. Reference fidelity: does a reference image lock character identity, or only suggest it?

If Omni nails three of those four, your prompt library is suddenly more valuable than your tool stack. Plan accordingly.