Over the last month I've been testing Wan 2.6's multi‑shot prompt flow to see if I can keep characters, lighting, and on‑screen text consistent across a short sequence. If you've been looking for a clear, repeatable way to write a Wan 2.6 multi shot prompt that actually holds together, this is my field guide. I'll show my steps, the phrases I reuse, and where multi‑shot breaks so you can fix it fast.

What Is Wan 2.6 Multi-shot Video?


Multi‑shot video generation in Wan 2.6 lets you define several sequential shots in one project so the model tries to maintain character identity, wardrobe, lighting, and set dressing across cuts. Think of it as story blocks: Shot 1 (establish), Shot 2 (action), Shot 3 (reaction/CTA). Instead of a single, pretty clip, you get a small scene with continuity.

What it's not: it's not perfect continuity. No model is. Diffusion models sample frames from noise using a guidance signal (CFG/guidance scale) and a seed. If the seed changes or the prompt drifts, small features wander: typography, earrings, even a logo angle. So the job is to give the model rails: a shot list, connected prompts, and, when needed, a reference image or reference video.

Where Wan 2.6 is strong in my tests: facial stability over 2–4 shots, consistent wardrobe colors, and lighting direction when I explicitly lock it. Where it's fragile: precise text on props (labels, screen UIs) and tiny brand marks. For marketing work, I treat the sequence as the canvas and the final frame text as a compositing step in editing if it won't hold.

Planning Your Shot List

Before I ever write a Wan 2.6 multi shot prompt, I sketch the sequence on paper. It sounds slow, but it makes the generation faster. Here's how I block it.

Story Structure (Beginning–Middle–End)

I keep it simple:

  • Beginning (setup): Where are we? Who are we with? What's the mood? 2–3 seconds.
  • Middle (action/change): Product use, a key gesture, or a reveal. 3–5 seconds.
  • End (reaction/CTA): Expression, result, or text overlay. 2–3 seconds.

Shot Types (Wide, Medium, Close)

  • Shot 1: Wide to establish space and lighting direction. I state the key light position in words: "warm key light from camera left, soft practicals in background."
  • Shot 2: Medium for action. Keep the camera on the same side to preserve screen direction.
  • Shot 3: Close for emotion or details. I repeat wardrobe and prop descriptors exactly.

These labels matter because the model maps "wide/medium/close" to framing patterns. Consistency language reduces drift.

Shot List Template


I keep a compact template I can paste into Wan 2.6:

  • Shot 1 (2s): Wide. Location, time of day, lighting direction, wardrobe summary, character name.
  • Shot 2 (3s): Medium. Action verb, hands/props, same lighting direction, camera movement.
  • Shot 3 (2s): Close. Expression, the most important object, background blur level, CTA.

Example shot list for a coffee ad:

  • Shot 1: Wide, sunlit kitchen, morning. Warm key from left, cool fill from window. Character: "Mia, late‑20s, curly brown bob, olive sweater."
  • Shot 2: Medium. Mia pours coffee into a glass mug; steam visible. Same lighting; slow push‑in.
  • Shot 3: Close. Steam swirls; Mia smiles. "Text on screen: Freshly brewed." Simple sans‑serif.
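If you fill this template often, it helps to treat it as data rather than prose. Here's a minimal sketch of the shot list above as a Python structure; the field names are my own convention, not a Wan 2.6 API:

```python
from dataclasses import dataclass

# Anchors I repeat verbatim in every shot (more on why below).
CHARACTER = "Mia, late-20s, curly brown bob, olive sweater"
LIGHTING = "warm key from left, cool fill from window"

@dataclass
class Shot:
    number: int
    duration_s: int
    framing: str   # "wide" | "medium" | "close"
    action: str
    extras: str = ""

# The coffee-ad shot list from the text, as data.
shots = [
    Shot(1, 2, "wide", "sunlit kitchen, morning", "calm atmosphere"),
    Shot(2, 3, "medium", "Mia pours coffee into a glass mug, steam visible", "slow push-in"),
    Shot(3, 2, "close", "steam swirls, Mia smiles", "text on screen: Freshly brewed"),
]

def render(shot: Shot) -> str:
    """Turn one shot entry into a paste-ready prompt line."""
    return (f"Shot {shot.number}, {shot.framing}. {shot.action}. "
            f"Character: {CHARACTER}. {LIGHTING}. {shot.extras}. "
            f"Duration {shot.duration_s}s.")

for s in shots:
    print(render(s))
```

The payoff is that wardrobe and lighting live in exactly one place, so they can't silently diverge between shots.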

Writing Connected Prompts

If multi‑shot fails, it's usually because the prompts don't share anchors. I write each shot's prompt as a variation of the same base string, with only the action and framing changing.

Character Consistency Keywords

  • Use the same name. I literally add a fake name: "Character: Mia."
  • Repeat hair, wardrobe, and signature prop words verbatim across shots: "curly brown bob, olive sweater, clear glass mug."
  • Lock palette words: "muted neutrals, warm highlights."

I avoid synonyms mid‑sequence. If Shot 1 says "olive sweater," Shot 3 does not say "green knit." Diffusion treats new words as new possibilities.
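To make "no synonyms mid‑sequence" mechanical rather than a matter of discipline, I build every shot prompt from one shared anchor string. A small sketch (names and structure are mine, not anything Wan 2.6 requires):

```python
# One source of truth for identity words; every shot interpolates
# this exact string, so "olive sweater" can never become "green knit".
ANCHORS = {
    "name": "Character: Mia",
    "look": "curly brown bob, olive sweater, clear glass mug",
    "palette": "muted neutrals, warm highlights",
}

def shot_prompt(framing: str, action: str) -> str:
    anchor_block = ". ".join(ANCHORS.values())
    return f"{framing}. {action}. {anchor_block}."

s1 = shot_prompt("Shot 1, wide", "Sunlit kitchen, morning")
s3 = shot_prompt("Shot 3, close", "Steam swirls, Mia smiles")
```

Both prompts now contain character-identical anchor text, which is exactly what the model needs to keep Mia looking like Mia.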

Scene Continuity Phrases

These are phrases I paste into every shot:

  • "Same kitchen set, same time of day."
  • "Warm key light from camera left; practical lights in background."
  • "Camera remains on screen‑right side; maintain screen direction."
  • "Background objects persistent: white tile, wooden counter, silver kettle."

For AI images with accurate text on screen, I also anchor the font: "simple sans‑serif, high contrast, center‑lower third." The model isn't a typesetter, but it tries. If exact typography matters, I render text later.

Transition Descriptions

Describe what links shots:

  • "Cut from wide to medium: maintain steam continuity."
  • "Slow push‑in continues in Shot 2."
  • "Match cut on hand movement."

Connected transitions reduce abrupt re‑layouts between shots. It reads like director's notes, and Wan 2.6 seems to respect that structure better than loose prose. This is a small thing, but it's saved me time, which is the real currency when you're an indie creator.

Using Reference Videos


Reference videos act like training wheels. When I upload a short reference (3–5 seconds), Wan 2.6 borrows motion cues, pacing, and sometimes camera rhythm without copying subject identity. Here's what they do well:

What Reference Videos Do

  • Stabilize motion: a gentle dolly or hand movement carries across shots if the reference suggests it.
  • Reinforce lighting feel: if the reference is backlit and warm, the model leans that way.
  • Anchor prop physics: steam, pouring, hair sway are more believable.

Limits: references won't guarantee brand‑safe logos or pixel‑perfect label text. If your priority is readable labels, generate the cleanest motion and composite text in editing. That's still the fastest path to production for AI tools for designers under deadline.

Multi-shot Workflow (Step-by-Step)


Here's my exact flow for a Wan 2.6 multi shot prompt that holds together.

Step 1: Script & Shot List

  • Write a 2–3 sentence script describing the arc.
  • Fill the shot list template (wide/medium/close, duration, action).
  • Define character anchors (name, hair, wardrobe, prop) and lighting anchor ("warm key left, cool fill right").
  • Plan any on‑screen text. If it must be perfect, mark it for editing.

I keep my technical settings nearby. For most sequences: 5–7 seconds total, 24 fps, CFG/guidance scale 5–7, and a fixed seed per sequence. Lower guidance sometimes helps natural motion; higher guidance helps hold identity.
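Those defaults fit in a tiny config block I keep next to my shot list. The key names below are my own shorthand, not Wan 2.6's actual parameter names:

```python
# My sequence defaults (values from the text above; key names are mine).
SEQUENCE_SETTINGS = {
    "total_duration_s": (5, 7),  # whole sequence, min-max
    "fps": 24,
    "guidance_scale": 6,         # I stay in 5-7: lower = looser motion,
                                 # higher = tighter identity
    "seed": 12345,               # fixed across every shot in the sequence
}
```

The one setting I never vary mid‑sequence is the seed; everything else is negotiable shot by shot.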

Step 2: Generate Each Shot

I generate shot by shot, not all at once, so I can fix drift early. Example prompts:

Shot 1 prompt:

"Shot 1, wide. Kitchen, morning. Character: Mia, late‑20s, curly brown bob, olive sweater. Warm key light from camera left; practicals glowing in background. Muted neutrals, clean counter, white tile, silver kettle. Calm atmosphere. Duration 2s, static camera. Keep seed 12345. Guidance 6. Negative: distorted text, extra hands."

Shot 2 prompt:

"Shot 2, medium. Same kitchen set, same time of day. Mia pours coffee into a clear glass mug; visible steam. Camera remains screen‑right; slow push‑in begins. Warm key from camera left; same wardrobe and hair. Duration 3s. Keep seed 12345. Guidance 6."

Shot 3 prompt:

"Shot 3, close. Same set and wardrobe. Steam swirls; Mia smiles. Background softly blurred. On‑screen text: Freshly brewed, simple sans‑serif, center‑lower third, high contrast. Duration 2s. Keep seed 12345. Guidance 6. Negative: warped letters."

If Wan 2.6 supports a multi‑segment panel, I stack those prompts as segments and tick "lock seed across shots." If not, I render separately and align in the editor.
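Either way, the loop is the same: every shot goes out with the same seed and guidance. A sketch of that loop, where `generate()` is a stand‑in for whatever render call or segment panel you actually use:

```python
# Stand-in for a Wan 2.6 render call or a multi-segment panel entry;
# the real interface will differ -- this only shows the seed-lock pattern.
SEED = 12345
GUIDANCE = 6

shot_prompts = [
    "Shot 1, wide. Kitchen, morning. Character: Mia, ...",
    "Shot 2, medium. Same kitchen set, same time of day. ...",
    "Shot 3, close. Same set and wardrobe. ...",
]

def generate(prompt: str, seed: int, guidance: int) -> dict:
    # Placeholder: returns the request instead of rendering anything.
    return {"prompt": prompt, "seed": seed, "guidance": guidance}

clips = [generate(p, SEED, GUIDANCE) for p in shot_prompts]
```

The point is structural: the seed and guidance are defined once, above the loop, so a shot can't accidentally render with its own values.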

Step 3: Review Consistency

I check three things:

  • Identity: eyes, hairline, sweater texture. If it drifts, I raise guidance slightly or add a single reference frame from Shot 1.
  • Lighting: does the key still come from left? If not, I add "strong left‑side highlights, right‑side soft shadows."
  • Text: is it legible? If not, I stop trying to fix it in‑model and plan to composite. That's fastest.

For realism, I also look at hand‑to‑prop contact. A tiny mismatch is acceptable for social ads. If it's distracting, I regenerate Shot 2 with a shorter action description like "lifts mug, gentle steam," which is easier for the model.

Step 4: Edit & Assemble

I bring all shots into a timeline, trim to beats, match color between shots (warm up mids, keep highlights consistent), and add the on‑screen text as vector where necessary. This hybrid approach is how I get near‑perfect results quickly; the real "best AI image generator for text" is sometimes your editor.

If audio matters, I add a soft coffee pour SFX and a two‑note music sting on the CTA. Small, human touches sell the sequence, and they're cheap timewise.

Common Multi-shot Mistakes


Inconsistent Character Descriptions

Switching adjectives mid‑sequence ("olive sweater" to "green knit") invites drift. Reuse the same string every time. If you need variation (like a jacket on/off), declare it explicitly: "same wardrobe, jacket removed." Also, lock accessories: "no earrings," or you'll get surprise jewelry.

Ignoring Lighting Continuity

I used to forget to restate light direction in every shot. When I don't repeat "warm key from camera left," the key flips. That break is jarring. Bake direction, color temperature words ("warm," "cool"), and ambience ("practicals glowing") into each prompt. It's boring, but it works.

3 Multi-shot Templates

Here are three copy‑pasteable templates I've used. Tweak nouns, keep the anchors.

Template 1: Product Ad (Hook–Demo–CTA)

  • Shot 1 (Hook, 2s): "Wide. Clean desk by a window, morning. Character: Alex, 30s, short black hair, white tee. Product: matte black smartwatch on wrist. Warm key from camera left; soft daylight fill. Calm, premium mood. Duration 2s. Guidance 6. Negative: warped logos."
  • Shot 2 (Demo, 3s): "Medium. Same set/time. Alex lifts wrist; watch screen wakes. Camera slow push‑in. On‑screen text attempt: 7‑day battery. Keep seed. Guidance 6. Maintain screen direction."
  • Shot 3 (CTA, 2s): "Close. Watch fills frame at 3/4 angle; soft bokeh. Add crisp overlay text in editor: ‘Ready when you are.’ Simple sans‑serif, lower third."

Notes: Keep the brand mark clean in editing. For realistic AI images for marketing, I rarely trust in‑model logos.

Template 2: Mini Tutorial

  • Shot 1 (Intro, 2s): "Wide. Kitchen counter, noon. Character: Priya, shoulder‑length wavy hair, blue apron. Warm key from camera left. ‘We're making iced tea.' Duration 2s."
  • Shot 2 (Step, 3s): "Medium top‑down. Same set. Pour tea over ice; lemon slice drops. Keep seed. Guidance 5 for natural motion. Negative: extra fingers."
  • Shot 3 (Result, 2s): "Close. Condensation on glass; crisp detail. On‑screen steps appear as clean overlay in editor."

Notes: Top‑down shots can change identity cues; restate wardrobe and hair even if they're not visible, to stabilize the model.

Template 3: Micro Drama

  • Shot 1 (Setup, 2s): "Wide. Rainy street at night. Character: Sam, 40s, stubble, dark coat, red scarf. Neon reflections, cool blue ambience, warm key from shop window camera left."
  • Shot 2 (Turn, 3s): "Medium. Same street/time. Sam notices a lost glove on a bench; reaches for it. Keep seed; slow push‑in continues."
  • Shot 3 (Beat, 2s): "Close. Sam smiles, picks up glove; raindrops on scarf. Background bokeh neon. Optional subtle subtitle added in editor."

If you need AI images with accurate text in Shot 3 (a store sign, for example), I duplicate the last frame and patch text as vector. Fast, clean, client‑safe.


One last thing: if Wan 2.6 adds better text controls tomorrow, I'll test them and update my wording. Until then, this combo of anchored prompts, steady lighting, and light editing is what gets me production‑ready results without a week of trial‑and‑error. If you want to prototype consistent visuals or lock down readable text before running multi-shot video, I often start with stills first. Tools like Z-Image.ai are useful here: it's fast, free, and good at holding typography and style while you test concepts.