VideoMemory: Toward Consistent Video Generation via Memory Integration

1HKUST(GZ)   2HKUST   3ByteDance
*Equal contribution   Project leader   Corresponding author
Teaser

From a single story prompt, VideoMemory generates coherent multi-shot videos using dynamic Character, Prop, and Background Memory Banks. This yields strong entity consistency: for example, the feather prop remains stable across distant shots (e.g., shots 2, 10, and 12) despite significant changes in scene and viewpoint.

Please enable sound for the best experience

The first half of this video illustrates the working mechanism of VideoMemory, while the second half showcases a complete case result.

Abstract

Maintaining consistent characters, props, and environments across multiple shots is a central challenge in narrative video generation. Existing models can produce high-quality short clips but often fail to preserve entity identity and appearance when scenes change or when entities reappear after long temporal gaps. We present VideoMemory, an entity-centric framework that integrates narrative planning with visual generation through a Dynamic Memory Bank. Given a structured script, a multi-agent system decomposes the narrative into shots, retrieves entity representations from memory, and synthesizes keyframes and videos conditioned on these retrieved states. The Dynamic Memory Bank stores explicit visual and semantic descriptors for characters, props, and backgrounds, and is updated after each shot to reflect story-driven changes while preserving identity. This retrieval–update mechanism enables consistent portrayal of entities across distant shots and supports coherent long-form generation. To evaluate this setting, we construct a 54-case multi-shot consistency benchmark covering character-, prop-, and background-persistent scenarios. Extensive experiments show that VideoMemory achieves strong entity-level coherence and high perceptual quality across diverse narrative sequences.

Contributions

  • We introduce VideoMemory, an entity-centric framework for script-to-multi-shot video generation built around a Dynamic Memory Bank that explicitly tracks and updates the visual and semantic states of narrative entities.
  • We develop a retrieve-and-update memory mechanism with dedicated slots for characters, props, and environments, enabling long-range entity consistency while allowing story-guided changes in their visual state.
  • We construct a multi-shot consistency benchmark with 54 story-driven cases covering character-, prop-, and background-persistent scenarios, providing the first structured evaluation protocol for long-range entity coherence in narrative videos.

Method & Pipeline

VideoMemory Pipeline

The framework of the proposed VideoMemory. Starting from a script synopsis, our system plans shot-level descriptions, interacts with a Dynamic Memory Bank to retrieve or create entity references, generates keyframes, and finally synthesizes a coherent multi-shot video.
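
To make the memory interaction concrete, below is a minimal Python sketch of how the Dynamic Memory Bank could be organized, with dedicated slots for characters, props, and backgrounds. The class and field names (MemoryEntry, DynamicMemoryBank, retrieve, update) are illustrative assumptions and not the released implementation.

```python
# Minimal sketch of a Dynamic Memory Bank with per-category slots.
# All names and fields are illustrative; the actual implementation may differ.
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class MemoryEntry:
    """Explicit visual + semantic state stored for one narrative entity."""
    name: str                  # e.g. "red kite"
    category: str              # "character" | "prop" | "background"
    attributes: str            # current semantic descriptor (clothing, damage, lighting, ...)
    reference_image: Optional[str] = None  # path to the latest visual reference


@dataclass
class DynamicMemoryBank:
    """Dedicated slots for characters, props, and backgrounds."""
    banks: Dict[str, Dict[str, MemoryEntry]] = field(
        default_factory=lambda: {"character": {}, "prop": {}, "background": {}}
    )

    def retrieve(self, name: str, category: str) -> Optional[MemoryEntry]:
        """Return the stored state of an entity if it has appeared before."""
        return self.banks[category].get(name)

    def update(self, entry: MemoryEntry) -> None:
        """Write back the post-shot state, preserving identity while
        reflecting story-driven changes to attributes or reference image."""
        self.banks[entry.category][entry.name] = entry
```

In this sketch, a retrieval hit returns the stored reference image to condition keyframe generation for the new shot; on a miss, a new reference would be generated from the shot caption and written back via update, mirroring the retrieve-or-create behavior described above.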

Algorithm: VideoMemory Generation Pipeline

Input: Script synopsis S
Output: Multi-shot video V = {V_i}_{i=1..N}
 1: {C_i}_{i=1..N} ← StoryboardAgent(S)
 2: Initialize memory banks M_char, M_prop, M_bg ← ∅
 3: for shot i = 1 to N do                      ▷ Process shots sequentially
 4:     {(e_ij, a_ij, c_ij)}_j ← MemoryAgent.Analyze(C_i)
 5:     for each (e_ij, a_ij, c_ij) do
 6:         (I_ij^ref, M) ← RetrieveOrGenerate(e_ij, a_ij, c_ij, M)
 7:     end for
 8:     I_i^key ← VisualizationAgent.Keyframe(C_i, {I_ij^ref}_j)
 9:     V_i ← I2VModel(I_i^key, C_i)
10: end for
11: V ← {V_i}_{i=1..N}

Notation:
S: script synopsis
N: number of shots
V: generated multi-shot video
V_i: video for shot i
C_i: caption for shot i
M_char, M_prop, M_bg: memory banks for characters, props, and backgrounds (collectively M)
e_ij: name of entity j in shot i
a_ij: attributes of entity j in shot i
c_ij: category of entity j in shot i
I_ij^ref: reference image for entity j in shot i
I_i^key: keyframe for shot i
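
For concreteness, the following is a minimal Python transcription of the loop above. The agent callables (storyboard_agent, analyze_entities, retrieve_or_generate, keyframe_agent, i2v_model) are placeholders for the StoryboardAgent, MemoryAgent, VisualizationAgent, and I2V model; only the control flow is taken from the pseudocode, and all names are illustrative.

```python
# Sketch of the VideoMemory generation loop; agent implementations are stubs.
from typing import Callable, Dict, List, Tuple

Entity = Tuple[str, str, str]  # (name e_ij, attributes a_ij, category c_ij)


def generate_multishot_video(
    synopsis: str,
    storyboard_agent: Callable[[str], List[str]],                    # S -> {C_i}
    analyze_entities: Callable[[str], List[Entity]],                 # C_i -> {(e, a, c)}
    retrieve_or_generate: Callable[[Entity, Dict], Tuple[object, Dict]],
    keyframe_agent: Callable[[str, List[object]], object],           # (C_i, refs) -> I_i^key
    i2v_model: Callable[[object, str], object],                      # (I_i^key, C_i) -> V_i
) -> List[object]:
    captions = storyboard_agent(synopsis)                            # step 1
    memory: Dict = {"character": {}, "prop": {}, "background": {}}   # step 2
    videos = []
    for caption in captions:                                         # step 3: shots in order
        entities = analyze_entities(caption)                         # step 4
        references = []
        for entity in entities:                                      # steps 5-7
            ref_image, memory = retrieve_or_generate(entity, memory)
            references.append(ref_image)
        keyframe = keyframe_agent(caption, references)               # step 8
        videos.append(i2v_model(keyframe, caption))                  # step 9
    return videos                                                    # step 11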

Qualitative Results


Qualitative comparison demonstrating superior entity consistency. Across all three subclasses (Character, Prop, Background), VideoMemory (bottom row) maintains remarkable stability where baselines fail. Note how baselines exhibit severe identity drift—changing a character's appearance (left), morphing a red kite into other objects (middle), and altering a garage's layout (right). In contrast, our method preserves the identity of all entities across distant shots, a direct result of our explicit memory management.

Quantitative Results

Multi-shot consistency results. We evaluate character, prop, and background consistency using DINOv2 similarity. Our method achieves superior performance across all metrics, especially as the number of shots increases. Best and second-best scores are marked in bold and underlined, respectively.

Method                  | Character Consistency↑   | Prop Consistency↑        | Background Consistency↑
(shot number)           |  4     8     12    Avg.  |  4     8     12    Avg.  |  4     8     12    Avg.
Wan2.2                  | 0.34   -     -     -     | 0.48   -     -     -     | 0.25   -     -     -
EchoShot                | 0.45  0.44   -     -     | 0.59  0.51   -     -     | 0.54  0.37   -     -
IC-LoRA+Wan2.2          | 0.42  0.55  0.43  0.47   | 0.50  0.44  0.34  0.43   | 0.31  0.33  0.29  0.31
StoryDiffusion+Wan2.2   | 0.53  0.62  0.46  0.54   | 0.43  0.47  0.52  0.47   | 0.51  0.40  0.36  0.42
VGoT+Wan2.2             | 0.59  0.53  0.60  0.57   | 0.48  0.22  0.24  0.31   | 0.53  0.36  0.47  0.45
VideoMemory (Ours)      | 0.61  0.65  0.64  0.63   | 0.69  0.50  0.55  0.58   | 0.71  0.72  0.73  0.72
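
As a rough illustration of the metric, the sketch below computes a DINOv2 cosine similarity between two entity crops using the publicly available facebook/dinov2-base checkpoint from Hugging Face. The paper's exact cropping, pairing, and aggregation protocol is not reproduced here; this is an assumption-labeled example, not the evaluation code.

```python
# Sketch of a DINOv2-based consistency score between two entity crops,
# assuming the Hugging Face "facebook/dinov2-base" checkpoint.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model = AutoModel.from_pretrained("facebook/dinov2-base").eval()


@torch.no_grad()
def dino_similarity(img_a: Image.Image, img_b: Image.Image) -> float:
    """Cosine similarity between DINOv2 CLS embeddings of two entity crops."""
    inputs = processor(images=[img_a, img_b], return_tensors="pt")
    feats = model(**inputs).last_hidden_state[:, 0]   # CLS token per image
    feats = torch.nn.functional.normalize(feats, dim=-1)
    return float((feats[0] * feats[1]).sum())
```

Averaging such pairwise similarities over all appearances of an entity across shots would give a per-entity consistency score in the spirit of the table above.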

Video Showcase

Multi-shot video generation results demonstrating consistent entity preservation across diverse narratives.

Please enable sound for the best experience

BibTeX

@article{zhou2025videomemory,
  title={VideoMemory: Toward Consistent Video Generation via Memory Integration},
  author={Zhou, Jinsong and Du, Yihua and Xu, Xinli and Wang, Luozhou and Zhuang, Zijie and Zhang, Yehang and Li, Shuaibo and Hu, Xiaojun and Su, Bolan and Chen, Ying-cong},
  journal={arXiv preprint},
  year={2025}
}