Building an R&D Gen AI Pipeline for The Bends
By: Khayyam Khan and Carin Mazaira
As generative AI tools continue to evolve, filmmakers are experimenting with how these technologies can fit into real production environments. In the short film The Bends, the team explored one possible approach to building a Gen AI–assisted workflow that prioritized creative intent, visual consistency, and artist control, even as the underlying technology remained in flux. We interviewed Khayyam Khan, Head of AI on The Bends, to learn about his AI workflow for the short film. Keep in mind: because the AI landscape changes so quickly, many of these processes have evolved since the team worked on the film.
From LookDev to Video: The Core Loop
The workflow began with Look Development (LookDev) to establish the visual language of the film. Using Google Imagen, the team generated still imagery to explore environments, lighting, mood, and early character designs. Working in stills allowed for rapid iteration without committing to motion too early.
This phase also revealed limitations. Imagen struggled to consistently generate the film’s main character, a rare fish known as a Blob Sculpin, or blob fish. To address this, the team took licensed images of the fish and passed them through Meshy (an image-to-3D service) to create a 3D mesh of the character, which an artist then sculpted to the director’s liking. This 3D model became the primary source for all character reference images generated in Imagen, and was also used to create datasets for training LoRAs.
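A LoRA training set built from a 3D model typically pairs rendered views of the character with captions. The sketch below shows one minimal way to assemble such a dataset manifest; the filename convention (angle encoded in the name) and the trigger word are illustrative assumptions, not details from the film's actual pipeline.

```python
from pathlib import Path

def build_lora_dataset(render_dir: str, trigger_word: str = "blobsculpin") -> list[dict]:
    """Pair rendered character views with captions for LoRA training.

    Assumes renders are named like 'blob_angle045.png'; both the naming
    convention and the trigger word are hypothetical placeholders.
    """
    entries = []
    for img in sorted(Path(render_dir).glob("*.png")):
        # Encode the camera angle from the filename into the caption so
        # each training sample is self-describing.
        angle = img.stem.split("angle")[-1] if "angle" in img.stem else "unspecified"
        entries.append({
            "image": img.name,
            "caption": f"a photo of {trigger_word}, a blob sculpin fish, "
                       f"viewed from {angle} degrees",
        })
    return entries
```

In practice the manifest would then be handed to whatever LoRA trainer the team used, with captions tuned to the model being fine-tuned.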
Once characters and environments were defined, they were combined to create key stills for each shot. These images served as visual anchors for the next stage of the pipeline. Each key still was analyzed using a vision-enabled language model that had knowledge of the screenplay and the shot list. The LLM then generated a detailed natural-language description of the image, using the story context to craft the prompt. This captioning step made it possible to refine prompts based on what was actually present in the frame. The prompts and images were then passed into Veo for image-to-video generation.
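The hand-off from caption to video prompt can be sketched as a simple assembly step: the vision model's description of the key still is merged with the relevant story and shot context before being sent to the video model. The template below is an illustrative assumption; on the film, the LLM itself composed the final prompts.

```python
def build_video_prompt(caption: str, scene_context: str, shot_notes: str) -> str:
    """Combine a vision model's caption of a key still with screenplay
    context into a single image-to-video prompt.

    The template structure is a hypothetical sketch of the step described
    in the article, not the production prompt format.
    """
    return (
        f"{caption.strip()} "
        f"Story context: {scene_context.strip()} "
        f"Camera and motion: {shot_notes.strip()}"
    )

prompt = build_video_prompt(
    caption="A blob sculpin hovers above a rocky seabed in dim blue light.",
    scene_context="The fish senses the approaching current and hesitates.",
    shot_notes="Slow push-in, gentle particulate drift.",
)
```

The resulting string, together with the key still, would then be submitted to the image-to-video model.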
“Using key stills as constraints dramatically stabilized video generation and helped move away from the slot-machine nature of the models.” - Khayyam Khan
Orchestrating AI with Custom Tools
Rather than using a standard prompt interface, ComfyUI was used as an orchestration layer. The team built custom nodes that leveraged Imagen’s subject customization capabilities, allowing multiple reference images of the lead character to be used for consistent generations. Prior to this feature, reference images were typically limited to open-source models, which often restricted resolution. Imagen’s ability to generate outputs at 2K and above made it particularly useful for production-quality visuals.
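For readers unfamiliar with ComfyUI's extension mechanism, a custom node is a Python class following a small set of conventions (INPUT_TYPES, RETURN_TYPES, FUNCTION, CATEGORY). The skeleton below shows that shape for a node accepting a prompt plus multiple reference images; the actual Imagen call is stubbed out, since the team's node is not public and this is only an illustration of the structure.

```python
class SubjectReferenceNode:
    """Minimal ComfyUI-style custom node skeleton (illustrative only)."""

    @classmethod
    def INPUT_TYPES(cls):
        # ComfyUI introspects this classmethod to build the node's UI.
        return {
            "required": {
                "prompt": ("STRING", {"multiline": True}),
                "reference_images": ("IMAGE",),
            }
        }

    RETURN_TYPES = ("IMAGE",)
    FUNCTION = "generate"
    CATEGORY = "custom/subject_reference"

    def generate(self, prompt, reference_images):
        # In a production node, this is where the subject-customization
        # request with the reference images would be issued.
        result = self._call_model(prompt, reference_images)
        return (result,)

    def _call_model(self, prompt, reference_images):
        # Stub: echo inputs so the node can be exercised offline.
        return {"prompt": prompt, "n_refs": len(reference_images)}
```

Registering the class in a custom-node package's NODE_CLASS_MAPPINGS makes it available in the ComfyUI graph like any built-in node.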
Bridging AI Outputs and Traditional Pipeline
The team designed a Chrome extension to streamline the handoff from Veo to traditional production tracking tools like ShotGrid. The goal was to allow artists to select scenes and shots directly within the generation UI and automatically submit outputs to the appropriate ShotGrid entries.
While this approach worked in theory and promised to reduce friction, it was not sufficiently battle-tested for production. As a result, artists manually uploaded their selected outputs, supported by a dedicated ShotGrid coordinator who ensured assets remained organized and moving smoothly through the pipeline.
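On the ShotGrid side, a submission like the one the extension aimed to automate boils down to creating a Version entity linked to the right Shot, then uploading the media. The helper below builds such a payload using ShotGrid's standard Version fields; the IDs, naming convention, and description text are placeholders, and in a real pipeline the dict would be passed to the ShotGrid Python API (e.g. shotgun_api3's create call) followed by a file upload.

```python
def version_payload(project_id: int, shot_id: int, clip_name: str) -> dict:
    """Build the entity payload for creating a ShotGrid Version.

    Field names follow ShotGrid's standard Version schema; the IDs and
    the clip naming convention here are hypothetical examples.
    """
    return {
        "project": {"type": "Project", "id": project_id},
        "entity": {"type": "Shot", "id": shot_id},
        "code": clip_name,
        "description": "Selected Veo output, submitted from the generation UI",
    }

payload = version_payload(project_id=1, shot_id=42, clip_name="sh010_veo_v001")
```

Keeping submissions in this shape is also what lets a coordinator audit the pipeline: every generated clip lands as a Version attached to its Shot.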
Workflow Priorities
Across the project the team focused on:
Character and scene consistency across shots
Natural language workflows that matched how artists think and speak
Fast iteration without sacrificing control
Enabling artist intent as the driver of the visual outcome
These criteria shaped tool choices, automation efforts, and the balance between AI and traditional 3D methods.
What Worked and What Didn’t
Several aspects exceeded expectations:
A RAG-based agent aware of the screenplay and shot list reduced prompt overhead
Using key stills as constraints stabilized video generation
Blender provided permanence and camera freedom across environments
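The screenplay-aware agent mentioned above hinges on a retrieval step: given a shot, pull the most relevant screenplay passages into the prompt context. The toy function below scores chunks by word overlap purely to show the shape of that step; the film's agent used a real RAG stack, and this scoring method is a simplifying assumption.

```python
def retrieve_context(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Toy retrieval step for a screenplay-aware agent (illustrative).

    Scores screenplay chunks by naive word overlap with the query; a
    production RAG system would use embeddings instead.
    """
    query_words = set(query.lower().split())
    scored = sorted(
        chunks,
        key=lambda chunk: len(query_words & set(chunk.lower().split())),
        reverse=True,
    )
    return scored[:k]

screenplay_chunks = [
    "the blob sculpin drifts near the hydrothermal vent",
    "titles over black, distant sonar pings",
    "the divers descend through the kelp",
]
top = retrieve_context("blob sculpin near the vent", screenplay_chunks, k=1)
```

The retrieved passages would then be injected into the LLM's context before prompt generation, which is what reduced per-shot prompt overhead.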
Challenges included:
Lack of image-to-video availability in APIs, limiting automation
Reliance on frontend tools for generation reduced scriptability
Multi-stage editing degraded quality over repeated passes (pre-NanoBanana)
Learning and Looking Ahead
This workflow wasn’t easy to learn at first, largely because the tools were early in their evolution. Introducing natural language agents into the workflow made it much easier for artists to generate images using text prompts rather than relying on technical prompting, but outputs were not perfect, and models still needed careful shepherding. Since then, newer tools have begun collapsing steps and improving stability.
“Consistency is fragile without structural guardrails, which is why grounded control through 3D environments and reference imagery became so important.” - Khayyam Khan
Khan said that if he were to rebuild the pipeline today, his team would lean into agentic automations, invest more in 3D grounding, and explore video-to-video tools to directly guide shot motion. Across all of this, the central lesson remains: structure, control, and consistency are what make generative AI viable for production-grade work.