Multimodal AI

Future Optical Flow Prediction Improves Robot Control & Video Generation

Kanchana Ranasinghe, Honglu Zhou, Yu Fang, Luyu Yang, Le Xue, Ran Xu, Caiming Xiong, Silvio Savarese, Michael S Ryoo, Juan Carlos Niebles
Published: January 15, 2026
Authors: 10
Word Count: 13,362
Code: Includes code

Predicting future motion improves robotics and video creation.

Abstract

Future motion representations, such as optical flow, offer immense value for control and generative tasks. However, forecasting generalizable, spatially dense motion representations remains a key challenge, and learning such forecasting from noisy, real-world data is relatively unexplored. We introduce FOFPred, a novel language-conditioned optical flow forecasting model featuring a unified Vision-Language Model (VLM) and Diffusion architecture. This combination pairs strong multimodal reasoning with pixel-level generative fidelity for future motion prediction. Our model is trained on web-scale human activity data, a highly scalable but unstructured source. To extract meaningful signals from this noisy video-caption data, we employ crucial data preprocessing techniques together with our unified architecture and strong image pretraining. The resulting trained model is then extended to tackle two distinct downstream tasks in control and generation. Evaluations across robotic manipulation and video generation under language-driven settings establish the cross-domain versatility of FOFPred, confirming the value of a unified VLM-Diffusion architecture and scalable learning from diverse web data for future optical flow prediction.
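To make the core idea concrete: a predicted future optical flow field is a dense per-pixel displacement map, and once predicted it can be applied to the current frame to synthesize a future frame, which is why flow forecasting is useful for video generation and control. Below is a minimal, dependency-light sketch of that last step (flow-based warping) using NumPy. This is an illustrative example, not the paper's implementation; the function name `warp_with_flow` and the nearest-neighbor sampling are our own simplifications.

```python
import numpy as np

def warp_with_flow(frame: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Warp a frame with a dense optical flow field (backward warping).

    frame: (H, W, C) image.
    flow:  (H, W, 2) per-pixel (dx, dy) displacements; output pixel (y, x)
           samples frame[y - dy, x - dx].
    Nearest-neighbor sampling keeps the sketch dependency-free; real
    pipelines typically use bilinear interpolation.
    """
    h, w = frame.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.round(xs - flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.round(ys - flow[..., 1]).astype(int), 0, h - 1)
    return frame[src_y, src_x]

# Toy example: a constant flow of +1 pixel in x shifts content one pixel right.
frame = np.zeros((4, 4, 3), dtype=np.uint8)
frame[1, 1] = 255  # single bright pixel
flow = np.ones((4, 4, 2)) * np.array([1.0, 0.0])
future = warp_with_flow(frame, flow)  # bright pixel now at (1, 2)
```

In FOFPred's setting, the flow field would come from the model's language-conditioned prediction rather than being hand-specified as here.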

Key Takeaways

  1. Predicts future optical flow from a single frame and text.

  2. Enhances robot control and video generation realism.

  3. Utilizes a unified Vision-Language Model and Diffusion architecture.

Limitations

  • Requires large-scale web videos for training.

  • Complexity in handling noisy and dynamic data.

Keywords

Vision-Language Model, Diffusion architecture, optical flow forecasting, multimodal reasoning, pixel-level generative fidelity, web-scale human activity data, data preprocessing, image pretraining, robotic manipulation, video generation
