Call for Papers
Submitted papers are non-archival and will not be published in the CVPR Workshop Proceedings.
Important Dates
Submission deadline: April 20, 2026
Review deadline: May 1, 2026
Notification to authors: May 10, 2026
Camera-ready deadline: May 15, 2026
Topics of Interest
We invite submissions addressing various aspects of Video Large Language Models, including but not limited to the following areas:
Methods/Algorithms
Training MLLMs on video, including objective/reward design and efficient learning; extensions to predictive world-model objectives, VLA perception–language–action loops, and unified backbones that share representations across image, video, audio, and text.
Data Creation
Innovative strategies for leveraging web video data, advanced filtering techniques, synthetic data generation, long-video understanding datasets, and enriched datasets for video instruction tuning.
Evaluation and Analysis
Robust evaluation frameworks for existing models, focusing on improving interpretability, deriving novel insights, and introducing new metrics and benchmarks for VidLLMs.
Best Practices
Reproducible training/eval protocols, including closed-loop rollout training, horizon-aware curricula, and policy/plan logging for unified/VLA systems.
Applications
From classic CV tasks to simulation and robotics, planning/forecasting with world models, and action-grounded assistants powered by unified VidLLM stacks.
Comparison and Benchmarking
Systematic studies versus expert CV pipelines and among world-model, VLA, and unified approaches, measuring generalization, data/compute efficiency, and cost.
Limitations, Risks and Safety
Bias, fairness, and ethical challenges in VidLLMs, including factuality, hallucination, and safety; action safety and intervention; error compounding in long-horizon rollouts; and modality leakage in unified models.
Emerging Research Areas
Long-form video understanding and efficient high-resolution video tasks, video grounding LLMs, and fully unified multimodal architectures.
Vision-Language-Action (VLA)
Integrating perception-language-action loops for robotics, simulation, and embodied video reasoning.
World Models
Learning environment dynamics, future state prediction, and generative modeling of physically consistent reality.
Submission Guidelines
We accept both short (4 pages) and long (8 pages) papers.
Apart from page count, submissions should follow the CVPR format. For complete guidelines, please see the CVPR 2026 Author Instructions (available closer to the event).
Submission portal: OpenReview
Program Committee
We have a diverse program committee drawn from academia and industry. Each submission will receive at least two blind reviews.