Second Workshop on Video Large Language Models

Call for Papers

Submitted papers are non-archival and will not be published in the CVPR Workshop Proceedings.

Important Dates

Submission Deadline: April 20, 2026

Review Deadline: May 1, 2026

Notification to Authors: May 10, 2026

Camera-Ready Deadline: May 15, 2026

Topics of Interest

We invite submissions addressing various aspects of Video Large Language Models (VidLLMs), including but not limited to the following areas:

Methods/Algorithms

Training MLLMs on video, including objective/reward design and efficient learning; extensions to predictive world-model objectives, VLA perception-language-action loops, and unified backbones that share representations across image, video, audio, and text.

Data Creation

Innovative strategies for leveraging web video data, advanced filtering techniques, synthetic data generation, datasets for long video understanding, and enriched datasets for video instruction tuning.

Evaluation and Analysis

Robust evaluation frameworks for existing models, focusing on improving interpretability, deriving novel insights, and introducing new metrics and benchmarks for VidLLMs.

Best Practices

Reproducible training and evaluation protocols, including closed-loop rollout training, horizon-aware curricula, and policy/plan logging for unified and VLA systems.

Applications

From classic CV tasks to simulation and robotics, planning/forecasting with world models, and action-grounded assistants powered by unified VidLLM stacks.

Comparison and Benchmarking

Systematic studies comparing VidLLMs against expert CV pipelines, and among world-model, VLA, and unified approaches, measuring generalization, data/compute efficiency, and cost.

Limitations, Risks and Safety

Bias, fairness, and ethical challenges in VidLLMs, including factuality, hallucination, and safety concerns; action safety and intervention; error compounding in long-horizon rollouts; and modality leakage in unified models.

Emerging Research Areas

Long-form video understanding, efficient handling of high-resolution video, video-grounded LLMs, and fully unified multimodal architectures.

Vision-Language-Action (VLA)

Integrating perception-language-action loops for robotics, simulation, and embodied video reasoning.

World Models

Learning environment dynamics, future state prediction, and generative modeling of physically consistent reality.

Submission Guidelines

We accept both short (4 pages) and long (8 pages) papers.

Apart from page count, submissions should follow the CVPR format. Please see the complete guidelines in the CVPR 2026 Author Instructions (available closer to the event).

Submission portal: OpenReview

Program Committee

We have a diverse program committee drawn from academia and industry. Each submission will receive at least two blind reviews.