CoPE-VideoLM: Codec Primitives For Efficient Video Language Models

Sayan Deb Sarkar, Rémi Pautrat, Ondrej Miksik, Marc Pollefeys, Iro Armeni, Mahdi Rad, Mihai Dusmanu
Published February 13, 2026 · 7 authors · 14,361 words

CoPE-VideoLM uses video codec primitives for efficient, fast video language model inference.

Abstract

Video Language Models (VideoLMs) empower AI systems to understand temporal dynamics in videos. To fit within the maximum context window, current methods rely on keyframe sampling, which can miss both macro-level events and micro-level details due to sparse temporal coverage. Furthermore, processing full images and their tokens for each frame incurs substantial computational overhead. To address these limitations, we propose to leverage video codec primitives (specifically motion vectors and residuals), which natively encode video redundancy and sparsity without requiring expensive full-image encoding for most frames. To this end, we introduce lightweight transformer-based encoders that aggregate codec primitives and align their representations with image encoder embeddings through a pre-training strategy that accelerates convergence during end-to-end fine-tuning. Our approach reduces the time-to-first-token by up to 86% and token usage by up to 93% compared to standard VideoLMs. Moreover, by varying the keyframe and codec primitive densities, we are able to maintain or exceed performance on 14 diverse video understanding benchmarks spanning general question answering, temporal reasoning, long-form understanding, and spatial scene understanding.
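As a back-of-the-envelope illustration of the token savings (the per-frame token counts below are assumed for illustration, not the paper's exact accounting), a sketch of how replacing most full-frame encodings with compact codec-primitive tokens shrinks the visual prompt:

```python
# Illustrative token budget; all counts are assumed, not taken from the paper.
FRAMES = 64                # frames sampled from the clip
TOKENS_PER_IMAGE = 256     # visual tokens for a fully encoded keyframe (assumed)
TOKENS_PER_CODEC = 16      # tokens for a motion-vector/residual frame (assumed)

def token_budget(num_keyframes: int) -> int:
    """Total visual tokens when only `num_keyframes` are fully encoded
    and the remaining frames are represented by codec primitives."""
    codec_frames = FRAMES - num_keyframes
    return num_keyframes * TOKENS_PER_IMAGE + codec_frames * TOKENS_PER_CODEC

dense = token_budget(FRAMES)   # every frame fully encoded: 64 * 256 = 16384
sparse = token_budget(4)       # 4 keyframes + 60 codec frames: 4*256 + 60*16 = 1984
savings = 1 - sparse / dense   # ~0.879 with these assumed counts
print(dense, sparse, round(savings, 3))
```

With these toy numbers, a 4-keyframe configuration uses roughly 88% fewer tokens than dense encoding; the reported figure of up to 93% depends on the actual keyframe and codec-primitive densities used.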

Key Takeaways

  1. Video language models waste computation by encoding every frame as full RGB images instead of using sparse temporal representations.
  2. CoPE-VideoLM leverages video codec primitives like motion vectors and residuals to dramatically improve efficiency and reduce latency.
  3. Codec-based representations enable real-time video understanding for robotics and applications requiring fast time-to-first-token responses.
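To make the encoder idea concrete, here is a minimal NumPy sketch of aggregating one frame's motion vectors into an embedding that lives in the same space as the keyframe image embeddings. The paper uses lightweight transformer-based encoders; the mean-pooling plus linear projection below is a simplified stand-in, and the block count, vector layout, and embedding dimension are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_codec_frame(motion_vectors: np.ndarray, proj: np.ndarray) -> np.ndarray:
    """Toy stand-in for a lightweight codec-primitive encoder: mean-pool the
    per-block motion vectors, then linearly project into the image-embedding
    space so codec tokens and keyframe tokens share one representation."""
    pooled = motion_vectors.mean(axis=0)   # (4,) aggregate over macroblocks
    return pooled @ proj                   # (4,) @ (4, d) -> (d,)

# One frame's motion vectors: 120 macroblocks, each (src_x, src_y, dx, dy) (assumed layout).
mvs = rng.normal(size=(120, 4))
proj = rng.normal(size=(4, 32))            # assumed shared embedding dim d = 32
emb = encode_codec_frame(mvs, proj)
print(emb.shape)                           # (32,)
```

The point of the shared projection target is alignment: because codec-frame embeddings land in the same space as image-encoder embeddings, the language model can consume interleaved keyframe and codec tokens, which is what the pre-training alignment stage in the paper is designed to enable.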

Limitations

  • Current video models sample only 64 frames regardless of video length, missing temporal flow and motion dynamics.

  • Dense frame encoding creates bottlenecks in time-to-first-token latency, making real-time applications impractical.
