CoPE-VideoLM: Codec Primitives For Efficient Video Language Models

Sayan Deb Sarkar, Rémi Pautrat, Ondrej Miksik, Marc Pollefeys, Iro Armeni, Mahdi Rad, Mihai Dusmanu
Published February 13, 2026 · 7 authors · 14,361 words

CoPE-VideoLM uses video codec primitives for efficient, fast video language model inference.

Abstract

Video Language Models (VideoLMs) empower AI systems to understand temporal dynamics in videos. To fit within the maximum context window, current methods rely on keyframe sampling, which can miss both macro-level events and micro-level details due to sparse temporal coverage. Furthermore, processing full images and their tokens for each frame incurs substantial computational overhead. To address these limitations, we propose to leverage video codec primitives (specifically motion vectors and residuals), which natively encode video redundancy and sparsity without requiring expensive full-image encoding for most frames. To this end, we introduce lightweight transformer-based encoders that aggregate codec primitives and align their representations with image encoder embeddings through a pre-training strategy that accelerates convergence during end-to-end fine-tuning. Our approach reduces the time-to-first-token by up to 86% and token usage by up to 93% compared to standard VideoLMs. Moreover, by varying the keyframe and codec primitive densities, we are able to maintain or exceed performance on 14 diverse video understanding benchmarks spanning general question answering, temporal reasoning, long-form understanding, and spatial scene understanding.
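As a back-of-the-envelope illustration of the token savings (the per-frame token counts below are assumed for illustration, not the paper's exact accounting), a sketch of how replacing most full-frame encodings with compact codec-primitive tokens shrinks the visual prompt:

```python
# Illustrative token budget; all counts are assumed, not taken from the paper.
FRAMES = 64                # frames sampled from the clip
TOKENS_PER_IMAGE = 256     # visual tokens for a fully encoded keyframe (assumed)
TOKENS_PER_CODEC = 16      # tokens for a motion-vector/residual frame (assumed)

def token_budget(num_keyframes: int) -> int:
    """Total visual tokens when only `num_keyframes` are fully encoded
    and the remaining frames are represented by codec primitives."""
    codec_frames = FRAMES - num_keyframes
    return num_keyframes * TOKENS_PER_IMAGE + codec_frames * TOKENS_PER_CODEC

dense = token_budget(FRAMES)   # every frame fully encoded: 64 * 256 = 16384
sparse = token_budget(4)       # 4 keyframes + 60 codec frames: 4*256 + 60*16 = 1984
savings = 1 - sparse / dense   # ~0.879 with these assumed counts
print(dense, sparse, round(savings, 3))
```

With these toy numbers, a 4-keyframe configuration uses roughly 88% fewer tokens than dense encoding; the reported figure of up to 93% depends on the actual keyframe and codec-primitive densities used.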

Key Takeaways

  1. Video language models waste computation by encoding every frame as full RGB images instead of using sparse temporal representations.
  2. CoPE-VideoLM leverages video codec primitives like motion vectors and residuals to dramatically improve efficiency and reduce latency.
  3. Codec-based representations enable real-time video understanding for robotics and applications requiring fast time-to-first-token responses.
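To make the encoder idea concrete, here is a minimal NumPy sketch of aggregating one frame's motion vectors into an embedding that lives in the same space as the keyframe image embeddings. The paper uses lightweight transformer-based encoders; the mean-pooling plus linear projection below is a simplified stand-in, and the block count, vector layout, and embedding dimension are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_codec_frame(motion_vectors: np.ndarray, proj: np.ndarray) -> np.ndarray:
    """Toy stand-in for a lightweight codec-primitive encoder: mean-pool the
    per-block motion vectors, then linearly project into the image-embedding
    space so codec tokens and keyframe tokens share one representation."""
    pooled = motion_vectors.mean(axis=0)   # (4,) aggregate over macroblocks
    return pooled @ proj                   # (4,) @ (4, d) -> (d,)

# One frame's motion vectors: 120 macroblocks, each (src_x, src_y, dx, dy) (assumed layout).
mvs = rng.normal(size=(120, 4))
proj = rng.normal(size=(4, 32))            # assumed shared embedding dim d = 32
emb = encode_codec_frame(mvs, proj)
print(emb.shape)                           # (32,)
```

The point of the shared projection target is alignment: because codec-frame embeddings land in the same space as image-encoder embeddings, the language model can consume interleaved keyframe and codec tokens, which is what the pre-training alignment stage in the paper is designed to enable.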

Limitations

  • Current video models sample only 64 frames regardless of video length, missing temporal flow and motion dynamics.

  • Dense frame encoding creates bottlenecks in time-to-first-token latency, making real-time applications impractical.
