While Chain-of-Thought (CoT) reasoning significantly enhances the performance of Multimodal Large Language Models (MLLMs), its autoregressive nature incurs prohibitive latency. Current efforts to mitigate this cost via token compression often fail because they blindly apply text-centric importance metrics to multimodal contexts.
We identify a critical failure mode termed Visual Amnesia, where linguistically redundant tokens are erroneously pruned, severing the connection to the input image and leading to hallucinations. To address this, we introduce V-Skip, a novel framework that reformulates token pruning as a Visual-Anchored Information Bottleneck (VA-IB) optimization problem.
V-Skip employs a dual-path gating mechanism that weighs token importance through both linguistic surprisal and cross-modal attention flow. This allows the model to identify and rescue visually salient anchors that would otherwise be discarded. Extensive experiments on Qwen2-VL and Llama-3.2 families demonstrate that V-Skip achieves a 2.9× speedup with negligible accuracy loss, outperforming other baselines by over 30% on the DocVQA benchmark.
V-Skip reformulates token pruning as a Visual-Anchored Information Bottleneck (VA-IB) optimization problem, solved by a three-stage pipeline (Figure 2). Unlike standard text compression, which discards tokens based solely on text probability, V-Skip rescues visually salient tokens (anchors) to prevent hallucination.
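The dual-path gate can be sketched as follows. This is a minimal illustration under our own assumptions, not the paper's implementation: the function name, the top-k text path, and the quantile-based anchor threshold are all hypothetical stand-ins for the actual VA-IB optimization.

```python
import numpy as np

def dual_path_keep_mask(text_importance, cross_attn,
                        keep_ratio=0.5, anchor_quantile=0.8):
    """Sketch of dual-path gating: keep tokens the linguistic path ranks
    highly, plus visually salient anchors the text path would discard."""
    text_importance = np.asarray(text_importance, dtype=float)
    cross_attn = np.asarray(cross_attn, dtype=float)

    # Linguistic path: keep the top-k tokens by text-only importance
    # (e.g., surprisal under the language model).
    k = max(1, int(len(text_importance) * keep_ratio))
    text_keep = np.zeros(len(text_importance), dtype=bool)
    text_keep[np.argsort(text_importance)[-k:]] = True

    # Visual path: rescue tokens that receive unusually high attention
    # mass from image tokens, even if the text path discards them.
    anchors = cross_attn >= np.quantile(cross_attn, anchor_quantile)

    # A token survives if either path votes to keep it.
    return text_keep | anchors
```

The union of the two paths is what prevents Visual Amnesia in this sketch: a token with low text-side importance can still survive pruning if its cross-modal attention marks it as a visual anchor.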
We compare V-Skip against standard text-centric pruning methods such as LLMLingua-2. The DocVQA example below illustrates the Information Entropy Mismatch: in an invoice, the key value "$45.20" has high linguistic perplexity (it looks like a random number to the LLM) but is visually grounded in the image. Standard methods (e.g., LLMLingua-2) prune it, leading to a wrong answer, whereas V-Skip detects its high cross-modal attention and preserves the token, answering correctly.
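The invoice case can be replayed with toy numbers. Everything below is illustrative: the token scores, the top-3 budget, and the 0.5 anchor threshold are made up for the sketch, not measured from the paper.

```python
import numpy as np

# Toy scores for the invoice example (illustrative values only).
tokens          = ["Total", "amount", "due", ":", "$45.20"]
text_importance = np.array([0.7, 0.6, 0.5, 0.2, 0.1])  # text-only view
cross_attn      = np.array([0.2, 0.1, 0.1, 0.0, 0.9])  # attention from image

# Text-only pruning (LLMLingua-2 style): keep the top-3 tokens by
# text importance; "$45.20" scores lowest and is dropped.
text_keep = np.zeros(len(tokens), dtype=bool)
text_keep[np.argsort(text_importance)[-3:]] = True

# Dual-path gating: additionally rescue visual anchors
# (0.5 threshold assumed for the sketch).
dual_keep = text_keep | (cross_attn >= 0.5)

print([t for t, k in zip(tokens, text_keep) if k])
# -> ['Total', 'amount', 'due']            (answer token lost)
print([t for t, k in zip(tokens, dual_keep) if k])
# -> ['Total', 'amount', 'due', '$45.20']  (anchor rescued)
```

Under the text-only mask the answer token never reaches the reasoner, which is exactly the Visual Amnesia failure mode; the visual path restores it.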
@misc{zhangvskip,
title={Chain-of-Thought Compression Should Not Be Blind: V-Skip for Efficient Multimodal Reasoning via Dual-Path Anchoring},
author={Dongxu Zhang and Yiding Sun and Cheng Tan and Wenbiao Yan and Ning Yang and Jihua Zhu and Haijun Zhang},
year={2026},
eprint={2601.13879},
archivePrefix={arXiv},
primaryClass={cs.MM},
url={https://arxiv.org/abs/2601.13879},
}