Chain-of-Thought Compression Should Not Be Blind:
V-Skip for Efficient Multimodal Reasoning via Dual-Path Anchoring

1State Key Laboratory of Human-Machine Hybrid Augmented Intelligence, XJTU, 2CASIA, 3Shanghai AI Lab, 4HITSZ, 5USTB
Visual Amnesia Concept

V-Skip solves Visual Amnesia. Standard text compression (middle) blindly prunes visually essential tokens (e.g., "red"), causing hallucinations. V-Skip (bottom) preserves these visual anchors via Dual-Path scoring.

Abstract

While Chain-of-Thought (CoT) reasoning significantly enhances the performance of Multimodal Large Language Models (MLLMs), its autoregressive nature incurs prohibitive latency. Current efforts to mitigate this cost via token compression often fail because they blindly apply text-centric importance metrics to multimodal contexts.

We identify a critical failure mode termed Visual Amnesia, where linguistically redundant tokens are erroneously pruned, severing the connection to the input image and leading to hallucinations. To address this, we introduce V-Skip, a novel framework that reformulates token pruning as a Visual-Anchored Information Bottleneck (VA-IB) optimization problem.

V-Skip employs a dual-path gating mechanism that scores token importance along two axes: linguistic surprisal and cross-modal attention flow. This allows the model to identify and rescue visually salient anchors that would otherwise be discarded. Extensive experiments on the Qwen2-VL and Llama-3.2 families demonstrate that V-Skip achieves a 2.9× speedup with negligible accuracy loss, outperforming prior compression baselines by over 30% on the DocVQA benchmark.

Methodology

V-Skip reformulates token pruning as a Visual-Anchored Information Bottleneck (VA-IB) problem. The pipeline consists of three stages:

  • Stage 1: Data Generation using a frozen Teacher MLLM.
  • Stage 2: Dual-Path Filtering & Pruning, utilizing both Linguistic Surprisal and Visual Attention Flow.
  • Stage 3: Efficient Fine-tuning via LoRA distillation.
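Stage 2 is the heart of the method. As a rough sketch of how a dual-path gate could combine the two signals (the function names, the linear score combination, and every constant here are our assumptions for illustration, not the paper's exact formulation):

```python
def dual_path_scores(logprobs, cross_attn, alpha=0.5):
    """Blend linguistic surprisal (-log p) with cross-modal attention mass
    so visually anchored tokens survive pruning. Illustrative only."""
    return [alpha * (-lp) + (1.0 - alpha) * a
            for lp, a in zip(logprobs, cross_attn)]

def prune_cot(tokens, logprobs, cross_attn, keep_ratio=0.5):
    """Keep the top keep_ratio fraction of CoT tokens by dual-path score,
    preserving their original order."""
    scores = dual_path_scores(logprobs, cross_attn)
    k = max(1, round(len(tokens) * keep_ratio))
    keep = sorted(sorted(range(len(tokens)),
                         key=lambda i: scores[i], reverse=True)[:k])
    return [tokens[i] for i in keep]

# Toy numbers: "red" is linguistically cheap (low surprisal), so a
# text-only criterion would drop it; its image attention rescues it.
tokens     = ["the", "red", "car", "is", "parked"]
logprobs   = [-0.1, -0.05, -1.5, -0.2, -2.0]  # toy LM log-probs
cross_attn = [0.02, 0.90, 0.40, 0.01, 0.10]   # toy image-attention mass
print(prune_cot(tokens, logprobs, cross_attn, keep_ratio=0.6))
# → ['red', 'car', 'parked']
```

With `alpha=1.0` (surprisal only) the same call drops "red", which is exactly the Visual Amnesia failure the dual path is designed to prevent.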
V-Skip Architecture

As shown above (Figure 2), our method automates the construction of efficient multimodal reasoners. Unlike standard text compression, which discards tokens based only on their text-side probability, V-Skip rescues visually salient tokens (anchors) to prevent hallucination.
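For context, the classical Information Bottleneck compresses an input X into a representation Z that stays predictive of a target Y. One natural reading of the visual-anchored variant (a hedged sketch; the paper's exact objective, weights, and notation may differ) adds a term that ties Z to the visual evidence V:

```latex
\min_{p(z \mid x)} \; I(X; Z) \;-\; \beta \, I(Z; Y) \;-\; \gamma \, I(Z; V)
```

Here the $\gamma$-term penalizes compressions that sever the tokens' grounding in the image, which is precisely what blind text-centric pruning fails to account for.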

Qualitative Comparison

We compare V-Skip against standard text-centric pruning methods (e.g., LLMLingua-2). The example below (from DocVQA) demonstrates the Information Entropy Mismatch.

Qualitative comparison on DocVQA


In the invoice example, the key value "$45.20" has high linguistic perplexity (to a text-only LM it looks like a random number) but is visually grounded in the image. Standard methods such as LLMLingua-2 prune it, leading to a wrong answer; V-Skip detects the strong cross-modal attention and preserves the token, answering correctly.
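The failure and the rescue can be condensed into a few lines. All numbers and thresholds below are invented for illustration; only the qualitative pattern (rare-looking text, strong image attention) follows the example above:

```python
# Toy invoice example: "$45.20" looks like noise to a text-only LM,
# but draws heavy attention from the corresponding image region.
tokens     = ["total", "due", ":", "$45.20"]
token_prob = [0.60, 0.55, 0.70, 0.001]   # toy LM probabilities
cross_attn = [0.10, 0.05, 0.02, 0.85]    # toy image-attention mass

# Text-only filter: treat very-low-probability tokens as noise and drop them.
text_keep = [t for t, p in zip(tokens, token_prob) if p > 0.01]
print(text_keep)   # "$45.20" is gone -> wrong answer downstream

# Dual-path filter: a token survives if EITHER path marks it important.
dual_keep = [t for t, p, a in zip(tokens, token_prob, cross_attn)
             if p > 0.01 or a > 0.5]
print(dual_keep)   # "$45.20" is rescued by its visual anchor
```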

BibTeX

@misc{zhangvskip,
      title={Chain-of-Thought Compression Should Not Be Blind: V-Skip for Efficient Multimodal Reasoning via Dual-Path Anchoring}, 
      author={Dongxu Zhang and Yiding Sun and Cheng Tan and Wenbiao Yan and Ning Yang and Jihua Zhu and Haijun Zhang},
      year={2026},
      eprint={2601.13879},
      archivePrefix={arXiv},
      primaryClass={cs.MM},
      url={https://arxiv.org/abs/2601.13879}, 
}