
Token Fusion: Bridging Token Pruning and Merging for Efficient Vision Transformers

Analysis of Token Fusion (ToFu), a novel method combining token pruning and merging to reduce Vision Transformer computational cost while maintaining accuracy.

1. Introduction & Overview

Vision Transformers (ViTs) have revolutionized computer vision but suffer from high computational cost due to the quadratic complexity of self-attention with respect to the number of input tokens. This paper, Token Fusion: Bridging the Gap between Token Pruning and Token Merging, introduces Token Fusion (ToFu), a hybrid method that dynamically chooses between pruning and merging tokens based on model behavior to optimize efficiency-accuracy trade-offs.

The core insight is that neither pruning (discarding tokens) nor merging (averaging tokens) is universally optimal. The paper proposes a principled way to select the appropriate operation per layer, coupled with a novel merging technique called MLERP (Multi-token Linear intERPolation) to address distribution shift issues in standard average merging.

2. Core Methodology: Token Fusion (ToFu)

ToFu is built on the analysis of a model's response to interpolated inputs, determining its suitability for merging or pruning.

2.1. The Pruning vs. Merging Dilemma

The authors identify a key criterion: model linearity. If a model layer responds nearly linearly to interpolated inputs (i.e., $f(\alpha x_1 + (1-\alpha)x_2) \approx \alpha f(x_1) + (1-\alpha)f(x_2)$), merging similar tokens via averaging is effective and preserves information. However, in strongly non-linear layers (typically the earliest and final blocks, as visualized in the paper's Figure 1), linear interpolation in input space yields highly non-linear outputs, so average merging becomes unreliable and can shift the feature distribution. In such cases, pruning less important tokens is the safer, albeit lossier, alternative.
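The linearity criterion above can be checked numerically. A minimal sketch (a toy GELU MLP standing in for a transformer block; the function and parameter names are illustrative, not from the paper) that compares the output of an interpolated input against the interpolation of the outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, W1, W2):
    # Toy transformer MLP block: linear -> GELU -> linear.
    h = x @ W1
    gelu = 0.5 * h * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))
    return gelu @ W2

d, hidden = 16, 64
W1 = rng.normal(size=(d, hidden)) / np.sqrt(d)
W2 = rng.normal(size=(hidden, d)) / np.sqrt(hidden)

x1, x2 = rng.normal(size=d), rng.normal(size=d)
alpha = 0.5

# f(alpha*x1 + (1-alpha)*x2) vs. alpha*f(x1) + (1-alpha)*f(x2)
f_of_interp = mlp(alpha * x1 + (1 - alpha) * x2, W1, W2)
interp_of_f = alpha * mlp(x1, W1, W2) + (1 - alpha) * mlp(x2, W1, W2)

# Relative deviation: near zero means the layer behaves ~linearly here,
# so average merging is safe; a large value argues for pruning instead.
deviation = np.linalg.norm(f_of_interp - interp_of_f) / np.linalg.norm(interp_of_f)
print(deviation)
```

Averaging this deviation over many sampled token pairs gives a per-layer linearity estimate of the kind the paper uses to route between pruning and merging.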

2.2. The ToFu Framework

ToFu operates per transformer block:

  1. Token Importance Scoring: Assigns an importance score to each token (e.g., based on attention norm or gradient).
  2. Linearity Assessment: Evaluates the layer's approximate linearity, often derived empirically or via a lightweight probe.
  3. Adaptive Operation: For a target token reduction ratio:
    • In high-linearity regions: Merge the least important tokens with their most similar, important neighbors.
    • In low-linearity regions: Prune the least important tokens outright.

This creates a dynamic, context-aware compression pipeline.
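The per-block routing above can be sketched in a few lines. This is a simplified illustration, not the paper's implementation: the threshold value and the plain average merge are assumptions, and real ToFu merges with MLERP rather than averaging.

```python
import numpy as np

def reduce_tokens(tokens, importance, keep, linearity_dev, threshold=0.1):
    """Keep the `keep` most important tokens; for the remainder, either
    drop them (pruning) or fold each into its most similar kept token
    (merging), depending on the layer's measured linearity deviation.
    tokens: (N, d) array; importance: (N,) array."""
    order = np.argsort(-importance)
    kept_idx, rest_idx = order[:keep], order[keep:]
    kept = tokens[kept_idx].copy()
    if linearity_dev > threshold:
        return kept  # low linearity: prune the rest outright
    # high linearity: merge each remaining token into its nearest kept token
    for i in rest_idx:
        sims = kept @ tokens[i] / (
            np.linalg.norm(kept, axis=1) * np.linalg.norm(tokens[i]) + 1e-8
        )
        j = int(np.argmax(sims))
        kept[j] = 0.5 * (kept[j] + tokens[i])  # naive average merge (stand-in for MLERP)
    return kept
```

Either branch returns exactly `keep` tokens, so downstream blocks see a fixed reduced sequence length regardless of which operation was chosen.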

2.3. MLERP: Norm-Preserving Merging

To improve upon simple averaging, the authors propose MLERP, an adaptation of Spherical Linear Interpolation (SLERP) for merging $K$ tokens. For tokens $t_1, t_2, ..., t_K$ with norms $n_i = ||t_i||$, MLERP first interpolates directions on the unit sphere and then scales by a weighted average of the original norms:

$t_{\text{merged}} = \left( \frac{\sum_{i=1}^K w_i n_i}{\| \sum_{i=1}^K w_i \frac{t_i}{n_i} \|} \right) \left( \sum_{i=1}^K w_i \frac{t_i}{n_i} \right)$

where $w_i$ are importance-based weights. This preserves the statistical norm distribution of features, mitigating the distribution shift caused by naive averaging and leading to more stable performance, especially in non-linear regimes.
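A direct transcription of the MLERP formula above (assuming the weights are normalized to sum to one; the function name is ours):

```python
import numpy as np

def mlerp(tokens, weights):
    """Norm-preserving multi-token merge: interpolate unit directions,
    then rescale by the weighted average of the original norms.
    tokens: (K, d) array; weights: length-K importance weights."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    norms = np.linalg.norm(tokens, axis=1)                      # n_i = ||t_i||
    unit_sum = (weights[:, None] * tokens / norms[:, None]).sum(axis=0)
    target_norm = float(weights @ norms)                        # sum_i w_i n_i
    return target_norm * unit_sum / np.linalg.norm(unit_sum)
```

By construction, the merged token's norm equals the weighted average of the input norms, whereas a plain weighted average tends to shrink the norm when tokens point in different directions.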

3. Technical Details & Mathematical Formulation

The paper formalizes the token reduction problem. Let a layer have $N$ input tokens $T = \{t_1, ..., t_N\}$. The goal is to produce a reduced set $T'$ with $M < N$ tokens.

Key Equations:

  • Importance Score: $I(t_i) = ||\text{Attn}(t_i)||_1$ or a gradient-based measure.
  • Similarity Metric: Typically cosine similarity $S(t_i, t_j) = \frac{t_i \cdot t_j}{||t_i|| \, ||t_j||}$.
  • Linearity Metric ($\mathcal{L}$): Measured by the deviation of layer outputs from linear interpolation of inputs. A low $\mathcal{L}$ favors merging; a high $\mathcal{L}$ favors pruning.
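The scoring and similarity metrics above are straightforward to implement. A minimal sketch, where the attention-based importance is one plausible proxy (total attention mass a token receives) rather than the paper's exact definition:

```python
import numpy as np

def cosine_similarity(t_i, t_j):
    # S(t_i, t_j) = (t_i . t_j) / (||t_i|| ||t_j||)
    return float(t_i @ t_j / (np.linalg.norm(t_i) * np.linalg.norm(t_j)))

def attention_importance(attn):
    """Importance proxy: L1 norm of each token's attention column, i.e.
    how much attention mass the other tokens direct at it.
    attn: (N, N) post-softmax attention matrix (rows sum to 1)."""
    return np.abs(attn).sum(axis=0)
```

In practice these are computed batched over all tokens at once; the scalar versions here just make the formulas explicit.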

The ToFu algorithm can be applied to pre-trained models without fine-tuning (zero-shot) or enhanced with light training.

4. Experimental Results & Performance

The authors evaluate ToFu on image classification (ImageNet with ViT-B/16, DeiT) and image generation (latent diffusion models) tasks.

Key Performance Highlights

  • Classification: ToFu achieves a better accuracy vs. FLOPs trade-off than standalone pruning (e.g., DynamicViT) or merging (ToMe) methods. For example, at 40% FLOPs reduction, ToFu loses <0.5% top-1 accuracy on ImageNet, outperforming ToMe by ~0.3%.
  • Image Generation: In Stable Diffusion, ToFu maintains higher visual fidelity (measured by FID) at reduced computational cost compared to ToMe, especially when a large number of tokens is reduced. MLERP merging shows a clearer advantage in generation tasks, where the output distribution is critical.
  • Ablation: The adaptive strategy (choosing merge/prune) is shown to be superior to using either operation exclusively across all layers. MLERP consistently outperforms average merging.

Chart Description (Based on Paper's Figure 1): The figure illustrates the non-linearity of ViT layers. Two input feature points (x1, x2) are linearly interpolated (colored line). The outputs (f1-f4) from four different MLP layers inside the ViT are plotted. The early and late MLP outputs (f1, f4) show significant deviation from a straight line, indicating strong non-linearity. The average of the two inputs (purple star) maps to an output point far from the average of the outputs, visually demonstrating why average merging can fail in non-linear layers.

5. Analysis Framework & Case Example

Case: Applying ToFu to a Pre-trained ViT for Edge Deployment

Scenario: A developer needs to run a ViT-B model on a mobile device for real-time image classification. The full model is too slow.

Framework Application:

  1. Profiling: Run a small calibration dataset through the model. For each transformer block, compute the linearity metric $\mathcal{L}$ by sampling token pairs and checking output interpolation error.
  2. Strategy Map: Create a profile: Blocks 1-3 (low linearity) → prefer pruning. Blocks 4-8 (high linearity) → prefer MLERP merging. Final blocks (low linearity) → prefer pruning.
  3. Configuration: Set a global token reduction budget (e.g., 35%). Apply pruning in low-linearity blocks and MLERP merging in high-linearity blocks, respecting per-block budgets derived from the importance scores.
  4. Evaluation: Deploy the compressed model. The adaptive approach ensures minimal accuracy drop compared to a one-size-fits-all method, as it avoids aggressive merging in sensitive non-linear layers.

This example demonstrates ToFu's practical utility as a structured compression framework, not just a monolithic algorithm.
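The strategy map from the case above can be captured as a tiny configuration. A minimal sketch with invented, illustrative deviation numbers (not measurements from the paper) and an assumed 0.10 threshold:

```python
# Hypothetical per-block linearity deviations for a 12-block ViT-B,
# as would be measured on a small calibration set (illustrative values).
measured_devs = [0.18, 0.15, 0.12, 0.06, 0.05, 0.04,
                 0.05, 0.06, 0.09, 0.13, 0.16, 0.19]
THRESHOLD = 0.10  # above: prune (too non-linear); below: MLERP merge

plan = ["prune" if dev > THRESHOLD else "merge" for dev in measured_devs]
print(plan)
```

The resulting plan matches the profile described above: early and final blocks prune, while the middle blocks merge.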

6. Future Applications & Research Directions

  • Multimodal Transformers: Extending ToFu to video, audio, or multimodal (e.g., CLIP, Flamingo) transformers where token dynamics are more complex.
  • Hardware-Aware Co-design: Optimizing the ToFu decision algorithm (prune/merge) and MLERP implementation for specific AI accelerators (NPUs, GPUs) to maximize real speedup.
  • Integration with Other Techniques: Combining ToFu with quantization, knowledge distillation, or efficient attention mechanisms (like Linformer) for compounded efficiency gains.
  • Automated Hyperparameter Search: Using neural architecture search (NAS) or reinforcement learning to automatically determine the optimal per-layer pruning/merging ratio and linearity threshold.
  • Beyond Vision: Exploring its efficacy in Large Language Models (LLMs) for sequence compression, though the token semantics differ significantly.

7. References

  1. Dosovitskiy, A., et al. "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale." ICLR 2021.
  2. Bolya, D., et al. "Token Merging: Your ViT But Faster." ICLR 2023 (ToMe).
  3. Rao, Y., et al. "DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification." NeurIPS 2021.
  4. Rombach, R., et al. "High-Resolution Image Synthesis with Latent Diffusion Models." CVPR 2022.
  5. Zhu, J.Y., et al. "Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks." ICCV 2017 (CycleGAN).
  6. Vaswani, A., et al. "Attention Is All You Need." NeurIPS 2017.

8. Expert Analysis & Critical Insights

Core Insight: ToFu isn't just another compression tool; it's a formal recognition that transformer layers are heterogeneous. Treating all layers with the same compression primitive is naive. The paper's brilliance lies in its diagnostic approach: measuring layer linearity to prescribe the right "surgery" (prune or merge). This is reminiscent of how modern compilers profile code to apply optimizations, a level of sophistication often missing in ML efficiency research.

Logical Flow: The argument is compelling: 1) Show average merging fails in non-linear layers (Fig. 1). 2) Propose a metric to detect this failure mode (linearity). 3) Use the metric to route tokens. 4) Fix the failing operation (average merge) with MLERP. The flow from problem identification to a multi-component solution is clean and logical.

Strengths & Flaws:
Strengths: The hybrid approach is theoretically sound and empirically validated across tasks. MLERP is a simple yet clever fix to a real problem (norm collapse). The zero-shot applicability is a major practical advantage for deploying existing models.
Flaws: The paper slightly undersells the overhead of the "linearity assessment." Is it a pre-computed profile (static) or computed on the fly (dynamic overhead)? The benefits of MLERP, while clear, appear modest in classification; its true value is more pronounced in generative tasks, aligning with findings from the diffusion model literature, where the output distribution is paramount. The comparison, while fair, could be more aggressive against state-of-the-art post-training quantization methods, which offer orthogonal benefits.

Actionable Insights: For practitioners: Immediately adopt ToFu/MLERP as your first-line token reduction method for ViTs, especially for generative tasks. It supersedes ToMe as the default merging strategy. For researchers: The "layer-aware compression" paradigm is the key takeaway. Future work should focus on automating the detection of compression-friendly vs. compression-sensitive model regions, perhaps drawing inspiration from work on network pruning in CNNs or the analysis of mode collapse in GANs like CycleGAN. The next frontier is building models that are inherently efficient by design, using insights from such diagnostic studies to inform architecture search, moving beyond mere post-hoc compression.