Vision Foundation Models (VFMs) have demonstrated outstanding performance on numerous downstream tasks. However, due to their inherent representation biases originating from different training paradigms, VFMs exhibit advantages and disadvantages across distinct vision tasks. Although amalgamating the strengths of multiple VFMs for downstream tasks is an intuitive strategy, effectively exploiting these biases remains a significant challenge. In this paper, we propose a novel and versatile "Swiss Army Knife" (SAK) solution, which adaptively distills knowledge from a committee of VFMs to enhance multi-task learning. Unlike existing methods that use a single backbone for knowledge transfer, our approach preserves the unique representation bias of each teacher by coupling lightweight Teacher-Specific Adapter Path modules with a shared Teacher-Agnostic Stem. By dynamically selecting and combining representations with Mixture-of-Representations Routers, SAK synergizes the complementary strengths of multiple VFMs. Extensive experiments show that SAK remarkably outperforms the prior state of the art in multi-task learning by 10% on the NYUD-v2 benchmark, while also providing a flexible and robust framework that readily accommodates more advanced model designs.
VFMs are pretrained on diverse datasets, image resolutions, and objectives, which introduces representation biases when they are applied as feature extractors for downstream tasks. Our quantitative and qualitative results reveal that these inherent biases yield both advantages and disadvantages across different tasks, with no single model achieving consistently superior performance across all domains. These findings highlight the difficulty of achieving comprehensive improvements in multi-task learning with any single VFM, underscoring the need to utilize multiple VFMs collaboratively and exploit their complementary strengths.
Recent works distill multiple VFM teachers into a single student model. However, this many-to-one distillation risks eliminating the representation biases of the VFM teachers, potentially limiting the student's ability to capitalize on their individual strengths for specific tasks. Our pilot study shows that a student trained by many-to-one distillation does not consistently surpass its teachers on their respective proficient tasks.
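To make the contrast concrete, a typical many-to-one objective projects the single student representation onto each teacher's feature space and averages the alignment losses; because every teacher is matched against the same student features, their distinct biases get blended together. The following is a minimal sketch of such an objective; the function name, per-teacher projector heads, and MSE alignment are our illustrative assumptions, not the formulation of any specific prior method.

```python
import torch.nn.functional as F

def many_to_one_distill_loss(student_feats, teacher_feats_list, projectors):
    """Average feature-alignment loss over all teachers (illustrative only).

    student_feats:      (B, N, D) features from the single student backbone
    teacher_feats_list: list of (B, N, D_t) features, one per VFM teacher
    projectors:         per-teacher heads mapping student features to D_t
    """
    loss = 0.0
    for t_feats, proj in zip(teacher_feats_list, projectors):
        # Every teacher pulls on the SAME student representation,
        # so teacher-specific biases are averaged away.
        loss = loss + F.mse_loss(proj(student_feats), t_feats)
    return loss / len(teacher_feats_list)
```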
SAK's framework incorporates a shared Teacher-Agnostic Stem (TAS) alongside multiple Teacher-Specific Adapter Path (TSAP) modules, which produce specialized representations aligned with their corresponding VFM teachers. The TAS captures universal knowledge, while the TSAP modules accommodate the heterogeneous representation biases of the teachers, explicitly learning their diverse model characteristics. To amalgamate the committee's expertise, we treat each group of representations as a knowledgeable expert and design a Mixture-of-Representations (MoR) Router, which dynamically weighs and combines the most relevant representations, bridging the gap between general-purpose knowledge and task-specific characteristics, as sketched below.
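Below is a minimal PyTorch sketch of how these components could fit together, assuming a ViT-style stem that emits token features of dimension `dim`. All class names, tensor shapes, the bottleneck adapter structure, and the softmax gate are our illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TSAP(nn.Module):
    """Teacher-Specific Adapter Path: a lightweight bottleneck that steers
    shared stem features toward one teacher's representation space."""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.adapter = nn.Sequential(
            nn.Linear(dim, bottleneck), nn.GELU(), nn.Linear(bottleneck, dim)
        )

    def forward(self, x):            # x: (B, N, dim) tokens from the stem
        return x + self.adapter(x)   # residual keeps the shared knowledge

class MoRRouter(nn.Module):
    """Mixture-of-Representations Router: predicts per-token weights over the
    teacher-specific representations and mixes them for one task."""
    def __init__(self, dim, num_teachers):
        super().__init__()
        self.gate = nn.Linear(dim, num_teachers)

    def forward(self, stem_feat, teacher_feats):
        # teacher_feats: (B, N, T, dim) stacked outputs of the T adapter paths
        w = F.softmax(self.gate(stem_feat), dim=-1)          # (B, N, T)
        return torch.einsum('bnt,bntd->bnd', w, teacher_feats)

class SAK(nn.Module):
    """Shared stem + per-teacher adapter paths + per-task routers (a sketch)."""
    def __init__(self, stem, dim, num_teachers, num_tasks):
        super().__init__()
        self.stem = stem             # Teacher-Agnostic Stem (assumed ViT-like)
        self.paths = nn.ModuleList([TSAP(dim) for _ in range(num_teachers)])
        self.routers = nn.ModuleList(
            [MoRRouter(dim, num_teachers) for _ in range(num_tasks)]
        )

    def forward(self, images):
        shared = self.stem(images)                           # (B, N, dim)
        per_teacher = torch.stack(
            [path(shared) for path in self.paths], dim=2     # (B, N, T, dim)
        )
        # One routed representation per task, fed to that task's head.
        return [router(shared, per_teacher) for router in self.routers]
```

Under this reading, each adapter path's output would be matched against its corresponding teacher's features during distillation (e.g., with an alignment loss like the one sketched earlier), preserving rather than blending the teachers' biases; the routers are then trained with task supervision so that each task draws on the specialized representations it benefits from most.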
We evaluate SAK on two widely used multi-task benchmarks, PASCAL-Context and NYUD-v2, showing that it remarkably outperforms previous multi-teacher VFM distillation methods and state-of-the-art multi-task models in both performance and robustness.
@inproceedings{lu2025swiss,
title={Swiss Army Knife: Synergizing Biases in Knowledge from Vision Foundation Models for Multi-Task Learning},
author={Yuxiang Lu and Shengcao Cao and Yu-Xiong Wang},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025}
}