This research paper shows that Large Language Models (LLMs), despite being trained solely on textual data, are surprisingly effective encoders for purely visual tasks. The key idea is to take a frozen transformer block from a pre-trained LLM (e.g., LLaMA) and use it to process visual tokens directly, with trainable linear layers aligning feature dimensions on either side. This simple addition improves performance across a diverse range of tasks, including pure 2D and 3D visual recognition (e.g., image and point-cloud classification), temporal modeling (e.g., action recognition), non-semantic tasks (e.g., motion forecasting), and multi-modal tasks (e.g., 2D/3D visual question answering and image-text retrieval). The paper also proposes an information filtering hypothesis to explain the effect: the pre-trained transformer block discerns the informative visual tokens and amplifies their contribution.
Publication date: 19 Oct 2023
Project Page: https://github.com/ziqipang/LM4VisualEncoding
Paper: https://arxiv.org/pdf/2310.12973
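
Below is a minimal PyTorch sketch of the architecture described above, not the authors' implementation: a visual backbone produces tokens, a trainable linear layer projects them into the LLM's hidden dimension, a frozen transformer block processes them, and a second trainable projection feeds a task head. A stand-in patchify backbone and a generic `nn.TransformerEncoderLayer` are used here for self-containment; in the paper's setting the block's weights would come from a pre-trained LLM such as LLaMA and stay frozen.

```python
import torch
import torch.nn as nn


class FrozenLMVisualEncoder(nn.Module):
    """Sketch: visual tokens -> trainable proj -> frozen LLM-style block -> trainable proj -> head."""

    def __init__(self, vis_dim=768, llm_dim=4096, num_classes=1000):
        super().__init__()
        # Stand-in visual backbone: in practice a ViT producing (B, N, vis_dim) patch tokens.
        self.visual_backbone = nn.Sequential(
            nn.Conv2d(3, vis_dim, kernel_size=16, stride=16),  # patchify 224x224 -> 14x14 tokens
            nn.Flatten(2),                                      # (B, vis_dim, N)
        )
        self.proj_in = nn.Linear(vis_dim, llm_dim)              # trainable: visual dim -> LLM hidden dim
        # Stand-in for one transformer block; its weights would be copied from a pre-trained LLM.
        self.llm_block = nn.TransformerEncoderLayer(d_model=llm_dim, nhead=32, batch_first=True)
        for p in self.llm_block.parameters():
            p.requires_grad = False                             # keep the LLM block frozen
        self.proj_out = nn.Linear(llm_dim, vis_dim)             # trainable: project back for the head
        self.head = nn.Linear(vis_dim, num_classes)             # task-specific head (classification here)

    def forward(self, images):
        tokens = self.visual_backbone(images).transpose(1, 2)   # (B, N, vis_dim) visual tokens
        x = self.proj_in(tokens)                                 # (B, N, llm_dim)
        x = self.llm_block(x)                                    # frozen block processes visual tokens
        x = self.proj_out(x)                                      # (B, N, vis_dim)
        return self.head(x.mean(dim=1))                           # pool over tokens and classify


model = FrozenLMVisualEncoder()
logits = model(torch.randn(2, 3, 224, 224))  # -> shape (2, 1000)
```

During training, only the visual backbone, the two projections, and the head receive gradient updates; the inserted block stays fixed, which is what makes the observed gains attributable to the pre-trained language-model weights themselves.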