This research paper reveals that Large Language Models (LLMs), despite being trained solely on textual data, are surprisingly effective encoders for purely visual tasks. The key idea is to take a frozen transformer block from a pre-trained LLM and use it as an encoder layer that directly processes visual tokens, with only lightweight trainable projections around it. This simple addition yields consistent improvements across a diverse range of tasks, including pure 2D and 3D visual recognition (e.g., image and point-cloud classification), temporal modeling (e.g., action recognition), non-semantic tasks (e.g., motion forecasting), and multi-modal tasks (e.g., VQA and image-text retrieval). The paper also proposes the information filtering hypothesis to explain this effectiveness: the pre-trained LLM block discerns the informative visual tokens and amplifies their contribution.
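
Below is a minimal PyTorch sketch of this recipe: visual tokens from a backbone (e.g., a ViT) pass through a trainable linear projection, a frozen transformer block, a second trainable projection, and a task head. The class name `FrozenLLMBlockHead`, the dimensions, and the use of `nn.TransformerEncoderLayer` as a stand-in for a pre-trained LLaMA layer are illustrative assumptions, not the authors' actual code (see the project page for that).

```python
import torch
import torch.nn as nn


class FrozenLLMBlockHead(nn.Module):
    """Sketch: visual tokens -> trainable linear -> frozen transformer block
    -> trainable linear -> task head. In the paper the frozen block is a
    pre-trained LLaMA layer; here a stock nn.TransformerEncoderLayer stands
    in so the example is self-contained."""

    def __init__(self, vis_dim=768, llm_dim=4096, num_classes=1000):
        super().__init__()
        self.proj_in = nn.Linear(vis_dim, llm_dim)            # trainable
        self.llm_block = nn.TransformerEncoderLayer(           # stand-in for a
            d_model=llm_dim, nhead=32, batch_first=True)       # frozen LLM layer
        for p in self.llm_block.parameters():                  # frozen, as in the paper
            p.requires_grad = False
        self.proj_out = nn.Linear(llm_dim, vis_dim)            # trainable
        self.head = nn.Linear(vis_dim, num_classes)            # task head

    def forward(self, vis_tokens):                             # (B, N, vis_dim)
        x = self.proj_in(vis_tokens)
        x = self.llm_block(x)
        x = self.proj_out(x)
        return self.head(x.mean(dim=1))                        # mean-pool tokens


tokens = torch.randn(2, 197, 768)        # e.g., ViT-B/16 patch tokens (illustrative)
logits = FrozenLLMBlockHead()(tokens)
print(logits.shape)                       # torch.Size([2, 1000])
```

Only the projections and the head are trained; the stand-in block stays frozen, mirroring the paper's setup where the LLM weights are never updated.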


Publication date: 19 Oct 2023
Project Page: https://github.com/ziqipang/LM4VisualEncoding
Paper: https://arxiv.org/pdf/2310.12973