This article introduces Representation Engineering (RepE), an emerging approach to AI transparency that draws on insights from cognitive neuroscience. Rather than analyzing individual neurons or circuits, RepE treats population-level representations as the unit of analysis, which yields new methods for monitoring and manipulating high-level cognitive phenomena in deep neural networks. The authors show how RepE can be applied to a range of safety-relevant problems in large language models, including honesty, harmlessness, and power-seeking. They hope the work encourages further exploration of RepE and further progress in AI transparency and safety.
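
To make "monitoring and manipulating" population-level representations concrete, here is a minimal toy sketch in Python. It is not the authors' method or code: it assumes synthetic random vectors in place of real model activations, uses a simple difference-of-means direction as a stand-in for whatever the paper actually extracts, and the names `reading_vector`, `monitor`, and `steer` are hypothetical, chosen only for illustration.

```python
import numpy as np


def reading_vector(pos_acts, neg_acts):
    """Unit direction separating two activation sets (difference of means).

    Assumption: a difference of means stands in for the concept direction.
    """
    direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)


def monitor(acts, direction):
    """Project activations onto the direction; higher score = more of the concept."""
    return acts @ direction


def steer(acts, direction, alpha):
    """Shift every activation along the direction by a strength of alpha."""
    return acts + alpha * direction


# Synthetic stand-in for hidden states: two clusters that differ along one axis.
rng = np.random.default_rng(0)
dim = 16
concept = np.zeros(dim)
concept[0] = 1.0
pos = rng.normal(size=(200, dim)) + 2.0 * concept  # e.g. "concept present" prompts
neg = rng.normal(size=(200, dim)) - 2.0 * concept  # e.g. "concept absent" prompts

v = reading_vector(pos, neg)
scores_pos = monitor(pos, v)             # monitoring: read the concept off the activations
scores_neg = monitor(neg, v)
steered_neg = steer(neg, v, alpha=4.0)   # manipulation: push activations toward the concept
```

In this toy setting, `monitor` separates the two clusters by their projection onto `v`, and `steer` raises every projection by exactly `alpha` because `v` has unit length. With a real model, the activations would come from a chosen layer's hidden states rather than from a random generator.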

Publication date: 2 Oct 2023
Project Page: https://github.com/andyzoujm/representation-engineering
Paper: https://arxiv.org/pdf/2310.01405