The article discusses the role of large language models (LLMs) in chemistry, focusing on tasks relevant to drug discovery and materials science. Despite their remarkable capabilities on natural language processing tasks, LLMs have so far performed poorly on chemistry tasks. The authors show, however, that their fine-tuned models can substantially outperform GPT-4 across all evaluated chemistry tasks. The key to this success is SMolInstruct, a large-scale, high-quality instruction tuning dataset covering 14 meticulously selected chemistry tasks with over three million samples. Using SMolInstruct, they fine-tune a set of open-source LLMs, among which Mistral proves to be the best base model for chemistry tasks.
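Instruction tuning works by wrapping each raw task sample (e.g. a molecule representation and its target property or name) in a natural-language query/response pair before fine-tuning. As a minimal sketch of that preprocessing step, assuming a hypothetical template and field names (not SMolInstruct's actual format):

```python
# Sketch of building instruction-tuning samples for a chemistry task.
# The template and field names here are illustrative assumptions,
# not the actual SMolInstruct prompt format.

def build_sample(task_instruction: str, model_input: str, output: str) -> dict:
    """Wrap a raw (input, output) pair in an instruction-style prompt."""
    prompt = f"{task_instruction}\n\nInput: {model_input}\nOutput:"
    return {"prompt": prompt, "completion": f" {output}"}

# Hypothetical example: converting a SMILES string to a molecular formula.
sample = build_sample(
    "Convert the following SMILES string to its molecular formula.",
    "CCO",
    "C2H6O",
)
print(sample["prompt"])
print(sample["completion"])
```

During fine-tuning, the loss is typically computed only on the completion tokens, so the model learns to answer the instruction rather than to reproduce it.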
Publication date: 15 Feb 2024
Project Page: https://osu-nlp-group.github.io/LLM4Chem/
Paper: https://arxiv.org/pdf/2402.09391