Large Language Chemistry Dataset and Model
This project aims to create an extensive chemistry dataset and use it to train large language models (LLMs) for a wide range of chemistry applications.
The dataset itself will be valuable for downstream NLP research focused on chemistry. Use cases for the model include generating chemical data (e.g., molecular structures, properties, and reactions) from text and text from chemical data; generating code for computational chemistry applications; and predicting properties of materials and molecules (i.e., classification and regression tasks) as well as chemical reactions. The applications should not be limited to these, however: we specifically want to look for new emergent behaviors that are not present in smaller LMs or in LMs trained on less chemistry data.
The resulting models should also be investigated with different prompting and sampling strategies to examine their capabilities. A capable chemistry LLM has potential use cases ranging from a tool that helps experts generate new ideas to one that lets non-experts explore new chemistry topics. It could also be helpful in education as an intelligent learning assistant.
Join the discussion on this project in our #chemnlp channel on our Discord server.