
Open-source large-scale foundation model for chemistry

Date
March 24, 2025

Large-scale pre-training methodologies for chemical language models have revolutionized the field of cheminformatics, offering significant advancements in tasks such as molecular property prediction and molecule generation. These models leverage self-supervised learning to derive contextualized representations of input tokens by training on large, unlabeled molecular datasets. Typically, the training process consists of two stages: pre-training on large, unannotated chemical corpora, followed by fine-tuning on domain-specific tasks. This approach reduces the reliance on costly annotated datasets and enhances the model's capacity to generalize across a broader spectrum of chemical representations.
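To make the two-stage recipe concrete, the sketch below shows a minimal, self-contained version of it in PyTorch: a masked-token pre-training step on unlabeled SMILES, then supervised fine-tuning of a property head on a small labeled set. It is an illustration only; the released models are encoder-decoder and their exact objective, tokenizer, and dimensions differ, so the tiny vocabulary, random tensors standing in for data, and the encoder-only architecture here are all assumptions made for brevity.

```python
import torch
import torch.nn as nn

# Toy vocabulary of SMILES tokens; a real tokenizer is far larger.
VOCAB = ["<pad>", "<mask>", "C", "c", "O", "N", "(", ")", "=", "1", "2"]
VOCAB_SIZE, D_MODEL, MAX_LEN = len(VOCAB), 64, 32


class SmilesEncoder(nn.Module):
    """Small transformer encoder over SMILES token IDs (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
        self.pos = nn.Embedding(MAX_LEN, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, ids):                        # ids: (batch, seq)
        pos = torch.arange(ids.size(1), device=ids.device)
        h = self.embed(ids) + self.pos(pos)
        return self.encoder(h)                     # (batch, seq, d_model)


# Stage 1: self-supervised pre-training with a masked-token objective.
encoder = SmilesEncoder()
mlm_head = nn.Linear(D_MODEL, VOCAB_SIZE)          # predicts the original token
ids = torch.randint(2, VOCAB_SIZE, (8, MAX_LEN))   # stand-in for tokenized SMILES
masked = ids.clone()
mask = torch.rand(ids.shape) < 0.15                # corrupt ~15% of positions
masked[mask] = 1                                   # index of "<mask>"
logits = mlm_head(encoder(masked))
pretrain_loss = nn.functional.cross_entropy(
    logits[mask], ids[mask])                       # reconstruct only masked tokens

# Stage 2: supervised fine-tuning of a property-prediction head on labeled data.
prop_head = nn.Linear(D_MODEL, 1)
labels = torch.randn(8, 1)                         # stand-in for measured properties
pooled = encoder(ids).mean(dim=1)                  # mean-pool token embeddings
finetune_loss = nn.functional.mse_loss(prop_head(pooled), labels)
```

In practice the pre-trained encoder weights are reused at stage 2, so only the small task head (and optionally the encoder) is updated on the annotated data.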

Here, we introduce a novel family of large-scale encoder-decoder chemical foundation models, pre-trained on a dataset of 91 million SMILES samples curated from PubChem. This dataset encompasses approximately 4 billion molecular tokens, allowing the model to capture an extensive range of chemical diversity. Our pre-training strategy focuses on maximizing the model's ability to encode structural and functional aspects of molecules, ensuring that it generalizes effectively to a wide range of downstream tasks. We present two main variants of the model: a base version with 289 million parameters and a Mixture-of-Experts version with 8×289M parameters, providing flexibility for different use cases. Evaluated across multiple benchmark datasets, these models demonstrate state-of-the-art performance on a range of tasks, including quantum-property and reaction-yield prediction.
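The publication does not spell out the curation pipeline here, but a typical pass over PubChem SMILES involves canonicalization, removal of unparsable strings, and deduplication, with token counts taken over a regex-based SMILES tokenizer. The following RDKit sketch illustrates that generic procedure; it is an assumption about the kind of preprocessing involved, not the authors' exact pipeline, and the example strings are placeholders.

```python
import re
from rdkit import Chem

# Regex-style SMILES tokenizer commonly used for chemical language models:
# it splits bracket atoms, two-letter elements, bonds, ring closures, etc.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|"
    r"\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def curate(smiles_list):
    """Canonicalize, deduplicate, and drop unparsable SMILES strings."""
    seen, kept = set(), []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:                    # skip strings RDKit cannot parse
            continue
        canonical = Chem.MolToSmiles(mol)  # one canonical string per molecule
        if canonical not in seen:
            seen.add(canonical)
            kept.append(canonical)
    return kept


raw = ["C1=CC=CC=C1", "c1ccccc1", "CCO", "not_a_smiles"]
curated = curate(raw)                       # benzene is kept only once
n_tokens = sum(len(SMILES_TOKEN_PATTERN.findall(s)) for s in curated)
print(curated, n_tokens)
```

Applied at PubChem scale, a pass like this yields the kind of deduplicated corpus and token count (91M molecules, roughly 4B tokens) reported above.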

A key aspect of this work is the exploration of the model's latent embedding space. We present a preliminary assessment of its compositionality, a critical feature for reasoning-based tasks. The latent space demonstrates improved separability compared to state-of-the-art models, facilitating few-shot learning scenarios where minimal training data is available. This capability is especially valuable in chemical research, where adaptability and rapid learning from small datasets are essential. To support the wider research community, we are releasing the model weights on Hugging Face: https://huggingface.co/ibm/materials.smi-ted. Additionally, the codebase is available on GitHub: https://github.com/IBM/material.
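As a concrete picture of why separable embeddings help in the few-shot regime, the sketch below fits a lightweight scikit-learn probe on only five labeled examples per class. The embeddings here are synthetic, well-separated clusters standing in for molecule vectors; in practice each row would be the latent vector the released encoder assigns to a molecule (the 768-dimensional width is an assumption, not a documented model detail).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for molecule embeddings from a separable latent space.
rng = np.random.default_rng(0)
dim = 768                                            # assumed embedding width
class_a = rng.normal(loc=+1.0, scale=0.5, size=(100, dim))
class_b = rng.normal(loc=-1.0, scale=0.5, size=(100, dim))
X = np.vstack([class_a, class_b])
y = np.array([1] * 100 + [0] * 100)

# Few-shot setting: fit a small linear probe on 5 labeled molecules per class.
few_shot_idx = np.concatenate([np.arange(5), np.arange(100, 105)])
probe = LogisticRegression(max_iter=1000)
probe.fit(X[few_shot_idx], y[few_shot_idx])

# Evaluate on the held-out molecules; separable embeddings make this easy.
held_out = np.setdiff1d(np.arange(200), few_shot_idx)
print("held-out accuracy:", probe.score(X[held_out], y[held_out]))
```

The better the pre-trained latent space separates chemically meaningful classes, the less labeled data such a probe needs, which is the few-shot advantage described above.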

Related Products

AI-assisted Workbench for Material Discovery
In our evolving society, many problems such as climate change, sustainable energy systems, pandemics, and others require faster advances…
GMO-Mat: A foundation model-based generative multi-objective optimization framework for materials discovery
Chemical Foundation Models (FM) have successfully supported the creation of state-of-the-art property predictors, which evidence the quality of the corresponding latent space representations created by these models…
Expanded dataset for improved prediction of chemical biodegradability
Biodegradability is a crucial factor in assessing the long-term impact of chemicals on the environment. However, experimental testing to determine biodegradability is time-consuming and laborious…
Modeling electron density grids using a 3D VQ-GAN approach
To convert 3D electron density grids into meaningful latent representations, vector quantized autoencoders have proven effective, particularly in addressing the blurriness typical of traditional variational autoencoders…