
MoLMamba: A large state-space-based foundation model for Chemistry

Date
August 20, 2024

Chemical foundation models (FMs) have emerged as potent catalysts for scientific discovery, leveraging extensive pretraining on vast unlabeled datasets. Typically built upon sequence architectures like Transformers, these models excel at processing a wide array of inputs, ranging from SMILES strings to 3D images.

In this research, we present MoLMamba, a novel foundation model pretrained on a curated dataset of 91 million SMILES strings (equivalent to 4.3 billion tokens) extracted from PubChem. Diverging from conventional Transformer architectures, MoLMamba adopts a state-space approach, offering advantages such as faster inference and linear scaling with sequence length. Even when confronted with sequences spanning billions of tokens, MoLMamba maintains robust performance.
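The linear-scaling advantage comes from the recurrent form of state-space layers: each token updates a fixed-size hidden state in a single pass, so cost grows as O(L) in sequence length rather than the O(L²) pairwise interactions of self-attention. The sketch below illustrates this with a minimal diagonal state-space scan; it is illustrative only, all names are hypothetical, and it omits the input-dependent (selective) parameterization and hardware-aware scan used by Mamba-style models.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Linear-time scan of a simple diagonal state-space model (illustrative).

    Recurrence (A, B act elementwise on the diagonal state):
        h_t = A * h_{t-1} + B * x_t
        y_t = C . h_t

    One pass over the sequence: O(L * d_state) in sequence length L,
    versus the O(L^2) token-pair interactions of self-attention.
    """
    d_state = A.shape[0]
    h = np.zeros(d_state)          # fixed-size hidden state
    ys = []
    for x_t in x:                  # single left-to-right pass
        h = A * h + B * x_t        # recurrent state update
        ys.append(C @ h)           # linear readout
    return np.array(ys)

# Tiny worked example: with A = 1 the state accumulates B * x_t,
# so outputs are cumulative sums weighted by B and C.
A = np.array([1.0, 1.0])
B = np.array([1.0, 0.5])
C = np.array([1.0, 2.0])
x = np.array([1.0, 2.0, 3.0])
print(ssm_scan(x, A, B, C))        # -> [ 2.  6. 12.]
```

In practice, Mamba-style layers make A, B, and C functions of the input and compute the recurrence with a parallel scan, but the linear dependence on sequence length shown here is the core reason inference remains tractable on very long token streams.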

Evaluation on the MoleculeNet benchmark dataset underscores MoLMamba's capabilities across diverse tasks and domains. Its performance is compared against current state-of-the-art methods, affirming its efficacy in chemical machine learning applications. MoLMamba's introduction marks a significant advancement in the field, offering promising avenues for further exploration and application in real-world chemical contexts.

Related Products

Open-source large-scale foundation model for chemistry
Large-scale pre-training methodologies for chemical language models have revolutionized the field of cheminformatics, offering significant advancements in tasks such as molecular property prediction and molecule generation…
Expanded dataset for improved prediction of chemical biodegradability
Biodegradability is a crucial factor in assessing the long-term impact of chemicals on the environment. However, experimental testing to determine biodegradability is time-consuming and laborious…
GMO-Mat: A foundation model-based generative multi-objective optimization framework for materials discovery
Chemical Foundation Models (FM) have successfully supported the creation of state-of-the-art property predictors, which evidence the quality of the corresponding latent space representations created by these models…
Modeling electron density grids using a 3D VQ-GAN approach
To convert 3D electron density grids into meaningful latent representations, vector quantized autoencoders have proven effective, particularly in addressing the blurriness typical of traditional variational autoencoders…