Natural language processing and deep learning to be applied to chemical space

The EPSRC has awarded an £89.5K ‘discipline hopping’ grant for new research into navigating chemical space with natural language processing (NLP) and deep learning.

Chemist Dr Jiayun Pang, from Greenwich University, will work with Dr Ivan Vulić, an NLP and machine learning expert from Cambridge University, to study the latest developments in NLP and examine how they can be applied in chemistry.

NLP lies at the intersection of linguistics and computer science and aims to process and analyse human language, typically provided as written text.

NLP now relies heavily on machine learning to tackle its core tasks, and its algorithms underpin a wide range of real-life applications, such as ChatGPT, virtual assistants and automatic text completion.

This research will specifically explore how Transformer models, a deep learning architecture introduced by Google researchers in 2017, can be adapted to solve research challenges in chemistry.

The researchers said that, whilst chemical structures are usually three-dimensional, they are also often converted into the Simplified Molecular Input Line Entry System (SMILES), a line notation with a simple vocabulary of atom and bond symbols and grammatical rules governing how they are arranged; ethanol, for instance, is written as CCO.

The researchers said that, through SMILES, it is possible to use NLP algorithms to analyse chemical structures in much the same way as they analyse text.
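
To make this concrete, the sketch below splits a SMILES string into atom- and bond-level tokens using a regular expression, much as a sentence is split into words before being fed to an NLP model. The pattern is adapted from common practice in the chemistry NLP literature and is illustrative only, not the project's actual preprocessing pipeline.

```python
import re

# Regex-based SMILES tokeniser, adapted from common practice in the
# chemistry NLP literature. Bracket atoms such as [NH3+] are kept whole;
# two-letter elements (Br, Cl) are matched before single letters.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|@@|@|%\d{2}|[BCNOPSFIbcnops]|[=#$/\\+\-()\d])"
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into atom- and bond-level tokens."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    # Sanity check: the tokens must reassemble into the original string.
    assert "".join(tokens) == smiles, f"untokenised characters in {smiles!r}"
    return tokens

# Aspirin, treated exactly like a sentence being split into words:
print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))
# ['C', 'C', '(', '=', 'O', ')', 'O', 'c', '1', 'c', 'c', 'c', 'c', 'c',
#  '1', 'C', '(', '=', 'O', ')', 'O']
```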

"We are working to harness the power of state-of-the-art NLP algorithms for an extensive range of tasks, such as molecular similarity search, chemical reaction prediction and chemical space exploration," said Dr Pang.

"We believe in the power of cross disciplinary research to find collaborative AI solutions for science and engineering challenges."

The research will explore a technique termed transfer learning, a concept now pervasive in machine learning and NLP, in which previously developed machine learning models are reused for new tasks.

With this approach, the researchers can repurpose large general-purpose models, specialise them for specific applications, and reduce the amount of annotated data required, avoiding the expense and expert knowledge needed to develop a model from scratch.
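
A minimal sketch of what this repurposing might look like in code, using the Hugging Face transformers library: a Transformer pre-trained on SMILES is loaded and given a fresh regression head for a property-prediction task. The checkpoint name is an assumption (a publicly available ChemBERTa-style model), not the model the project will use.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# A ChemBERTa-style Transformer pre-trained on SMILES from the ZINC
# database. The checkpoint name is an illustrative assumption.
CHECKPOINT = "seyonec/ChemBERTa-zinc-base-v1"

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
# num_labels=1 with problem_type="regression" replaces the pre-trained
# language-modelling head with a fresh regression head for a
# continuous molecular property.
model = AutoModelForSequenceClassification.from_pretrained(
    CHECKPOINT, num_labels=1, problem_type="regression"
)

# One toy labelled example: caffeine's SMILES with a made-up property value.
batch = tokenizer(["Cn1cnc2c1c(=O)n(C)c(=O)n2C"], return_tensors="pt")
labels = torch.tensor([[1.23]])

loss = model(**batch, labels=labels).loss  # mean-squared error
loss.backward()  # an optimiser step over a small labelled set would follow
```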

The Transformer models will be trained to learn a latent representation of the chemical space defined by millions of SMILES strings. During fine-tuning, this learned representation will then be used to predict molecular properties for a given chemical structure.
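
The pre-training stage might look something like the following masked-language-modelling sketch, in which the model learns its latent representation by reconstructing randomly masked tokens in unlabelled SMILES strings. The checkpoint and the 15% masking rate are illustrative assumptions, not details from the grant.

```python
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling)

# Same illustrative checkpoint as above; any RoBERTa-style chemistry
# model would serve.
CHECKPOINT = "seyonec/ChemBERTa-zinc-base-v1"
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForMaskedLM.from_pretrained(CHECKPOINT)

# Unlabelled molecules: no experimental property values are needed here.
smiles = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]
features = [tokenizer(s) for s in smiles]

# The collator pads the batch and randomly masks ~15% of the tokens;
# the model is trained to reconstruct them, forcing it to internalise
# the 'grammar' of chemical structures.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
batch = collator(features)

loss = model(**batch).loss  # cross-entropy on the masked positions only
loss.backward()
```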

The researchers said that the advantage of this approach is that the resulting machine learning models will rely less on labelled data (molecules with experimentally determined properties), which can be time-consuming, or even impossible, to generate given the associated cost and experimental challenges.

The study will aim to make the Transformer models more computationally efficient and accurate using two recent machine learning techniques: sentence encoding and contrastive learning.
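
Contrastive learning, in rough outline, trains a model to pull the embeddings of two views of the same molecule together while pushing apart the embeddings of different molecules. The sketch below implements a generic InfoNCE-style loss in PyTorch; it is a textbook formulation, not the specific objective the study will use.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(emb_a: torch.Tensor, emb_b: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """Generic InfoNCE-style contrastive loss over paired embeddings.

    emb_a[i] and emb_b[i] are embeddings of two views of the same
    molecule (for example, two equivalent SMILES strings with different
    atom orderings); every other pairing in the batch is a negative.
    """
    a = F.normalize(emb_a, dim=-1)
    b = F.normalize(emb_b, dim=-1)
    logits = a @ b.T / temperature  # scaled cosine similarities
    # The matching index on the diagonal is the positive pair.
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)

# Toy usage: a batch of 4 molecules with 128-dimensional embeddings
# produced by a sentence encoder over SMILES (random here).
loss = info_nce_loss(torch.randn(4, 128), torch.randn(4, 128))
```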

Beginning in February 2024, the research aims to provide an alternative approach to evaluating molecular structures against their properties, a capability that underpins many research and development tasks in the chemical and pharmaceutical industries.