The DPhil in Computational Discovery is a multidisciplinary programme spanning projects in Advanced Molecular Simulations, Machine Learning and Quantum Computing to develop new tools and methodologies for life sciences discovery.

This innovative course has been developed in close partnership between Oxford University and IBM Research. Each research project has been co-developed by Oxford academics working with IBM scientists. Students will have one or more named IBM supervisors and many opportunities for collaboration with IBM throughout the studentship.

The scientific focus of the programme is at the interface between Physical and Life Sciences. By bringing together advances in data and computing science with large, complex sets of experimental data, more realistic and predictive computational models can be developed. These new tools and methodologies for computational discovery can drive advances in our understanding of fundamental cellular biology and drug discovery. Projects will span the emerging fields of Advanced Molecular Simulations, Machine Learning and Quantum Computing, addressing fundamental questions both within each of these fields and at their interfaces.

Students will benefit from the interdisciplinary nature of the course cohort as well as the close interactions with IBM Scientists.

Applicants who are offered places will be provided with a funding package that includes fees at the Home rate and a stipend at the standard Research Council rate (currently £17,668 pa) plus £2,400, for four years.

Project 1

Title: Robustness and generalisation of graph machine learning models 
PI: Xiaowen Dong and Mihai Cucuringu  

Summary: Can we predict whether a molecule can be a potent drug against certain bacteria from its chemical structure? Can we detect whether a piece of news on a social media site corresponds to misinformation from its spreading pattern? These are just some of the big questions facing today’s society that can potentially be solved by an emerging area of research: graph machine learning (GML) [1-3]. In addition to the promising applications [4-6], GML also has a unique theoretical appeal: it is an inherently interdisciplinary area that lies at the intersection of a number of fields such as signal processing, machine learning, statistics, network science, and differential geometry. It thus offers a context in which barriers between these different fields could be removed, new connections be made, and new insights be drawn. Despite the early promise of the field, a key open challenge in GML is the lack of understanding of model robustness and generalisation capability with respect to a perturbation of the graph data, which may come from natural noisy data characteristics (e.g., spurious correlations or irrelevant information) or malicious attempts (e.g., adversarial attacks). This is made more difficult by the complex interplay between node and edge features, graph topology, and node/graph labels. In this project, building on recent preliminary results from Oxford [7] and IBM [8-9], we propose to address this challenge through the lens of graph topology and spectral analysis, and to study the implications of perturbations that act in both the graph spatial domain (e.g., adding or deleting graph edges) and the graph spectral domain (e.g., via modification of eigenvalues/eigenvectors of the graph Laplacian) for robustness and generalisation capability. This would not only answer key theoretical questions related to GML models, but also pave the way for their safe and reliable deployment in real-world network systems.
The expected outputs of this project include (1) developing theoretical foundations for characterizing robustness and generalization of GML; (2) devising novel efficient algorithms for improving GML; (3) identifying suitable real-world application domains; and (4) contributing to joint publications at top AI/ML conferences and patents.
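
The two kinds of perturbation named in the summary can be illustrated on a toy graph. The following is a minimal sketch, not the project's method: the 5-node path graph, the choice of added edge and the 10% eigenvalue rescaling are all hypothetical choices made here for illustration.

```python
import numpy as np

def laplacian(adj):
    """Combinatorial graph Laplacian L = D - A for a symmetric adjacency matrix."""
    return np.diag(adj.sum(axis=1)) - adj

# Hypothetical toy graph: a 5-node path 0-1-2-3-4.
n = 5
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0

# Spatial-domain perturbation: add the edge (0, 4), turning the path into a cycle.
A_pert = A.copy()
A_pert[0, 4] = A_pert[4, 0] = 1.0

# eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
eig = np.linalg.eigvalsh(laplacian(A))
eig_pert = np.linalg.eigvalsh(laplacian(A_pert))
spectral_shift = float(np.max(np.abs(eig_pert - eig)))

# Spectral-domain perturbation: rescale one eigenvalue and rebuild the operator.
vals, vecs = np.linalg.eigh(laplacian(A))
vals_mod = vals.copy()
vals_mod[-1] *= 1.1  # inflate the largest eigenvalue by 10% (arbitrary choice)
L_mod = vecs @ np.diag(vals_mod) @ vecs.T

print("eigenvalues (path): ", np.round(eig, 3))
print("eigenvalues (cycle):", np.round(eig_pert, 3))
print("largest eigenvalue shift:", round(spectral_shift, 3))
```

Adding an edge adds a positive semidefinite rank-one term to the Laplacian, so no eigenvalue can decrease; bounds of this kind are one simple way to relate a spatial perturbation to its spectral effect.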


[1] Chami et al., “Machine learning on graphs: A model and comprehensive taxonomy,” arXiv, 2020.

[2] Wu et al., “A comprehensive survey on graph neural networks,” IEEE TNNLS, 2021.

[3] Bronstein et al., “Geometric deep learning: Grids, groups, graphs, geodesics, and gauges,” arXiv,


[4] Stokes et al., “A deep learning approach to antibiotic discovery,” Cell, 2020.

[5] Monti et al., “Fake news detection on social media using geometric deep learning,” ICLR, 2019.

[6] Derrow-Pinion et al., “ETA prediction with graph neural networks in Google Maps,” CIKM, 2021.

[7] Kenlay et al., “Interpretable stability bounds for spectral graph filters,” ICML, 2021.

[8] Xu et al., “Topology attack and defense for graph neural networks: An optimization perspective,” IJCAI,


[9] Li et al., “Generalization guarantee of training graph convolutional networks with graph topology sampling,” ICML, 2022.

Project 2

Title: Accelerated Design of New Sustainable Battery Materials with Artificial Intelligence Methods

PI: Saiful Islam

Summary: The provision of sustainable low-carbon energy is among the most urgent challenges of our time, and poses fundamental, exciting scientific questions. Materials performance lies at the heart of the development of green energy technologies, and computational methods now play a vital role in modelling the properties of energy materials.

However, a full understanding of the atomistic processes within materials and across interfaces that control the performance of energy storage devices such as lithium-ion batteries remains incomplete. Emerging artificial intelligence (AI) and machine learning techniques are powerful tools offering innovative capabilities for studying new battery materials on length scales from individual atoms to tens of nanometres, promising quantum-mechanical accuracy and predictive power, whilst being many orders of magnitude faster than conventional methods.

The vision of this project is the innovative use of cutting-edge machine learning simulation techniques to probe the atomic-level operation of battery materials, thereby enabling a previously missing microscopic understanding and an accelerated design of new sustainable materials with enhanced performance. Following the success of the lithium-ion battery in powering the portable electronics revolution, we will address electric vehicle application objectives of increasing the energy density and charge rates of battery electrodes and solid electrolytes, with a particular focus on how their macroscopic properties can be connected to the microscopic structure. The project will involve the creation of accurate fitting databases and machine-learning-based interatomic potentials to model the underlying atomistic behaviour of novel battery electrodes and solid electrolytes.

No equivalent concerted AI-modelling project on battery materials that inter-links such different expertise is being undertaken within any current IBM-Oxford Studentship project.

Project 3

Title: Quantum Relativistic Simulations for Dense Plasma Systems

PI: Gianluca Gregori

Summary: The equations governing quantum mechanics have been known for nearly 100 years, but even today the task of reconciling them with the equations of classical motion remains unsolved. The problem is usually expressed in terms of the trajectory (or path) that a particle follows. While in ordinary Newtonian theory a particle moves along a well-defined path, this is no longer true in orthodox quantum mechanics (the Copenhagen interpretation). This is because it is impossible to simultaneously define the position and momentum of the particle. There is, however, an alternative possibility, called Bohmian mechanics or the de Broglie-Bohm interpretation[1], whereby the particle follows a definite path, different from the classical one, determined not only by the action of local potentials, as in ordinary classical mechanics, but also by the non-local feedback of the particle's wave function on its own motion. The complexity of quantum mechanics is not circumvented in Bohm's interpretation; the particles' trajectories are determined by this non-local potential, which depends on the particles' own wave function. However, what Bohm's description offers is a recipe that allows efficient approximations for many-body quantum systems, significantly reducing computational time and costs. A stricter implementation of Bohm’s quantum mechanics within a Molecular Dynamics (MD) model, a Bohm-MD scheme, has recently been demonstrated for warm dense matter conditions[2].
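
For reference, the standard single-particle form of the de Broglie-Bohm equations can be written as below. This is textbook material rather than the specific Bohm-MD formulation of [2]; the symbols R (amplitude), S (phase), V (classical potential), Q (quantum potential) and m (mass) are notation introduced here.

```latex
\psi(\mathbf{x},t) = R(\mathbf{x},t)\, e^{iS(\mathbf{x},t)/\hbar}, \qquad
\frac{d\mathbf{x}}{dt} = \frac{\nabla S}{m}, \qquad
m\,\frac{d^{2}\mathbf{x}}{dt^{2}} = -\nabla\left(V + Q\right), \qquad
Q = -\frac{\hbar^{2}}{2m}\,\frac{\nabla^{2} R}{R}.
```

The term Q is the non-local contribution referred to above: it depends on the shape of the wave function amplitude rather than on the particle position alone.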

The Bohmian mechanics approach also offers another advantage. Since, as pointed out by de Broglie[3], the inclusion of quantum effects is entirely equivalent to a change of the spacetime metric (that is, a conformal transformation of all the coordinates), all relativistic effects in the many-body particle interactions can be included by rewriting and solving the equations of motion in this modified metric. Such an approach can be implemented in a scalable molecular dynamics framework such as LAMMPS to simulate many-body systems with of the order of thousands of particles.

The above framework will be important for studying high-temperature dense plasmas such as those found in the interiors of stars and in inertial fusion experiments. In fact, recent experiments at laser facilities have shown that the particle dynamics is significantly altered by relativistic effects[4]. Moreover, as the proposed approach treats both electrons and ions on an equal footing, unlike traditional density functional theory implementations, the plasma response to perturbations inclusive of both electron and ion dynamics can be fully calculated. Hence Green–Kubo relations can be utilised to calculate transport coefficients including both quantum and relativistic corrections from an ab-initio model. As the fundamental property that can be extracted from these simulations is the dynamic structure factor[5], the simulations connect theory and experiment in directly measurable ways: measurements of the diffraction and inelastic scattering cross-sections can be compared against the calculated structure factor, enabling testing of different warm dense matter models[6]. At present, transport properties such as electrical and thermal conductivity are extremely uncertain in these plasmas[7], and the proposed simulations can help shed light on the relevant physical processes.
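
As an illustration of how a Green–Kubo relation turns a time-correlation function into a transport coefficient, the sketch below integrates a velocity autocorrelation function (VACF) to obtain a self-diffusion coefficient, D = ∫ ⟨v_x(0) v_x(t)⟩ dt. The exponentially decaying VACF, the reduced units and the parameter values are assumptions made here so that the result can be checked against the analytic value; in practice the VACF would come from the simulation trajectories.

```python
import math

def green_kubo_integral(acf, dt):
    """Trapezoidal estimate of the time integral of a uniformly sampled
    autocorrelation function."""
    return dt * (sum(acf) - 0.5 * (acf[0] + acf[-1]))

# Hypothetical model VACF in reduced units: C(t) = (kT/m) * exp(-gamma * t).
kT_over_m = 1.0
gamma = 2.0
dt = 1e-3
n_steps = 20000  # integrate out to t = 20, where the VACF has fully decayed

vacf = [kT_over_m * math.exp(-gamma * k * dt) for k in range(n_steps + 1)]

D_numeric = green_kubo_integral(vacf, dt)
D_exact = kT_over_m / gamma  # analytic integral of the model VACF

print(f"D (numeric) = {D_numeric:.6f}, D (analytic) = {D_exact:.6f}")
```

The same pattern applies to other coefficients (for example viscosity from the stress autocorrelation), with the appropriate current and prefactor substituted.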

The student's primary task will consist of developing this new computational approach within a Molecular Dynamics framework. The student will initially focus on the numerical implementation and then apply it to realistic systems, i.e., to the calculation of transport coefficients in inertial fusion experiments. We expect this project to produce important results that are of interest not only to the fusion community, but also to the broader community of researchers working in high energy density physics, planetary science and extreme materials.

[1] D. Bohm, Physical Review 85, 166 (1952).

[2] B. Larder, D. Gericke, S. Richardson, P. Mabey, T. White, and G. Gregori, Science Advances 5, eaaw1634 (2019).

[3] L. de Broglie, Nonlinear Wave Mechanics: A Causal Interpretation (Amsterdam: Elsevier, 1960).

[4] J. S. Ross, S. H. Glenzer, J. P. Palastro, B. B. Pollock, D. Price, L. Divol, G. R. Tynan, and D. H. Froula, Physical Review Letters 104, 105001 (2010).

[5] S. Ichimaru, Reviews of Modern Physics 54, 1017 (1982).

[6] L. B. Fletcher et al., Nature Photonics 9, 274 (2015).

[7] P. E. Grabowski et al., High Energy Density Physics 37, 100905 (2020). 

Project 4

Title: Effective Transport Coefficients in Extreme Dynamic Materials

PI: Gianluca Gregori

Summary: Characterizing and quantifying mass, momentum, and energy transport in materials under extreme conditions is vital in many areas of research, ranging from inertial confinement fusion to the behavior of matter in the interiors of giant planets and stars. With temperatures of a few electron volts (eV) and densities comparable to solids, warm dense matter (WDM) forms a key constituent of planetary interiors[1] as well as cooler stellar objects such as brown dwarfs[2] and the crust of neutron stars[3]. WDM is also produced during laser processing of solids and is an important transient state in all approaches to inertial confinement fusion[4] (ICF). Transport properties are difficult to model in WDM, yet the direct measurement of transport properties, and the disentangling of microscopic from macroscopic contributions, remains notoriously elusive in these extreme states of matter[5]. Our goal here is to develop an experimental and numerical framework that can be used to measure effective transport in WDM and then use the experimental data to construct a suitable representation via symbolic regression or a trained neural network[6].

Our proposed work utilizes recent advances in diagnostics and in machine learning. For the experiments, we intend to use X-ray photon correlation spectroscopy[7] (XPCS) in novel ways to extract effective transport coefficients in dynamic laser-driven materials. Recent developments in XPCS have demonstrated it as a powerful diagnostic at X-ray free electron laser facilities, enabling the tracking of atomic-scale structure and dynamics with unprecedented spatio-temporal resolution. The experiments proposed here are a necessary first step toward the eventual goal of developing XPCS for measuring transport in dynamic non-equilibrium HED materials at different scales and of providing first-of-a-kind estimates for viscosity and diffusivity under different state conditions and in the presence or absence of instabilities and turbulence. Here we propose a novel machine learning approach to address the complex micro-physics of material strength properties and to identify their emergent behaviour via closed mathematical expressions. This is done by using a Graph Neural Network[8] (GNN) to represent the discrete description of the underlying continuum system and then applying deep learning techniques to obtain a representation of the material properties as a function of the state variables (density, temperature, etc.). The latent representation learned by the GNN is then extracted with a symbolic regression analysis[9]. Our long-term goal is the development of augmented methods to ultimately improve the design and verification of integrated modelling of WDM systems, in the sense that fluid simulations using these effective transport coefficients may now be able to capture the relevant physical processes at all scales.
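
To make the link between an XPCS measurement and a transport coefficient concrete, the sketch below fits the intensity autocorrelation g2(τ) = 1 + β exp(−2Γτ) and converts the decay rate to a diffusivity via Γ = D q². This simple-diffusion form is a common equilibrium model and is an assumption here; it is not claimed to hold for the driven, non-equilibrium materials targeted by the project, and the units and parameter values are hypothetical.

```python
import math

def fit_decay_rate(taus, g2, beta):
    """Least-squares fit (through the origin) of ln((g2 - 1) / beta) = -2 * Gamma * tau.
    Returns the decay rate Gamma."""
    ys = [math.log((g - 1.0) / beta) for g in g2]
    slope = sum(t * y for t, y in zip(taus, ys)) / sum(t * t for t in taus)
    return -slope / 2.0

# Hypothetical synthetic data: q in 1/nm, tau in ns, D in nm^2/ns.
D_true = 0.5
q = 2.0
beta = 0.3                      # speckle contrast, assumed known
gamma_true = D_true * q ** 2

taus = [0.01 * k for k in range(1, 51)]
g2 = [1.0 + beta * math.exp(-2.0 * gamma_true * t) for t in taus]

gamma_est = fit_decay_rate(taus, g2, beta)
D_est = gamma_est / q ** 2

print(f"estimated D = {D_est:.4f} nm^2/ns")
```

With real data the contrast β would usually be fitted as well, and repeating the fit at several q values tests whether Γ actually scales as q².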

One of the student’s tasks will consist of participating in experiments at Free Electron Laser facilities and developing an experimental diagnostic able to extract transport coefficients such as mass diffusion, thermal conduction or viscosity. Once a sufficiently large database has been obtained, the student will then train the GNN and apply symbolic regression techniques in order to extract an effective representation of those transport coefficients. We expect this project to produce important results that are of interest not only to the fusion community, but also to the broader community of researchers working in high energy density physics, planetary science and extreme materials.

[1] Guillot, T., 1999: Science, 286 (5437), 72–77.

[2] Brown, C. R. D. et al., 2014: Sci. Rep., 4 (1), 521.

[3] Ichimaru, S., 1982: Rev. Mod. Phys., 54 (4), 1017–1059.

[4] Hurricane, O. A. et al., 2016: Nat. Phys., 12 (8), 800–806.

[5] Grabowski, P. et al., 2020: High Energy Density Physics, 37, 100905.

[6] Miniati, F. and Gregori, G., 2022: Sci. Rep., 12, 11709.

[7] Sutton, M., 2008: Comptes Rendus Physique, 9 (5-6), 657–66.

[8] Battaglia, P. W. et al., 2018: arXiv:1806.01261v3.

[9] Udrescu, S.-M. and Tegmark, M., 2020: Sci. Adv. 6, eaay2631. 

Project 5

Title: ML-driven fragment-based drug development using data from high-throughput X-ray crystallography and biophysical measurements

PI: Charlotte Deane and Frank von Delft

Summary: An approach will be developed that integrates machine learning with experimental measurements for the rapid design of bioactive compounds that are suitable as chemical probes.  Chemical probes are often used as the starting point for drug discovery campaigns and can help elucidate the mechanism of molecule-target interactions.  Machine learning techniques will be developed that use structural data as input to suggest potent, synthetically tractable molecules to make within this workflow.  Techniques that better deconvolve signal from noise in biophysical assays will also be investigated. 

This project aims to generate an algorithmic formalism that achieves rapid design of bioactive compounds suitable as chemical probes. We will develop a machine learning approach that iteratively integrates experimental data from low-cost robotic organic synthesis, high-throughput crystallography (XChem), and rapid sensor-based biophysical measurements (Grating-Coupled Interferometry). The engine will be able to suggest new molecules that are potent, synthetically tractable and have good pharmacological properties. 

This approach builds on methodological discoveries made in the successful COVID Moonshot initiative, an open science consortium that Prof von Delft co-founded, which delivered preclinical candidates against SARS-CoV-2 main protease from fragment hits in 18 months for under £1m [1].

Chemical probes are enormously powerful reagents: as potent and selective small-molecule modulators of a protein’s function, they enable one to answer detailed mechanistic and phenotypic questions about those biological targets [2]. They are also often starting points for drug discovery campaigns. However, only a small fraction of the human proteome has an associated chemical probe [3], so chemical probes currently contribute little to unravelling the fundamental chemical biology of the vast number of genotype-phenotype correlations revealed by genome sequencing. The limiting factor is the high discovery cost and scientific challenge of designing potent and selective ligands.

This project aims to resolve these limitations by bringing together advances in automation, actuated by machine learning.

Recent work by Prof von Delft demonstrates the feasibility of using robotic synthesis as part of a fragment-based probe discovery campaign. The key insight is how to use biophysical assays and crystallography to analyse crude reaction mixtures [4]. This sidesteps purification, which is the rate-limiting step in chemical synthesis: protein crystallography directly confirms the ligand’s chemical structure, while sensor-based biophysics provides binding kinetics.

The synthesis planning methodology is driven by machine learning, possibly using IBM RXN [5], a computer-aided synthesis planning tool already developed by IBM. One possible avenue to explore is using synthetic accessibility reaction scores generated with IBM RXN as a post hoc filter after compound generation. Similarly, a deep learning compound price predictor, CoPriNet, could be used instead of the computationally expensive retrosynthesis-based synthetic accessibility reaction score, enriching our compound generation approach with real-world compound prices [6].

This project will address the two interrelated questions that must be answered for these proof-of-concept successes to become an effective platform for probe discovery. First, how can machine learning take structural biology data as input to suggest new molecules to make? Second, how can we deconvolve signal from noise in biophysical assays when the input is a crude reaction mixture?

To learn from structural biology data, we will investigate different techniques to describe protein-ligand complexes, ranging from protein-ligand interaction fingerprints to machine learning-based docking scores and full free energy calculations. Structural data may be described by employing approaches such as graph neural networks. We will then investigate approaches such as semi-supervised learning, where the structural data alone provide physical information, as well as a supervised approach where those descriptors are related to biochemical assay data through machine learning. Using the exponential increase in structural biology throughput possible at Prof von Delft’s XChem facility, we will develop a new class of descriptors. This work will utilize open source software, including tools developed within our teams, and new developments can be contributed to the open source community.

A related question is deconvoluting biophysical assay data, as the assays are performed directly on crude reaction mixtures rather than purified product. Here, the challenge lies in extracting multiple binding constants from a single, essentially multiplexed, experiment. We will investigate approaches such as Bayesian statistics to rigorously determine the number of chemical components in the system, as well as incorporating ideas such as using the predicted binding affinity from the structural biology-based machine learning model described above as a prior.
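
The idea of using a predicted affinity as a prior can be sketched for the simplest case of a single binder. Everything in the example is an assumption made for illustration: a 1:1 equilibrium binding model, Gaussian measurement noise, a Gaussian prior on log10(Kd) centred on a hypothetical model prediction, and synthetic noise-free data. The multi-component, kinetic case described above is considerably harder.

```python
import math

def binding_signal(conc, kd, r_max=1.0):
    """Equilibrium response of a 1:1 binding model."""
    return r_max * conc / (conc + kd)

def posterior_on_grid(concs, signals, log_kd_grid, prior_mean, prior_sd, noise_sd):
    """Normalised posterior over log10(Kd) on a grid: Gaussian likelihood
    multiplied by a Gaussian prior in log10(Kd)."""
    log_post = []
    for lk in log_kd_grid:
        kd = 10.0 ** lk
        log_lik = sum(-0.5 * ((s - binding_signal(c, kd)) / noise_sd) ** 2
                      for c, s in zip(concs, signals))
        log_prior = -0.5 * ((lk - prior_mean) / prior_sd) ** 2
        log_post.append(log_lik + log_prior)
    peak = max(log_post)                      # subtract the peak for numerical stability
    weights = [math.exp(lp - peak) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical experiment: true Kd = 10 (arbitrary concentration units).
concs = [1.0, 3.0, 10.0, 30.0, 100.0]
signals = [binding_signal(c, 10.0) for c in concs]

log_kd_grid = [-1.0 + 0.01 * i for i in range(401)]   # log10(Kd) from -1 to 3
posterior = posterior_on_grid(concs, signals, log_kd_grid,
                              prior_mean=1.5,  # hypothetical ML-predicted log10(Kd)
                              prior_sd=1.0, noise_sd=0.05)

map_index = max(range(len(posterior)), key=posterior.__getitem__)
map_log_kd = log_kd_grid[map_index]
print(f"MAP estimate of log10(Kd) = {map_log_kd:.2f}")
```

With informative data the likelihood dominates and the prior only nudges the estimate; the prior matters more when few concentrations are measured or the noise is large.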

The von Delft group and the Oxford Protein Informatics Group (OPIG), led by Professor Deane, have complementary expertise, and a collaboration is essential to achieving the project goals. The von Delft group has a track record in high-throughput structural biology, robotic synthesis and biophysical assays. OPIG is a world-leading group in the area of AI for small molecule design, and all code from OPIG is made available as open source. IBM researchers will contribute skills at the intersection of molecular discovery, computational chemistry and AI to this collaboration. Dr. Cornell leads the drug discovery strategy within IBM Accelerated Discovery Research and Dr. Morrone develops computational methods to combine structural data and simulation with artificial intelligence.


[1] Chodera et al., “Crowdsourcing drug discovery for pandemics”, Nature Chemistry, 12, 581 (2020)

[2] Arrowsmith et al., “The promise and peril of chemical probes”, Nature Chemical Biology, 11, 536 (2015)

[3] Carter et al., “Target 2035: probing the human proteome”, Drug Discovery Today, 24, 2111 (2019)

[4] Baker et al., “Rapid optimisation of fragments and hits to lead compounds from screening of crude reaction mixtures”, Communications Chemistry, 3, 122 (2020)

[5] Schwaller et al., “Molecular Transformer: A Model for Uncertainty-Calibrated Chemical Reaction Prediction”, ACS Central Science, 5, 1572 (2019) 

[6] Sanchez-Garcia et al., “CoPriNet: Deep learning compound price prediction for use in de novo molecule generation and prioritization”, ChemRxiv, (2022).

Project 6

Title: Taking the structure of proteins into account: predicting if infections are resistant to β-lactam antibiotics using graph-based convolutional neural networks 

PI: Phil Fowler

Summary: This project will use machine learning to predict whether novel protein mutations confer resistance to specific antibiotics.  Antimicrobial Resistance is a great concern to modern medicine.  The Fowler group has previously used simple machine learning and physical simulation to address this problem.  We aim to use graph neural network techniques developed by the IBM team to create advanced deep learning models and then train, validate and test the models on the large datasets we possess. 

Project description

Antimicrobial Resistance (AMR) is a growing threat to modern medicine; not only would infectious disease claim many more lives than it already does, but it would also adversely impact many treatments that rely on prophylaxis, e.g. many anti-cancer therapies. Work in the Fowler group focusses on developing methods to predict de novo whether novel protein mutations confer resistance (or not) to specific antibiotics [1-3]. To date we’ve built a suite of simple machine learning models able to predict resistance to rifampicin, isoniazid, levofloxacin, moxifloxacin and pyrazinamide. We’ve also used relative binding free energy methods, which require large numbers of classical molecular dynamics simulations, to predict how the binding free energy changes upon introducing a mutation into the target protein. Our work is translational, and our goal is to make these models available to public health agencies, such as UKHSA, who are using genetics to diagnose Mycobacterial infections.

Due to the much higher degree of genetic variation and to maximise the information derived from the clinical samples, the student will work closely with IBM Research to create and then train graph-based convolutional neural networks that capture the topology of the protein and the protein in complex with ligands in the connectivity of the hidden layers [4]. Crucially, such models can be trained from gene sequences rather than mutations and so can innately deal with the greater genetic variation.  These techniques incorporate structure-based approaches within a deep learning framework and have been successfully validated and applied to the binding of proteins to small molecules and biologics.
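
To show what "capturing the topology of the protein in the connectivity of the layers" can mean in practice, here is a minimal sketch of one graph-convolution step on a residue contact graph. It uses the widely known symmetric-normalised propagation rule and is not the IBM team's software; the random coordinates, the 8 Å contact cutoff, the one-hot features and the random weights are all placeholders.

```python
import numpy as np

def contact_adjacency(coords, cutoff):
    """Residues closer than `cutoff` are connected; no self-loops."""
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    adj = (dists < cutoff).astype(float)
    np.fill_diagonal(adj, 0.0)
    return adj

def gcn_layer(adj, feats, weights):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    a_hat = adj + np.eye(adj.shape[0])            # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ feats @ weights, 0.0)

rng = np.random.default_rng(0)
coords = rng.normal(scale=5.0, size=(6, 3))   # hypothetical residue positions (Å)
feats = np.eye(6)                             # placeholder one-hot residue features
weights = rng.normal(size=(6, 4))             # untrained, random weights

adj = contact_adjacency(coords, cutoff=8.0)
hidden = gcn_layer(adj, feats, weights)
graph_embedding = hidden.mean(axis=0)         # pooled representation of the protein

print(hidden.shape, graph_embedding.shape)
```

Because the pooled embedding averages over nodes, it does not depend on the order in which residues are listed, which is one reason graph models suit structural data.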

This work will also utilize open source software. The Fowler group has several open source tools in this space. In the near future, the IBM team is planning to release open source graph convolutional neural network software that will serve as the underpinning for the development of the proposed machine learning tools. New developments are intended to be contributed to open source efforts.

To date we've focussed on proteins in S. aureus and M. tuberculosis, and during 2023 we are turning our attention to the ESKAPE pathogens, specifically E. coli, S. aureus and K. pneumoniae, and the β-lactam antibiotics (and potentially also β-lactam inhibitors). We have large datasets for each of the above pathogens, comprising clinical samples that have been whole genome sequenced and had their susceptibility to a range of antibiotics tested, and there are also community resources and datasets available. These will be used to train, validate and test the models. The student would optimise the approach, and there are many avenues that could be explored, including how to incorporate multiple protein structures and the effect of protein dynamics on prediction.

The project would suit a student wishing to learn and apply machine learning to an important biomedical problem who wants to work in a fast-paced and highly interdisciplinary environment.   It could therefore suit students from a broad range of backgrounds — we do not expect a student to have experience in everything below before they start!

Skills and knowledge

During the DPhil, the student would learn how to:

  • use Linux
  • code in Python, a high-level, easy-to-use programming language
  • manipulate large datasets using Pandas
  • use software-engineering best practices such as unit testing, version control and continuous integration
  • train simple machine learning models
  • construct graph-based convolutional neural networks for specific genes that can predict antibiotic susceptibility

They would also gain knowledge in:

  • bacterial genetics
  • protein structure and chemistry
  • computational chemistry and drug discovery
  • clinical microbiology, specifically the mechanism of resistance for antibiotics such as the β-lactams


[1] Fowler et al., “Robust Prediction of Resistance to Trimethoprim in Staphylococcus aureus.” Cell Chem. Biol. 25, 339 (2018).

[2] Brankin et al., “Predicting antibiotic resistance in complex protein targets using alchemical free energy methods.” J. Comp. Chem. (2022).

[3] Carter et al., “Prediction of pyrazinamide resistance in Mycobacterium tuberculosis using structure-based machine learning approaches.” bioRxiv (2022).

[4] Morrone et al., “Combining docking pose rank and structure with deep learning improves protein–ligand binding mode prediction over a baseline docking approach.” J. Chem. Inf. Model. 60, 4170 (2020).

Project 7

Title: Defining computation and connectivity in neuronal population activity underlying motor learning

PI: Andrew Sharott

Summary: Neural network structure constrains the activity dynamics of the brain. Specifically, learning of movements guided by the outcome of previous actions leads to adaptations in the motor cortical network and its activity. To understand these mechanisms on the cellular level would require simultaneous recordings from hundreds of local neurons at millisecond timescale in vivo during learning of a skilled movement. We have successfully established an approach to simultaneously record thousands of neurons across motor regions in mice, using recently developed high-density electrode silicon-probes in combination with machine-learning based kinematic analysis and cell-type specific optogenetic modulation.

Motivated by recent work that links the structure of population activity to the underlying synaptic connectivity (Dahmen et al., 2022) and by our experience in cortical microcircuits (Peng et al., 2019, 2022), we aim to identify core changes in neuronal microcircuits that underlie motor learning and execution. We will develop novel approaches to extract activity signatures reflecting plastic changes at the local synaptic level and model how these constrain the overall dimensionality of neuronal population activity. The results will provide a microcircuit-level understanding of learning in motor circuits and lay the groundwork to study neural network architecture in high-density electrophysiological recordings.
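
One common way to quantify the dimensionality of population activity is the participation ratio of the covariance spectrum, PR = (Σλ)² / Σλ². The sketch below computes it for simulated activity; the choice of this particular measure, and the synthetic data (50 units driven by 3 latent signals plus noise), are assumptions made here for illustration.

```python
import numpy as np

def participation_ratio(cov):
    """Effective dimensionality (sum of eigenvalues)^2 / (sum of squared
    eigenvalues) of a covariance matrix."""
    eig = np.linalg.eigvalsh(cov)
    return float(eig.sum() ** 2 / (eig ** 2).sum())

# Hypothetical recording: 1000 time bins, 50 units, 3 shared latent signals.
rng = np.random.default_rng(0)
latents = rng.normal(size=(1000, 3))
loadings = rng.normal(size=(3, 50))
activity = latents @ loadings + 0.1 * rng.normal(size=(1000, 50))

cov = np.cov(activity, rowvar=False)   # 50 x 50 covariance across units
pr = participation_ratio(cov)

print(f"participation ratio = {pr:.2f} (out of 50 recorded units)")
```

The ratio equals the number of units when all directions carry equal variance and approaches 1 when a single pattern dominates, so it gives a compact summary of how strongly connectivity constrains the activity.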


Dahmen, D. et al. Strong and localized recurrence controls dimensionality of neural activity across brain areas. bioRxiv 2020.11.02.365072 (2022).

Peng, Y. et al. High-throughput microcircuit analysis of individual human brains through next-generation multineuron patch-clamp. eLife (2019).

Peng, Y. et al. Spatially structured inhibition defined by polarized parvalbumin interneuron axons promotes head direction tuning. Science Advances (2021).

Project 8

Title: Optimising therapy for brain disorders through AI-refined deep brain stimulation

PI: Hayrie Cagnan

Summary: Brain stimulation is extensively used to modulate neural activity in order to alter behaviour. In recent years, closed-loop stimulation techniques have gained increasing traction: these sense a biomarker, such as elevated neural activity patterns, and deliver stimulation in time with such events. Closed-loop stimulation techniques are used both for establishing a causal link between behaviour and neural activity and for treating various neurological and psychiatric conditions. Building on our recent work (West et al. 2022, Cagnan et al. 2017), this PhD project aims to formalise stimulation parametrisation by using theoretical models of brain circuits in combination with state-of-the-art machine learning approaches. Specifically, we will train artificial neural networks to classify discrete brain states of interest and optimise stimulation parameters to achieve precise manipulation of activity propagating across brain circuits. The successful development of such an approach would provide a powerful framework to guide next-generation stimulation strategies for use in both basic science and clinical applications.

Project 11

Title: Foundations of Stochastic Gradient Descent (and Generalization)

PI: Patrick Rebeschini 

Summary: Stochastic gradient descent is one of the most widely used algorithmic paradigms in modern machine learning. Despite its popularity, there are many open questions related to its generalization capabilities. For instance, while there is preliminary evidence that early-stopped gradient descent applied to over-parameterized models is robust with respect to label misspecifications, a complete theory that can account for this phenomenon is currently lacking. The goal of this project is to rigorously investigate the robustness properties of early-stopped gradient descent from a theoretical point of view in simplified settings involving linear models, and to establish novel connections between this methodology and the field of distributionally-robust optimization. The project will combine tools from the study of random structures in high-dimensional probability (e.g., concentration inequalities, theory of optimal transport) with the general framework of gradient and mirror descent methods from optimization and online learning (e.g., regularization).

The project is mathematically-oriented and it is based on topics covered in the lecture notes of the Oxford course

Numerical experiments might be used (but not necessarily) to validate theoretical findings, most likely using Python.
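
As an illustration of the kind of numerical experiment this could involve, the following is a minimal sketch, not part of the project specification. It runs plain (full-batch) gradient descent on an over-parameterized linear model with noisy labels and compares the early-stopped iterate with the final, interpolating one. The problem sizes, noise level, step size and the function name `early_stopped_gd` are all illustrative assumptions.

```python
import numpy as np


def early_stopped_gd(n=50, d=200, noise_std=2.0, num_steps=500, seed=0):
    """Gradient descent on least squares with d > n and noisy labels.

    Returns the squared parameter error ||w_t - w_star||^2 at every step
    (for isotropic Gaussian features this equals the excess population
    risk) and the final training mean squared error.
    All default values are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)

    # Unit-norm ground truth and isotropic Gaussian features.
    w_star = np.zeros(d)
    w_star[0] = 1.0
    X = rng.standard_normal((n, d))
    y = X @ w_star + noise_std * rng.standard_normal(n)  # corrupted labels

    # Step size 1/L, where L is the largest eigenvalue of X^T X / n.
    L = np.linalg.norm(X, 2) ** 2 / n
    step = 1.0 / L

    w = np.zeros(d)  # start from zero
    errors = [float(np.sum((w - w_star) ** 2))]
    for _ in range(num_steps):
        grad = X.T @ (X @ w - y) / n
        w = w - step * grad
        errors.append(float(np.sum((w - w_star) ** 2)))

    train_mse = float(np.mean((X @ w - y) ** 2))
    return np.array(errors), train_mse


errors, train_mse = early_stopped_gd()
best_t = int(np.argmin(errors))
print(f"best stopping time: {best_t}")
print(f"error at best stopping time: {errors[best_t]:.3f}")
print(f"error at final iterate: {errors[-1]:.3f}")
print(f"final training MSE: {train_mse:.2e}")
```

In this toy setting the final iterate fits the noisy labels almost exactly, while an earlier iterate is closer to the ground truth. The stopping time here is chosen with knowledge of the ground truth, which is only possible in simulation.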


Project 12

Title: Modelling proton delocalization in hydrogen-bond networks with quantum simulators 

PI: Tristan Farrow

Summary: A widely relevant use case for quantum simulators in biochemistry and pharmacology involves the study of tautomerization – when protons transfer between molecular sites resulting in different configurations of a chemical compound. Tautomerization plays an important role in canonical biochemical reactions by providing a pathway that can determine the aromatization of molecules, enable catalytic enzyme reactions, and determine the tautomeric forms of photoproteins in luminescence [1]. The molecular structures and their quantum dynamics need to be modelled as open systems, where the environment is the electronic structure to which the ’proton waves’ couple. Traditional quantum chemistry assumes that protons are fixed and lacks the tools to model proton delocalization and decoherence processes to describe the open system dynamics. Quantum computers have the potential to simulate quantum systems up to sizes that are no longer affordable classically or to handle strongly entangled systems, potentially between a system and its environment. Furthermore, present-day quantum computers are inherently noisy. A large body of work has focused on developing error mitigation methods to suppress the noise. However, little work has been done to explore how noise can be engineered to help simulate quantum systems surrounded by a bath, such as the molecular structures discussed here. Using quantum simulators and a creative use of noise in canonical biochemical reactions set within specific biosystems offers a new opportunity to innovate and to gain fresh insights. Here we present three possible systems of interest of increasing complexity. A simple model for enzyme catalysis, treated as an open quantum system, is presented as a use case in section I. The theoretical model for this system has been developed in [2] and provides a possible framework for benchmarking quantum simulations.
A general scheme for protein luminescence that relies on proton delocalization is presented in the second use case in section II. Protein luminescence presents a particularly interesting opportunity because it intersects with my experimental work, which could be used to evaluate fitting parameters where possible for our theoretical models. A third use case extending this work models point mutations in DNA and is presented in section III. All three projects share proton delocalization as a determinant quantum process.


[1] Shimomura, O., Bioluminescence: Chemical Principles and Methods, World Scientific Publishing, ISBN-10: 9814366080 (2012).

[2] Pusuluk, O., Farrow, T., Deliduman, C., Burnett, K., and Vedral, V., Proton tunneling in hydrogen bonds and its possible implications in an induced-fit model of enzyme catalysis. Proc. R. Soc. A 474, 20180037 (2018).

Project 13

Title: Accelerated Modelling of Reaction Pathways using Machine Learning for Carbon Capture Materials

PI: Fernanda Duarte

Summary: The climate crisis, due to anthropogenic emissions, is likely the single largest issue facing the planet in the 21st Century. Materials to aid in curbing, and ultimately removing, carbon emissions are vital to prevent unsustainable climate change. To meet this challenge, novel materials and new capabilities are required. Accurate and reliable predictions of material capabilities are a critical part of material design. Computational methods play an increasing role in providing these capabilities prior to laboratory work. This proposal seeks to address the issue of accurately, reliably, and efficiently computing reaction pathways and using such methods to improve carbon capture materials.

Numerous carbon capture materials have been proposed, but only a handful are commercially operating to date. Among the commercial alternatives, most utilize solvents to perform carbon capture; however, the exact details of the capture mechanism are often unknown. Additionally, for these materials to be used in a cyclic manner, the carbon dioxide must be removed, and the material reformed in an economical and sustainable manner. The energy this requires is known as the regeneration energy, one of the major reasons for the high operating costs of carbon capture units.

The project will develop accurate and efficient Machine Learning (ML) models to elucidate reaction pathways and evaluate novel designs. Where data exist in the open literature, we will use them, but we will also seek to generate our own data specific to the domain at hand, which currently lacks such data sets.

The overall aim of this project will be to develop ML models applicable to describing chemical reaction pathways. We will show these models operating on well-known and understood systems prior to using them to improve the current generation of carbon capture materials. 

Project 14

Title: Using Carbon Dioxide to Make Plastics and Materials – Circular Carbon Economies

PI: Charlotte Williams

Summary: The project combines long-standing expertise at IBM and in the Williams Research group in Oxford Chemistry to ensure that the next generation of polymers and plastics maximise carbon dioxide recycling and minimise negative environmental impacts. This will be achieved through investigation of impacts throughout the lifecycle: from ensuring the monomers used are bio-derived or even waste carbon dioxide, to delivering efficient manufacturing processes, to designing polymers with the highest performances to minimise the need for additives, to ensuring all materials are recyclable after use. Research will exploit discoveries in the Williams group allowing carbon dioxide and bio-derived monomers to be transformed into polymers, plastics and elastics. The manufacturing process benefits from automation expertise at IBM, and will be underpinned by computational measurements of both catalytic cycle and manufacturing process. The project focusses on efficient catalysis to make carbon dioxide derived thermoplastics for use in future electronics and electrical applications, including as an insulator and in heat management systems, which show lower greenhouse gas emissions throughout their life cycles. The polymers are designed to be easily recycled, by mechanical and chemical processes, at end of life, and research will investigate the optimum methods (catalysts/processes) for these recycling technologies. The project will involve a period of secondment and close collaboration with the IBM Almaden (San Francisco) team headed up by Jim Hedrick. The research at IBM focusses on using continuous flow methods in polymer synthesis to accelerate manufacturing and improve control over polymer properties. Polymer property and processing assessments will be made between the laboratories at U. Oxford and IBM Almaden.

Project 15

Title: Developing geospatial foundation models for climate model evaluation and the detection of extreme climate events 

PI:  Philip Stier

Summary: Foundation models are a general class of AI models trained (generally self-supervised) on a large set of multimodal data. The resulting foundation model can be fine-tuned to solve a wide array of downstream tasks. Although the methodology is general and applicable to different domains and applications, current popular examples are mostly focused on natural language processing (e.g. GPT-3 for natural language and DALL-E for text-to-image tasks). As foundation models are complex and trained on large datasets, they tend to exhibit an emergent property where a system’s behaviour is implicitly induced rather than explicitly specified. This is especially advantageous for many applications in climate science where the underlying physical processes are sometimes too complex for a limited amount of data to capture, or where high quality data for training models able to detect climate events of interest might be scarce.

Despite these advantages and the increased availability of large volumes of high-resolution climate-related data, the use of foundation models in climate science is still in its infancy. This is partly because observed climate datasets (for example, satellite images [e.g. Sentinel and Landsat], time-series [e.g. weather station data and rain gauges], 3D signals [e.g. LiDAR point clouds]) often include spatially heterogeneous and asynchronous data, meaning that not all data modalities are available at the same location and time.

Finally, although there is increased availability of data, it is not clear how much data is actually needed to train foundation models and then obtain good results in downstream climate science-oriented tasks. The aim of this PhD project is to develop new modular deep learning architectures for foundation models that allow one to deal with the multivariate nature of climate data and its spatio-temporal intermittence. The project will explore transformer-based architectures to allow parallelization between modalities before the extracted data representation is recombined. Focusing on training efficiency and computation, the project may also investigate whether it is possible to understand the added value of bringing in an additional modality or sets of training samples.

Ultimately, the foundation models developed during the project will be tested and compared to the regular paradigm (e.g. developing a bespoke model for each application), in downstream tasks. This might include earth observation for climate hazards (e.g. flood, wildfire, landslide, drought) and climate model evaluation against observations. If successful, these foundation models will be an extremely powerful tool that will enable more efficient and accurate climate impact assessment and earth observation.

Further reading:

Bommasani et al 2021: On the opportunities and risks of foundation models


Lacoste et al 2021: Toward Foundation Models for Earth Monitoring: Proposal for a Climate Change Benchmark

Project 16

Title: Advancing synthesis prediction with machine learning – A data driven/mechanistic approach

PI: Fernanda Duarte

Summary: The project will apply the latest machine learning (ML) techniques to chemical applications, including the exploration of reaction pathways toward medicinally relevant scaffolds. The aim will be to develop interpretable ML algorithms that facilitate the prediction of synthetic routes and provide a mechanistic understanding of their outcome.

This project will enable the student to explore fundamental scientific questions at the interface of chemistry and machine learning and apply these insights to tackle timely real-world applications. It will also provide the opportunity to work with multi-disciplinary teams in academia and industry. The group of Prof. Fernanda Duarte will provide world-leading expertise in reaction pathway modelling and automation, while the team at IBM Research will bring expertise in the development of computational chemistry software and AI techniques.

Applicants must have, or expect to obtain, a Master’s (or equivalent) degree in Chemistry, Physics, Computer Science or a related subject. Previous experience in computational chemistry or machine learning would be an advantage.

Application Deadline

This course is open until Wednesday 1 March 2023 for applications to the four projects shown. Three funded places are available for students with Home Fees.

Programme Director

Professor Phil Biggin

Supported By

IBM, EPSRC, Oxford University

Further Information

Project Booklet