The Cambridge Centre for AI in Medicine will be represented at the Thirty-sixth Conference on Neural Information Processing Systems (NeurIPS 2022) with 10 papers accepted for publication and 4 papers presented at workshops.
NeurIPS, running from 28 November – 9 December, is globally renowned for presenting and publishing cutting-edge research, and the conference includes invited talks, demonstrations, symposia, and oral and poster presentations of refereed papers. A professional exposition focusing on machine learning in practice, a series of tutorials, and topical workshops that provide a less formal setting for the exchange of ideas take place alongside the conference.
Along with ICML and ICLR, NeurIPS is one of the three primary high-impact conferences in machine learning and artificial intelligence research.
The Cambridge Centre for AI in Medicine and its partners at AstraZeneca and GSK are at the forefront of research in machine learning. From better understanding human decision-making to individualised treatment effects and machine learning interpretability, our team – made up of expert minds from the Cambridge Machine Learning Group (José Miguel Hernández-Lobato), University of Cambridge Department of Computer Science and Technology (Pietro Liò), and the van der Schaar Lab (Mihaela van der Schaar) – is working on revolutionary technologies that will shape the future of healthcare.
Collectively, these papers touch on some of the most important areas within the Centre’s extensive research agenda, including discovery using machine learning, data imputation, deep generative models, synthetic data, graph neural networks, AutoML, deep learning, individualised treatment effects, and, last but not least, unsupervised learning.
Here are the highlights of the CCAIM contributions to NeurIPS 2022:
Abstract
Variational Autoencoders (VAEs) have recently been highly successful at imputing and acquiring heterogeneous missing data. However, within this specific application domain, existing VAE methods are restricted by using only one layer of latent variables and strictly Gaussian posterior approximations.
To address these limitations, we present HH-VAEM, a Hierarchical VAE model for mixed-type incomplete data that uses Hamiltonian Monte Carlo with automatic hyper-parameter tuning for improved approximate inference. Our experiments show that HH-VAEM outperforms existing baselines in the tasks of missing data imputation and supervised learning with missing features. Finally, we also present a sampling-based approach for efficiently computing the information gain when missing features are to be acquired with HH-VAEM.
Our experiments show that this sampling-based approach is superior to alternatives based on Gaussian approximations.
Why is this important?
Data with missing values is very common in real-world problems, especially in the healthcare setting. This work describes new hierarchical deep generative models for modelling data with missing values. These models outperform previous approaches, which are not hierarchical. The paper also describes state-of-the-art methods for acquiring missing data efficiently, that is, for selecting which missing values are most useful to collect next in order to improve the performance of a predictive model. This problem has important applications in healthcare: for example, when the predictive model is used to identify a patient’s disease based on medical tests, the missing data are the tests not yet carried out on the patient.
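The acquisition idea can be illustrated with a toy sketch (all numbers below are hypothetical, and the sketch uses exact Bayes updates on a Bernoulli disease model rather than HH-VAEM's learned generative model): the expected information gain of a candidate medical test is estimated by sampling hypothetical test outcomes and averaging the posterior uncertainty each would leave behind.

```python
import math
import random

def entropy(p):
    """Binary entropy (in nats) of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

# Toy model: disease y ~ Bernoulli(0.3); a candidate test comes back
# positive with probability 0.9 if diseased and 0.2 if healthy.
PRIOR, SENS, FPR = 0.3, 0.9, 0.2

def posterior(x_pos):
    """p(y=1 | test result) by Bayes' rule."""
    like1 = SENS if x_pos else 1 - SENS
    like0 = FPR if x_pos else 1 - FPR
    num = like1 * PRIOR
    return num / (num + like0 * (1 - PRIOR))

def expected_info_gain(n_samples=20000, seed=0):
    """Monte Carlo estimate of H(y) - E_x[H(y|x)]: sample hypothetical
    outcomes of the not-yet-performed test, average the entropy of the
    posterior each outcome would induce."""
    rng = random.Random(seed)
    post_entropy = 0.0
    for _ in range(n_samples):
        y = rng.random() < PRIOR
        x_pos = rng.random() < (SENS if y else FPR)
        post_entropy += entropy(posterior(x_pos))
    return entropy(PRIOR) - post_entropy / n_samples

gain = expected_info_gain()
```

A test with a larger gain is the more useful one to acquire next; the gain is always between zero and the prior entropy of the disease label.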
Abstract
In many real world problems, features do not act alone but in combination with each other. For example, in genomics, diseases might not be caused by any single mutation but require the presence of multiple mutations. Prior work on feature selection either seeks to identify individual features or can only determine relevant groups from a predefined set.
We investigate the problem of discovering groups of predictive features without predefined grouping. To do so, we define predictive groups in terms of linear and non-linear interactions between features. We introduce a novel deep learning architecture that uses an ensemble of feature selection models to find predictive groups, without requiring candidate groups to be provided. The selected groups are sparse and exhibit minimum overlap.
Furthermore, we propose a new metric to measure similarity between discovered groups and the ground truth. We demonstrate the utility of our model on multiple synthetic tasks and semi-synthetic chemistry datasets, where the ground truth structure is known, as well as an image dataset and a real-world cancer dataset.
Why is this important?
Standard feature selection only provides a list of features, without any information about how they interact or how we might group them together. However, in nature we know there are many complicated but separate interactions between features, such as how genes interact to lead to a given disease. Our model is able not only to discover relevant features, but also to group them based on how they interact, so that we may gain further insight into the underlying mechanism behind a problem. Further, we introduce this new challenge to the community, presenting a solution to a previously unexplored problem and opening a new direction for research.
Abstract
Geometric deep learning has well-motivated applications in the context of biology, a domain where relational structure in datasets can be meaningfully leveraged. Currently, efforts in both geometric deep learning and, more broadly, deep learning applied to biomolecular tasks have been hampered by a scarcity of appropriate datasets accessible to domain specialists and machine learning researchers alike.
However, there has been little exploration of how best to integrate and construct geometric representations of these datatypes. To address this, we introduce Graphein as a turn-key tool for transforming raw data from widely-used bioinformatics databases into machine learning-ready datasets in a high-throughput and flexible manner. Graphein is a Python library for constructing graph and surface-mesh representations of protein structures and biological interaction networks for computational analysis. Graphein provides utilities for data retrieval from widely-used bioinformatics databases for structural data, including the Protein Data Bank, the recently-released AlphaFold Structure Database, and for biomolecular interaction networks from STRINGdb, BioGrid, TRRUST and RegNetwork.
The library interfaces with popular geometric deep learning libraries: DGL, PyTorch Geometric and PyTorch3D though remains framework agnostic as it is built on top of the PyData ecosystem to enable inter-operability with scientific computing tools and libraries. Graphein is designed to be highly flexible, allowing the user to specify each step of the data preparation, scalable to facilitate working with large protein complexes and interaction graphs, and contains useful pre-processing tools for preparing experimental files. Graphein facilitates network-based, graph-theoretic and topological analyses of structural and interaction datasets in a high-throughput manner. As example workflows, we make available two new protein structure-related datasets, previously unused by the geometric deep learning community.
We envision that Graphein will facilitate developments in computational biology, graph representation learning and drug discovery.
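The kind of representation Graphein produces can be sketched in a few lines. The snippet below is not Graphein's API; it is a minimal, self-contained illustration of one of the graph types the library constructs, a residue-level contact graph in which residues become nodes and spatial proximity becomes edges (the residue names and coordinates are made up):

```python
import math

def build_contact_graph(residues, threshold=8.0):
    """Build an undirected residue-contact graph: nodes are residues,
    edges connect pairs whose (toy) C-alpha coordinates lie within
    `threshold` angstroms of each other."""
    graph = {name: set() for name, _ in residues}
    for i in range(len(residues)):
        for j in range(i + 1, len(residues)):
            (a, pa), (b, pb) = residues[i], residues[j]
            if math.dist(pa, pb) <= threshold:
                graph[a].add(b)
                graph[b].add(a)
    return graph

# Toy chain of four residues with hand-made 3D coordinates.
chain = [
    ("ALA1", (0.0, 0.0, 0.0)),
    ("GLY2", (3.8, 0.0, 0.0)),
    ("SER3", (7.6, 0.0, 0.0)),
    ("LYS4", (30.0, 0.0, 0.0)),  # far from the rest
]
g = build_contact_graph(chain)
```

Graphein automates this pipeline at scale, fetching real structures from the databases above and exporting the resulting graphs directly to DGL, PyTorch Geometric or PyTorch3D.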
Why is this important?
TBC
Abstract
Concept-based explanations make it possible to understand the predictions of a deep neural network (DNN) through the lens of concepts specified by users. Existing methods assume that the examples illustrating a concept are mapped in a fixed direction of the DNN’s latent space. When this holds true, the concept can be represented by a concept activation vector (CAV) pointing in that direction.
In this work, we propose to relax this assumption by allowing concept examples to be scattered across different clusters in the DNN’s latent space. Each concept is then represented by a region of the DNN’s latent space that includes these clusters and that we call concept activation region (CAR). To formalize this idea, we introduce an extension of the CAV formalism that is based on the kernel trick and support vector classifiers. This CAR formalism yields global concept-based explanations and local concept-based feature importance. We prove that CAR explanations built with radial kernels are invariant under latent space isometries.
In this way, CAR assigns the same explanations to latent spaces that have the same geometry. We further demonstrate empirically that CARs offer (1) more accurate descriptions of how concepts are scattered in the DNN’s latent space; (2) global explanations that are closer to human concept annotations and (3) concept-based feature importance that meaningfully relate concepts with each other. Finally, we use CARs to show that DNNs can autonomously rediscover known scientific concepts, such as the prostate cancer grading system.
Why is this important?
Concept-based explanations constitute a promising new type of explanation. They allow the user to probe complex deep learning models through a set of human-friendly concepts. Concept-based explanations are user-centric: the user is free to define a concept by providing positive examples (examples with the concept) and negative examples (examples without it).
For instance, a doctor can define the concept “grade 3 prostate cancer patient” by listing grade 3 and non-grade-3 patients. It is then possible to highlight the relevance of each user-defined concept for the model’s prediction. Our work generalizes previous concept-based explanation methods by relaxing the stringent assumption that concept positives and negatives are linearly separable in the model’s representation space.
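The region idea can be sketched with a simplified kernel score (the paper uses support vector classifiers; this toy version replaces them with a mean radial-kernel similarity, and all latent coordinates below are invented): a point belongs to the concept region when it is, on average, more similar to the concept's positive examples than to its negatives.

```python
import math

def rbf(u, v, gamma=0.5):
    """Radial basis kernel between two latent vectors."""
    sq = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq)

def car_score(z, positives, negatives, gamma=0.5):
    """Toy concept-activation score: mean kernel similarity to the
    concept's positive examples minus that to its negatives. A point
    z is judged inside the concept region when the score is positive."""
    pos = sum(rbf(z, p, gamma) for p in positives) / len(positives)
    neg = sum(rbf(z, n, gamma) for n in negatives) / len(negatives)
    return pos - neg

# Concept examples scattered across two separate latent clusters, as
# CAR allows (a single CAV direction could not capture both clusters).
positives = [(0.0, 0.0), (0.2, -0.1), (5.0, 5.0), (5.1, 4.8)]
negatives = [(2.5, 2.5), (-3.0, 3.0), (3.0, -3.0)]

inside = car_score((5.0, 4.9), positives, negatives)   # near a positive cluster
outside = car_score((2.5, 2.4), positives, negatives)  # near a negative example
```

Because the radial kernel depends only on distances, this score is unchanged under rotations and translations of the latent space, the isometry-invariance property the paper proves for CAR explanations.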
Abstract
Estimating personalized effects of treatments is a complex, yet pervasive problem. To tackle it, recent developments in the machine learning (ML) literature on heterogeneous treatment effect estimation gave rise to many sophisticated, but opaque, tools: due to their flexibility, modularity and ability to learn constrained representations, neural networks in particular have become central to this literature.
Unfortunately, the assets of such black boxes come at a cost: models typically involve countless nontrivial operations, making it difficult to understand what they have learned. Yet, understanding these models can be crucial — in a medical context, for example, discovered knowledge on treatment effect heterogeneity could inform treatment prescription in clinical practice. In this work, we therefore use post-hoc feature importance methods to identify features that influence the model’s predictions. This allows us to evaluate treatment effect estimators along a new and important dimension that has been overlooked in previous work: We construct a benchmarking environment to empirically investigate the ability of personalized treatment effect models to identify predictive covariates — covariates that determine differential responses to treatment.
Our benchmarking environment then enables us to provide new insight into the strengths and weaknesses of different types of treatment effects models as we modulate different challenges specific to treatment effect estimation — e.g. the ratio of prognostic to predictive information, the possible nonlinearity of potential outcomes and the presence and type of confounding.
Why is this important?
Estimating personalized effects of treatments is a complex problem that is important in many fields, especially healthcare. The recent machine learning literature has therefore proposed many sophisticated yet opaque models to estimate such effects, which have mainly been evaluated by checking whether their numerical estimates are roughly correct. In practice, however, the ultimate goal when using such models is often to derive actionable insight and scientific discovery: in particular, practitioners are often interested in learning about the drivers of effect heterogeneity. When testing a new drug in a clinical trial, for example, such insight could be used to identify both patient groups for whom a treatment works exceptionally well and patient groups that should be excluded from future trials and prescription due to lack of effect or even harm. In this paper, we therefore empirically investigate how well existing estimation strategies identify the true drivers of treatment effect heterogeneity.
Abstract
Consider learning a decision support assistant to serve as an intermediary between (oracle) expert behavior and (imperfect) human behavior: At each time, the algorithm observes an action chosen by a fallible agent, and decides whether to accept that agent’s decision, intervene with an alternative, or request the expert’s opinion.
For instance, in clinical diagnosis, fully-autonomous machine behavior is often beyond ethical affordances, thus real-world decision support is often limited to monitoring and forecasting. Instead, such an intermediary would strike a prudent balance between the former (purely prescriptive) and latter (purely descriptive) approaches, while providing an efficient interface between human mistakes and expert feedback. In this work, we first formalize the sequential problem of online decision mediation—that is, of simultaneously learning and evaluating mediator policies from scratch with abstentive feedback: In each round, deferring to the oracle obviates the risk of error, but incurs an upfront penalty, and reveals the otherwise hidden expert action as a new training data point. Second, we motivate and propose a solution that seeks to trade off (immediate) loss terms against (future) improvements in generalization error; in doing so, we identify why conventional bandit algorithms may fail.
Finally, through experiments and sensitivities on a variety of datasets, we illustrate consistent gains over applicable benchmarks on performance measures with respect to the mediator policy, the learned model, and the decision-making system as a whole.
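The mediation loop described above can be sketched as a toy simulation (all parameters, the context structure, and the simple "defer until seen enough, then override with the majority expert action" policy are illustrative inventions, not the paper's algorithm): deferring to the oracle costs an upfront penalty but reveals the expert's action as training data, after which the mediator can safely intervene on the fallible agent.

```python
import random

def run_mediation(n_rounds=500, n_contexts=5, agent_acc=0.7,
                  defer_cost=0.3, min_obs=3, seed=0):
    """Toy online decision mediation loop. Each context has one correct
    action (0 or 1). The mediator defers to the oracle (paying defer_cost
    but observing the expert action) until it has seen a context enough
    times; afterwards it overrides the agent with the majority expert
    action recorded for that context."""
    rng = random.Random(seed)
    truth = [rng.randint(0, 1) for _ in range(n_contexts)]
    seen = [[] for _ in range(n_contexts)]  # expert actions per context
    total_loss = 0.0
    for _ in range(n_rounds):
        c = rng.randrange(n_contexts)
        agent_action = truth[c] if rng.random() < agent_acc else 1 - truth[c]
        if len(seen[c]) < min_obs:
            # Defer: pay the upfront penalty, reveal the expert's action.
            seen[c].append(truth[c])
            total_loss += defer_cost
        else:
            # Intervene with the majority expert action seen so far
            # (ignoring the possibly mistaken agent_action).
            action = max(set(seen[c]), key=seen[c].count)
            total_loss += 0.0 if action == truth[c] else 1.0
    return total_loss / n_rounds

mediated = run_mediation()  # average loss per round
```

After a short warm-up of deferrals, the mediator's average loss falls well below the 0.3 error rate of always accepting the fallible agent, illustrating the trade-off between immediate deferral cost and future generalization.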
Why is this important?
TBC
Abstract
High model performance, on average, can hide that models may systematically underperform on subgroups of the data. We consider the tabular setting, which surfaces the unique issue of outcome heterogeneity – this is prevalent in areas such as healthcare, where patients with similar features can have different outcomes, thus making reliable predictions challenging.
To tackle this, we propose Data-IQ, a framework to systematically stratify examples into subgroups with respect to their outcomes. We do this by analyzing the behavior of individual examples during training, based on their predictive confidence and, importantly, the aleatoric (data) uncertainty. Capturing the aleatoric uncertainty permits a principled characterization and then subsequent stratification of data examples into three distinct subgroups (Easy, Ambiguous, Hard). We experimentally demonstrate the benefits of Data-IQ on four real-world medical datasets. We show that Data-IQ’s characterization of examples is most robust to variation across similarly performant (yet different) models, compared to baselines.
Since Data-IQ can be used with any ML model (including neural networks, gradient boosting etc.), this property ensures consistency of data characterization, while allowing flexible model selection. Taking this a step further, we demonstrate that the subgroups enable us to construct new approaches to both feature acquisition and dataset selection. Furthermore, we highlight how the subgroups can inform reliable model usage, noting the significant impact of the Ambiguous subgroup on model generalization.
Why is this important?
Characterizing data quality is an undervalued problem in machine learning, often considered as merely operational. Yet failing to account for it can have immense practical harm. This highlights the dire need for ML-aware data quality that is both principled and practical for a variety of ML models and use cases. Data-IQ is a systematic data-centric AI framework that can be used to characterize data samples into subgroups, applicable to a wide class of ML models. Data-IQ provides a useful tool for practitioners to understand the quality of data, guide feature acquisition or even help better sculpt their datasets.
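The stratification logic can be sketched in a few lines (thresholds and the training histories below are illustrative, not the paper's calibrated values): for each example we track the model's predicted probability of the true label across training checkpoints, then use average confidence and average aleatoric uncertainty p(1-p) to assign a subgroup.

```python
def stratify(confidences_per_epoch, conf_thresh=0.75, unc_thresh=0.2):
    """Toy Data-IQ-style stratification. For each example we are given
    the model's predicted probability of the *true* label at several
    training checkpoints. High aleatoric uncertainty -> Ambiguous;
    otherwise high confidence -> Easy, low confidence -> Hard."""
    groups = []
    for probs in confidences_per_epoch:
        conf = sum(probs) / len(probs)
        alea = sum(p * (1 - p) for p in probs) / len(probs)
        if alea >= unc_thresh:
            groups.append("Ambiguous")
        elif conf >= conf_thresh:
            groups.append("Easy")
        else:
            groups.append("Hard")
    return groups

# Three examples tracked over four training checkpoints.
history = [
    [0.9, 0.95, 0.97, 0.99],  # confidently right throughout
    [0.4, 0.6, 0.45, 0.55],   # hovers near 0.5: high aleatoric uncertainty
    [0.1, 0.05, 0.1, 0.08],   # confidently wrong throughout
]
labels = stratify(history)
```

Because the procedure only needs per-checkpoint confidences, it applies to any iteratively trained model, which is what makes the characterization consistent across neural networks, gradient boosting, and other model classes.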
Abstract
Consider making a prediction over new test data without any opportunity to learn from a training set of labelled data – instead given access to a set of expert models and their predictions alongside some limited information about the dataset used to train them.
In scenarios from finance to the medical sciences, and even consumer practice, stakeholders have developed models on private data they either cannot, or do not want to, share. Given the value of, and legislation surrounding, personal information, it is not surprising that only the models, and not the data, will be released; the pertinent question becomes: how best to use these models? Previous work has focused on global model selection or ensembling, resulting in a single final model across the feature space. However, machine learning models perform notoriously poorly on data outside their training domain, so we argue that when ensembling models, the weightings for individual instances must reflect the models’ respective domains; in other words, more attention should be paid to models that are more likely to have seen information relevant to that instance.
We introduce a method for such an instance-wise ensembling of models, including a novel representation learning step for handling sparse high-dimensional domains. Finally, we demonstrate the need and generalisability of our method on classical machine learning tasks as well as highlighting a real world use case in the pharmacological setting of vancomycin precision dosing.
Why is this important?
This is an important yet unexplored problem: especially in healthcare, it is very common not to have the original labelled training data available to us, given privacy considerations. Without having trained the models ourselves, it is very challenging to know how well the models given to us should perform, making the task of choosing the most appropriate model (or ensemble) difficult. In the paper we introduce the first method for addressing this problem with instance-wise predictions, allowing for personalisation to the test context. There are potentially many applications for a good unsupervised ensemble method; one example that we explore in the paper is precision dosing using population pharmacodynamics models.
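The instance-wise weighting idea can be sketched as follows (a minimal toy, assuming each expert's training domain is summarised by a centroid; the paper instead learns representations for sparse high-dimensional domains, and the two experts below are invented constants): experts whose training domain lies closer to the test instance receive more weight.

```python
import math

def instance_weights(x, domain_centroids, temp=1.0):
    """Weight each expert model by how close the test instance lies to a
    summary (here, a centroid) of that model's training domain, via a
    softmax over negative squared distances."""
    scores = []
    for c in domain_centroids:
        sq = sum((a - b) ** 2 for a, b in zip(x, c))
        scores.append(-sq / temp)
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_predict(x, models, centroids):
    """Instance-wise ensemble: mix the experts' predictions with weights
    that reflect each model's training domain."""
    w = instance_weights(x, centroids)
    return sum(wi * f(x) for wi, f in zip(w, models))

# Two hypothetical experts trained on disjoint regions of feature space.
model_a = lambda x: 1.0    # expert trained near (0, 0)
model_b = lambda x: -1.0   # expert trained near (10, 10)
centroids = [(0.0, 0.0), (10.0, 10.0)]

near_a = ensemble_predict((0.5, 0.5), [model_a, model_b], centroids)
near_b = ensemble_predict((9.5, 9.5), [model_a, model_b], centroids)
```

A global ensemble would average the two experts everywhere; the instance-wise weights instead recover each expert's prediction inside its own domain.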
Abstract
Consider the problem of improving the estimation of conditional average treatment effects (CATE) for a target domain of interest by leveraging related information from a source domain with a different feature space.
This heterogeneous transfer learning problem for CATE estimation is ubiquitous in areas such as healthcare where we may wish to evaluate the effectiveness of a treatment for a new patient population for which different clinical covariates and limited data are available. In this paper, we address this problem by introducing several building blocks that use representation learning to handle the heterogeneous feature spaces and a flexible multi-task architecture with shared and private layers to transfer information between potential outcome functions across domains.
Then, we show how these building blocks can be used to recover transfer learning equivalents of the standard CATE learners. On a new semi-synthetic data simulation benchmark for heterogeneous transfer learning we not only demonstrate performance improvements of our heterogeneous transfer causal effect learners across datasets, but also provide insights into the differences between these learners from a transfer perspective.
Why is this important?
TBC
Abstract
Structure-based drug design (SBDD) aims to design small-molecule ligands that bind with high affinity and specificity to pre-determined protein targets. Traditional SBDD pipelines start with large-scale docking of compound libraries from public databases, thus limiting the exploration of chemical space to existing, previously studied regions.
Recent machine learning methods approached this problem using an atom-by-atom generation approach, which is computationally expensive. In this paper, we formulate SBDD as a 3D-conditional generation problem and present DiffSBDD, an E(3)-equivariant 3D-conditional diffusion model that generates novel ligands conditioned on protein pockets. Furthermore, we curate a new dataset of experimentally determined binding complex data from Binding MOAD to provide a realistic binding scenario that complements the synthetic CrossDocked dataset.
Comprehensive in silico experiments demonstrate the efficiency of DiffSBDD in generating novel and diverse drug-like ligands that engage protein pockets with high binding energies as predicted by in silico docking.
Why is this important?
TBC
Abstract
Closed-form differential equations, including partial differential equations and higher-order ordinary differential equations, are one of the most important tools used by scientists to model and better understand natural phenomena.
Discovering these equations directly from data is challenging because it requires modeling relationships between various derivatives that are not observed in the data (equation-data mismatch) and it involves searching across a huge space of possible equations. Current approaches make strong assumptions about the form of the equation and thus fail to discover many well-known systems.
Moreover, many of them resolve the equation-data mismatch by estimating the derivatives, which makes them inadequate for noisy and infrequently sampled systems. To this end, we propose D-CIPHER, which is robust to measurement artifacts and can uncover a new and very general class of differential equations.
We further design a novel optimization procedure, CoLLie, to help D-CIPHER search through this class efficiently. Finally, we demonstrate empirically that it can discover many well-known equations that are beyond the capabilities of current methods.
Why is this important?
Previous techniques for PDE discovery could only discover equations in a particular functional form and were often very susceptible to noise. D-CIPHER allows some parts of the equation to be any closed-form function and is much more robust to noise because it employs the variational formulation of PDEs. This improved flexibility might prove useful in discovering age-dependent epidemiological models, predator-prey models with age structure, or population models structured by age, size, and spatial position. D-CIPHER, unlike previous techniques, can potentially discover any functional form for the crucial elements of these equations, namely the rates of mortality, infection, recovery, or growth. We also examine the landscape of different types of PDEs from the discovery perspective. In particular, we propose a new general class of PDEs that admit the variational formulation (which makes it possible to circumvent derivative estimation). This definition outlines the current limits of any method that circumvents derivative estimation in that way.
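The derivative-free trick behind the variational formulation can be shown on a one-line ODE (a minimal sketch, not D-CIPHER itself; the candidate equation, test function, and solutions are chosen for illustration). For the candidate equation u' + u = 0 on [0, 1], integrating u'·φ by parts against a test function φ that vanishes at the endpoints gives a residual R = ∫ u·(φ - φ') dt that involves no derivative of the data u, so noisy or sparsely sampled u never has to be differentiated:

```python
import math

def weak_residual(u, phi, dphi, n=2000):
    """Weak-form residual of the candidate ODE u' + u = 0 on [0, 1].
    With phi(0) = phi(1) = 0, integration by parts turns
    int(u'*phi + u*phi) into  R = int u * (phi - phi') dt,
    evaluated here by the trapezoid rule."""
    h = 1.0 / n
    total = 0.0
    for i in range(n + 1):
        t = i * h
        w = 0.5 if i in (0, n) else 1.0  # trapezoid endpoint weights
        total += w * u(t) * (phi(t) - dphi(t))
    return total * h

u_true = lambda t: math.exp(-t)   # genuine solution of u' + u = 0
u_wrong = lambda t: math.exp(t)   # does not satisfy the equation
phi = lambda t: math.sin(math.pi * t)        # vanishes at 0 and 1
dphi = lambda t: math.pi * math.cos(math.pi * t)

r_true = weak_residual(u_true, phi, dphi)    # near zero
r_wrong = weak_residual(u_wrong, phi, dphi)  # far from zero
```

A discovery method can then search over candidate equations by driving such weak residuals to zero across many test functions, which is the mechanism that gives variational approaches their robustness to noise.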
Practical Approaches for Fair Learning with Multitype and Multivariate Sensitive Attributes
Tennison Liu, Alex Chan, Boris van Breugel, Mihaela van der Schaar
Workshop Paper – Algorithmic Fairness through the Lens of Causality and Privacy
Abstract
TBC
Why is this important?
There is a clear need for scalable and practical methods that can be easily incorporated into ML operations, in order to make sure models don’t inadvertently disadvantage one group over another. The ML community has responded with a number of methods designed to ensure that predictive models are fair, but most of the focus has been on single, binary sensitive attributes. The problem, however, is that in many practical applications we may have multiple attributes we would like to protect, for example both race and sex; indeed, U.S. federal law protects groups from discrimination based on nine protected classes. To address this, we introduce a fairness measure based on cross-covariance operators in RKHS that can quantify fairness in multi-type, multivariate scenarios. Subsequently, we also introduce a normalized fairness metric and a learning regularizer, two practical tools that plug a clear gap in the fair ML literature.
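The flavour of a kernel cross-covariance fairness measure can be sketched with the classic empirical HSIC statistic (a standard dependence measure in the same RKHS family; this is an illustrative stand-in, not the paper's normalized metric, and the predictions and attribute values below are invented): a value near zero indicates the model's outputs are approximately independent of the sensitive attribute.

```python
import math

def rbf(u, v, gamma=5.0):
    return math.exp(-gamma * (u - v) ** 2)

def gram(xs, gamma=5.0):
    return [[rbf(a, b, gamma) for b in xs] for a in xs]

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def hsic(xs, ys, gamma=5.0):
    """Biased empirical HSIC, trace(K H L H) / (n-1)^2: a kernel
    cross-covariance measure of dependence between model outputs xs
    and a sensitive attribute ys. Larger means less fair."""
    n = len(xs)
    K, L = gram(xs, gamma), gram(ys, gamma)
    H = [[(1.0 if i == j else 0.0) - 1.0 / n for j in range(n)]
         for i in range(n)]
    M = matmul(matmul(matmul(K, H), L), H)
    return sum(M[i][i] for i in range(n)) / (n - 1) ** 2

# Predictions that track the sensitive attribute vs. ones that do not.
sensitive = [0, 0, 0, 1, 1, 1]
unfair_preds = [0.1, 0.2, 0.15, 0.9, 0.85, 0.95]  # follow the attribute
fair_preds = [0.5, 0.1, 0.9, 0.5, 0.1, 0.9]       # same pattern in both groups

unfair = hsic(unfair_preds, sensitive)
fair = hsic(fair_preds, sensitive)
```

Because kernels handle continuous, categorical, and vector-valued inputs alike, such measures extend naturally to the multi-type, multivariate sensitive attributes the paper targets, and can be used directly as a training regularizer.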
Adaptively Identifying Patient Populations with Treatment Benefit in Clinical Trials
Alicia Curth, Alihan Hüyük, Mihaela van der Schaar
Workshop Paper – Women in Machine Learning (WiML) Workshop
Abstract
We study the problem of adaptively identifying patient subpopulations that benefit from a given treatment during a confirmatory clinical trial. This type of adaptive clinical trial, often referred to as adaptive enrichment design, has been thoroughly studied in biostatistics with a focus on a limited number of subgroups (typically two) which make up (sub)populations, and a small number of interim analysis points.
In this paper, we aim to relax classical restrictions on such designs and investigate how to incorporate ideas from the recent machine learning literature on adaptive and online experimentation to make trials more flexible and efficient. We find that the unique characteristics of the subpopulation selection problem — most importantly that (i) one is usually interested in finding subpopulations with any treatment benefit (and not necessarily the single subgroup with largest effect) given a limited budget and that (ii) effectiveness only has to be demonstrated across the subpopulation on average — give rise to interesting challenges and new desiderata when designing algorithmic solutions.
Building on these findings, we propose AdaGGI and AdaGCPI, two meta-algorithms for subpopulation construction, which focus on identifying good subgroups and good composite subpopulations, respectively. We empirically investigate their performance across a range of simulation scenarios and derive insights into their (dis)advantages across different settings.
Why is this important?
TBC