Our researchers are often asked how patient data is accessed in research settings and whether patients' privacy is protected. There is understandable concern about sharing sensitive information between healthcare institutions and research communities, but there doesn't have to be…
The Big Concern: Patient Privacy
To positively transform healthcare for the future, the machine learning community requires open access to patients’ healthcare data. From Electronic Health Records (EHR) and biobanks to disease registries and vital records, the more high-quality medical data available to researchers, the better they can understand people and their health – and the better their machine learning solutions will be at advancing healthcare and public health delivery.
At present, access to clinical data remains quite limited for researchers. Legal protections like the Health Insurance Portability and Accountability Act (HIPAA) in the United States and the General Data Protection Regulation (GDPR) in Europe – quite rightly – prevent third parties from accessing sensitive patient data. But such regulations overlook the fact that barring researchers from accessing patient information also hinders medical innovation.
A Solution: Synthetic Data
Of course, privacy is of utmost importance when it comes to information about patient health, but it cannot come at the cost of medical advancement. We need to find ways to enable access to relevant datasets that do not breach existing regulations. Simply put, we need data-sharing solutions that allow medical researchers to conduct their research effectively and protect patient privacy – the two do not have to be incompatible (as dominant narratives about data protection and AI have led us to believe).
Our research team, led by the van der Schaar Lab, has developed a novel framework (called 'Anonymization through Data Synthesis using Generative Adversarial Networks', or ADS-GAN) for generating synthetic patient records that share the characteristics of real-world records. Instead of releasing actual EHR data, the framework uses statistical models to create synthetic data from scratch. The resulting datasets are a stand-in for the real data: they contain no samples that can be traced back to individual patients, yet they retain the properties that make the original data valuable to researchers. In short, the framework produces artificial data while preserving the most important statistical properties of the real datasets for researchers to use.
ADS-GAN has two main objectives: to ensure the synthetic datasets are similar enough to the original data to be useful, and to ensure they cannot be traced back to individual patients. The model aims to serve as a safe, legal and ethical route to the open sharing of health records. The datasets it produces can be made publicly available to the machine learning community without compromising patient confidentiality – opening new opportunities for researchers to develop machine learning solutions for pressing medical problems.
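To make the second objective concrete, the sketch below computes a simplified identifiability check in the spirit of the one used to evaluate ADS-GAN: a real record counts as "at risk" if some synthetic record lies closer to it than any other real record does. This is a minimal illustration only, assuming plain Euclidean distance over already-numeric features; the exact formulation and distance weighting used by ADS-GAN are described in the original paper.

```python
import numpy as np
from scipy.spatial.distance import cdist

def identifiability_score(real: np.ndarray, synthetic: np.ndarray) -> float:
    """Fraction of real records whose nearest synthetic neighbour is closer
    than their nearest *other* real record (lower is safer)."""
    # Distance from every real record to every other real record
    real_to_real = cdist(real, real)
    np.fill_diagonal(real_to_real, np.inf)      # ignore self-distance
    nearest_real = real_to_real.min(axis=1)

    # Distance from every real record to every synthetic record
    nearest_synth = cdist(real, synthetic).min(axis=1)

    # A record counts as "at risk" if some synthetic record sits closer
    # to it than any other real patient's record does
    return float(np.mean(nearest_synth < nearest_real))

# Toy usage with random numbers standing in for EHR feature matrices
rng = np.random.default_rng(0)
real = rng.normal(size=(200, 10))
synthetic = rng.normal(size=(200, 10))
print(f"identifiability: {identifiability_score(real, synthetic):.3f}")
```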
Why not anonymization?
The most commonly used “solution” to data privacy at the moment is the anonymization of health records: the data fields are scrambled and any elements that can identify a patient are removed.
But this data, although it has been cleaned up to remove patient identifiers, is not necessarily private. With the right tools and motivation, it remains possible for those handling the data to cross-reference it after it has been anonymized and attribute it to specific patients. Such re-identification incidents have occurred in healthcare systems around the world, showing that anonymization is not always an adequate solution: it can fail to protect patient privacy while still hindering scientific progress.
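As a toy illustration of how such cross-referencing works, the sketch below joins an "anonymized" hospital extract against a public register on quasi-identifiers such as postcode, birth year and sex, re-attaching names to diagnoses. The data and the register here are entirely made up; real linkage attacks follow the same pattern, only with noisier matching.

```python
import pandas as pd

# "Anonymized" hospital extract: direct identifiers removed, quasi-identifiers kept
anonymized = pd.DataFrame({
    "postcode":   ["CB2 1TN", "CB3 9AX", "OX1 4AR"],
    "birth_year": [1956, 1987, 1990],
    "sex":        ["F", "M", "F"],
    "diagnosis":  ["type 2 diabetes", "asthma", "depression"],
})

# Publicly available register (e.g. an electoral roll) with names attached
public_register = pd.DataFrame({
    "name":       ["A. Smith", "B. Jones", "C. Patel"],
    "postcode":   ["CB2 1TN", "CB3 9AX", "OX1 4AR"],
    "birth_year": [1956, 1987, 1990],
    "sex":        ["F", "M", "F"],
})

# A simple join on the quasi-identifiers re-attaches names to diagnoses
reidentified = anonymized.merge(public_register, on=["postcode", "birth_year", "sex"])
print(reidentified[["name", "diagnosis"]])
```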
Further, governing bodies have failed to set out clear guidelines on how best to anonymize healthcare data. Health data is incredibly individualized and complex: there are outliers, biases, rare conditions, and more to consider. Simply anonymizing it strips away too much of this structure for the machine learning community to make adequate use of it.
Is synthetic data truly an unbiased alternative?
Let us say we use synthetic data in our research. A key issue that arises is that real-world data is not always fair: it is riddled with biases against patients, especially those from historically marginalized groups. Machine learning models trained on synthetic data that faithfully reflects this reality will consequently be unfair as well. For example, a model may discriminate against patients based on race even when race is never explicitly recorded, because such information can be inferred from closely correlated attributes such as postcodes. The model will pick this up from the synthetic data just as it would from the real data. So what can we do?
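A minimal sketch of this proxy effect, using simulated data rather than any real records: the model never sees the sensitive attribute, only a postcode area that happens to be strongly associated with it, yet its predictions still differ sharply between the two groups.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000

# Simulated population: group membership is never shown to the model,
# but the postcode area is strongly associated with it
group = rng.integers(0, 2, size=n)                    # hidden sensitive attribute
postcode_area = (group + (rng.random(n) < 0.1)) % 2   # ~90% aligned with group
severity = rng.normal(size=n)                         # legitimate clinical feature

# Historical outcomes that were partly driven by group membership (the bias)
label = (severity + 1.5 * group + rng.normal(scale=0.5, size=n) > 1.0).astype(int)

# Train on postcode + severity only; the sensitive attribute stays "hidden"
X = np.column_stack([postcode_area, severity])
model = LogisticRegression().fit(X, label)
pred = model.predict(X)

# The model still treats the two groups very differently via the postcode proxy
print("positive rate, group 0:", pred[group == 0].mean())
print("positive rate, group 1:", pred[group == 1].mean())
```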
At this year's Conference on Neural Information Processing Systems (NeurIPS 2021), Boris van Breugel of the van der Schaar Lab presented a solution: DECAF, a synthetic data generator that strategically removes biased edges from the causal graph underlying the data. In contrast to existing methods, DECAF produces high-quality synthetic data while providing theoretical guarantees that models trained on that data will satisfy the chosen notion of fairness.
Can we really create fair synthetic data from unfair real data while also preserving the utility of the data? It is certainly a challenge, as there are many competing understandings of fairness. But given a particular definition of fairness (e.g. demographic parity), a debiased dataset can be produced and then used to train machine learning models. Helpfully, the DECAF framework has been designed to be compatible with several popular definitions of fairness.
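As an example of what "a particular definition of fairness" looks like in practice, the snippet below checks demographic parity for a downstream predictor, the kind of audit one might run on a model trained with debiased synthetic data. The function and the toy arrays are illustrative only and are not part of DECAF itself.

```python
import numpy as np

def demographic_parity_gap(predictions: np.ndarray, sensitive: np.ndarray) -> float:
    """Absolute difference in positive-prediction rates between two groups.
    A gap near zero means the predictor satisfies demographic parity."""
    rate_a = predictions[sensitive == 0].mean()
    rate_b = predictions[sensitive == 1].mean()
    return abs(rate_a - rate_b)

# Example: binary predictions from some downstream model, plus group labels
preds = np.array([1, 0, 1, 1, 0, 1, 0, 0])
groups = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(f"demographic parity gap: {demographic_parity_gap(preds, groups):.2f}")
```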
Improving access to medical data
Generating synthetic data for machine learning research is a significant challenge but is necessary to protect patient privacy and ensure ethical research practices. Machine learning and privacy are not incompatible if we take steps to safeguard people’s data before using it in our labs.
Organisations like medConfidential have been pushing for the harmonious coexistence of ethics, medical practice, and research for years. Their work illustrates how transparency measures, more secure facilities, and patient consent can allow data sharing to co-exist with GP-patient privacy.
The van der Schaar Lab also hosts a ‘Hide-and-Seek Privacy Challenge’ each year to accelerate progress in tackling the privacy issue. Participating researchers go head-to-head to uncover the best approaches for launching or defending against privacy attacks. The challenge allows the Lab to understand the strengths and weaknesses of machine learning techniques on both sides of the privacy battle.
The AI-enabled healthcare systems and infrastructure we envision cannot reach their full potential if researchers cannot access high-quality data or if patient confidentiality is violated. Models like ADS-GAN can allow for more flexible access, help strengthen trust in how institutions handle private health records, and reinforce the broader case for AI in medicine as a societal good, one that should be pursued and that can be pursued ethically.
——
The van der Schaar Lab is a world-leading research group led by Mihaela van der Schaar, John Humphrey Plummer Professor of Machine Learning, AI and Medicine at the University of Cambridge.
Boris van Breugel is a Ph.D. student with the van der Schaar Lab. He aims to develop methods for finding meaningful structure in omics data.