Synthetic data can be used to increase fairness and protect patient confidentiality in medical research

7 October 2021 by Navneet Gidda

Our researchers are often asked how patient data is accessed and whether patients' privacy is protected in research settings. There is understandable concern about sharing sensitive information between healthcare institutions and research communities, but there doesn't have to be…

The Big Concern: Patient Privacy

To positively transform healthcare for the future, the machine learning community requires open access to patients’ healthcare data. From Electronic Health Records (EHR) and biobanks to disease registries and vital records, the more high-quality medical data available to researchers, the better they can understand people and their health – and the better their machine learning solutions will be at advancing healthcare and public health delivery.

At present, access to clinical data remains quite limited for researchers. Legal protections like the Health Insurance Portability and Accountability Act (HIPAA) in America and the General Data Protection Regulation (GDPR) in Europe – quite rightfully – prevent third parties from accessing sensitive patient data. But such regulations overlook the fact that barring researchers from accessing patient information also hinders medical innovation.

A Solution: Synthetic Data

Of course, privacy is of the utmost importance when it comes to information about patient health, but it cannot come at the cost of medical advancement. We need to find ways to enable access to relevant datasets that do not breach existing regulations. Simply put, we need data-sharing solutions that allow medical researchers to conduct their research effectively while protecting patient privacy – the two do not have to be incompatible, despite what dominant narratives about data protection and AI have led us to believe.

Our research team, led by the van der Schaar Lab, has come up with a novel framework (called 'Anonymization through Data Synthesis using Generative Adversarial Networks', or ADS-GAN) for generating synthetic patient records that share the characteristics of real-world records. Instead of releasing actual EHR data, the framework uses statistical models to create synthetic data from scratch. The resulting dataset is a stand-in for the real data: it contains no records that can be traced back to an individual patient, yet it retains the properties that make the original data valuable to researchers. In short, the data is artificial, but it preserves what matters most for research.
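
To give a flavour of how this kind of generative approach works, here is a minimal sketch of adversarial training for tabular records in PyTorch. It is illustrative only: the network sizes, the number of columns and the training details are assumptions, and the real ADS-GAN additionally conditions generation on the original records and penalises identifiability, as described in the paper.

```python
# Minimal sketch of GAN-based tabular data synthesis (illustrative only).
import torch
import torch.nn as nn

N_FEATURES = 20   # number of numerically encoded EHR columns (assumption)
NOISE_DIM = 32    # size of the random noise fed to the generator

# Generator maps random noise to a synthetic patient record.
generator = nn.Sequential(
    nn.Linear(NOISE_DIM, 64), nn.ReLU(),
    nn.Linear(64, N_FEATURES),
)
# Discriminator tries to tell real records from synthetic ones.
discriminator = nn.Sequential(
    nn.Linear(N_FEATURES, 64), nn.ReLU(),
    nn.Linear(64, 1),
)

opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def train_step(real_batch: torch.Tensor) -> None:
    batch_size = real_batch.shape[0]
    fake_batch = generator(torch.randn(batch_size, NOISE_DIM))

    # Discriminator step: real records labelled 1, synthetic labelled 0.
    d_loss = bce(discriminator(real_batch), torch.ones(batch_size, 1)) + \
             bce(discriminator(fake_batch.detach()), torch.zeros(batch_size, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: produce records the discriminator accepts as real.
    g_loss = bce(discriminator(fake_batch), torch.ones(batch_size, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```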

ADS-GAN has two main objectives: to ensure that the synthetic datasets are statistically similar to the original data, and to ensure that they cannot be traced back to individual patients. The model aims to serve as a safe, legal and ethical route to open sharing of health records. The datasets it produces can be made publicly available to the machine learning community without compromising patient confidentiality – opening new opportunities for researchers to develop machine learning solutions for pressing medical problems.
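
A simple way to sanity-check the second objective is to ask whether any synthetic record sits suspiciously close to a real one. The sketch below is a generic nearest-neighbour heuristic, not the formal identifiability metric defined in the ADS-GAN paper, and the threshold choice is an assumption for illustration.

```python
# Rough nearest-neighbour check: does any synthetic record sit "too close"
# to a real patient? (Illustrative heuristic, not the paper's formal metric.)
import numpy as np
from sklearn.neighbors import NearestNeighbors

def suspicious_rows(real: np.ndarray, synthetic: np.ndarray,
                    quantile: float = 0.05) -> np.ndarray:
    # How close are real records to each other? (column 0 is the point itself)
    real_gaps = NearestNeighbors(n_neighbors=2).fit(real).kneighbors(real)[0][:, 1]
    threshold = np.quantile(real_gaps, quantile)

    # Distance from each synthetic record to its nearest real record.
    syn_gaps = NearestNeighbors(n_neighbors=1).fit(real).kneighbors(synthetic)[0][:, 0]

    # Flag synthetic rows that are closer to a real patient than real
    # patients typically are to one another.
    return np.where(syn_gaps < threshold)[0]
```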


Why not anonymization?

The most commonly used “solution” to data privacy at the moment is the anonymization of health records: the data fields are scrambled and any elements that can identify a patient are removed. 

But this data, although it has been cleaned up to remove patient identifiers, is not necessarily private. It remains possible, with the right tools and motivation, for those handling the data to cross-reference it after it has been anonymized and attribute it to specific patients. Such re-identification incidents have occurred in healthcare systems around the world, showing that anonymization is not always an adequate answer to data privacy: it can fail to protect patients while still hindering scientific progress.
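
To make the cross-referencing risk concrete, the toy example below uses entirely hypothetical data and column names to show how an 'anonymized' table, with direct identifiers removed, can still be joined back to a public record via quasi-identifiers such as postcode, birth year and sex.

```python
# Toy linkage attack on "anonymized" data (hypothetical records throughout).
import pandas as pd

# Hospital extract: names and NHS numbers removed, quasi-identifiers kept.
anonymized = pd.DataFrame({
    "postcode":   ["CB2 1TN", "CB1 2AB"],
    "birth_year": [1956, 1988],
    "sex":        ["F", "M"],
    "diagnosis":  ["type 2 diabetes", "asthma"],
})

# Publicly available record, e.g. an electoral roll or social-media profile.
public_record = pd.DataFrame({
    "name":       ["Jane Doe"],
    "postcode":   ["CB2 1TN"],
    "birth_year": [1956],
    "sex":        ["F"],
})

# Joining on the quasi-identifiers re-attaches a name to a diagnosis.
reidentified = public_record.merge(anonymized, on=["postcode", "birth_year", "sex"])
print(reidentified[["name", "diagnosis"]])
```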

Further, governing bodies have failed to set out clear guidelines on how best to anonymize healthcare data. Health data is incredibly individualized and complex: there are outliers, biases, rare conditions, and more to be considered. Simply stripping identifiers is too blunt an approach, and the resulting data is often too degraded for the machine learning community to make adequate use of.

Is synthetic data truly an unbiased alternative?

Let us say we use synthetic data in our research. A key issue is that real-world data is not always fair: it is riddled with biases against patients, especially those from historically marginalized groups. Machine learning models trained on synthetic data that faithfully reflects this reality will consequently be unfair as well. For example, a model may discriminate against patients based on race even if race is never recorded explicitly, because such information can be inferred from closely linked attributes like postcodes. The model will pick this up from the synthetic data just as it would from the real data. So what can we do?

At this year's Conference on Neural Information Processing Systems (NeurIPS 2021), Boris van Breugel of the van der Schaar Lab presented a solution: DECAF, a synthetic data generator that strategically removes the biased relationships (edges) in the data-generating process. In contrast to existing methods, DECAF produces high-quality synthetic data and provides theoretical guarantees that models trained on such data will operate fairly.

Can we really create fair synthetic data from unfair real data while preserving the data's utility? It is certainly a challenge, not least because there are many competing definitions of fairness. But given a particular definition (e.g. demographic parity), a debiased dataset can be produced and then used to train machine learning models. Helpfully, the DECAF framework has been designed to be compatible with several popular definitions of fairness.
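
As a concrete illustration, demographic parity asks that a model's positive-prediction rate be (roughly) equal across groups. The small helper below is a hypothetical auditing snippet, not part of DECAF itself, that computes the gap such a definition cares about.

```python
# Hypothetical fairness audit: demographic parity gap of a binary classifier.
import numpy as np

def demographic_parity_gap(predictions: np.ndarray, group: np.ndarray) -> float:
    """Absolute difference in positive-prediction rates between two groups.

    predictions: binary model outputs (0/1); group: binary protected attribute.
    A gap near 0 means the predictions satisfy demographic parity.
    """
    rate_a = predictions[group == 0].mean()
    rate_b = predictions[group == 1].mean()
    return abs(rate_a - rate_b)

# Example with random predictions and group labels.
rng = np.random.default_rng(0)
preds = rng.integers(0, 2, size=1000)
groups = rng.integers(0, 2, size=1000)
print(f"Demographic parity gap: {demographic_parity_gap(preds, groups):.3f}")
```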

Improving access to medical data

Generating synthetic data for machine learning research is a significant challenge but is necessary to protect patient privacy and ensure ethical research practices. Machine learning and privacy are not incompatible if we take steps to safeguard people’s data before using it in our labs. 

Organisations like medConfidential have been pushing for the harmonious coexistence of ethics, medical practice, and research for years. Their work illustrates how transparency measures, more secure facilities, and patient consent can allow data sharing to co-exist with GP-patient confidentiality.

The van der Schaar Lab also hosts a ‘Hide-and-Seek Privacy Challenge’ each year to accelerate progress in tackling the privacy issue. Participating researchers go head-to-head to uncover the best approaches for launching or defending against privacy attacks. The challenge allows the Lab to understand the strengths and weaknesses of machine learning techniques on both sides of the privacy battle.

The AI-enabled healthcare systems and infrastructure we envision cannot reach their full potential if researchers cannot access high-quality data or if patient confidentiality is violated. Models like ADS-GAN can allow for more flexible access, help strengthen trust in institutions' handling of private health records, and reinforce the broader narrative of AI for medicine as a societal good – one that should be pursued, and can be pursued ethically.

——

The van der Schaar Lab is a world-leading research group led by Mihaela van der Schaar, John Humphrey Plummer Professor of Machine Learning, AI and Medicine at the University of Cambridge. 

Boris van Breugel is a Ph.D. student with the van der Schaar Lab. He aims to develop methods for finding meaningful structure in omics data. 

Read the full paper on ADS-GAN → 
van der Schaar Lab: Synthetic Data → 
Like this? Find more on our blog → 

Category: Blog | Tags: ai bias, bias, confidentiality, data privacy, EHR, fairness, health data, health records, healthcare, privacy

