
Tabular datasets encountered in the real world often contain distinct feature sets, or views, that originate from different sources of measurement. For instance, the UK Biobank contains measurements of sociodemographic factors, heart and lung function, genomic data, and electronic health records. Each provides information on a different aspect of a patient’s medical state, yet they also depend on one another to form a holistic medical context.
While different feature sets can be consolidated into a single table, doing so can result in suboptimal learning performance due to heterogeneity among feature sets and the loss of valuable relational information.
In their paper for ICML 2023, Tennison Liu, Jeroen Berrevoets, Zhaozhi Qian, and Mihaela van der Schaar address unsupervised representation learning on tabular data containing multiple views generated by distinct sources of measurement. Traditional methods assume that all feature sets share the same information, so that only globally shared factors need to be learned, but this assumption is not always valid for real-world tabular datasets with complex dependencies between feature sets.
To overcome this, our researchers propose a data-driven approach that represents feature sets as graph nodes and their relationships as learnable edges. They introduce LEGATO, a hierarchical graph auto-encoder that learns a smaller, latent graph to dynamically aggregate information from multiple views. This results in specialised latent graph components that capture localised information from different regions of the input, improving downstream performance.
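To give a rough feel for what "learning a smaller latent graph" can mean, the sketch below coarsens a graph of views into fewer latent nodes using a soft assignment matrix. This is a generic graph-pooling step in the style of DiffPool, not LEGATO's actual architecture. The function name `pool_views`, the dimensions, and the random embeddings and logits are all assumptions made for illustration; in a trained model the edge weights and assignment logits would be learned rather than fixed.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(m):
    return [list(col) for col in zip(*m)]


def pool_views(node_embeddings, adjacency, assign_logits):
    """Coarsen a graph of views into a smaller latent graph (hypothetical helper).

    node_embeddings: (n_views, dim) one embedding per feature set
    adjacency:       (n_views, n_views) edge weights between views
    assign_logits:   (n_views, n_latent) unnormalised assignment scores
    """
    # Each view is softly assigned to the latent nodes (rows sum to 1).
    s = [softmax(row) for row in assign_logits]
    s_t = transpose(s)
    latent_nodes = matmul(s_t, node_embeddings)      # (n_latent, dim)
    latent_adj = matmul(matmul(s_t, adjacency), s)   # (n_latent, n_latent)
    return s, latent_nodes, latent_adj


random.seed(0)
n_views, dim, n_latent = 4, 3, 2  # illustrative sizes, not from the paper

# Stand-in view embeddings; a real model would produce these with per-view encoders.
node_embeddings = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_views)]
# Fully connected input graph without self-loops; fixed here, learnable in practice.
adjacency = [[0.0 if i == j else 1.0 for j in range(n_views)] for i in range(n_views)]
# Random assignment scores; fixed here, learnable in practice.
assign_logits = [[random.gauss(0, 1) for _ in range(n_latent)] for _ in range(n_views)]

assignment, latent_nodes, latent_adj = pool_views(node_embeddings, adjacency, assign_logits)
```

Because each view spreads its weight across the latent nodes, different latent nodes can end up summarising different groups of views, which is the intuition behind specialised latent components.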
Find out more about this in our paper here.
See all of CCAIM’s contributions to ICML 2023 here.