IRP – M4DI

IRPs

Individual Research Projects

Each IRP will develop a particular aspect of multimodal data integration, working on different types of data, different methods or algorithms, and different biomedical research questions. The task forces are involved to different degrees in each IRP:

Multi-omics data integration

Use of prior knowledge

Exploration of health databases

Biases & interpretation

Benchmarking of the methods

IRP 1

Integrative analysis of multi-omics data for the prioritization of genomic variants

Task forces involved

IRP 2

Probing the regulatory impact of genetic variants

Task forces involved

IRP 3

Statistical inference of cellular heterogeneity using multi-omic prior biological knowledge

Task forces involved

IRP 4

Integrating prior knowledge for better patient representation

Task forces involved

Jedrzej Kubica

PhD student

Start 01/09/2024

Title: Integrative analysis of multi-omics data for the prioritization of genomic variants

Background

Exome and genome sequencing is a promising approach to discover new variants and genes involved in the etiology of diseases with a genetic component. Current methods to analyze such data work at the single-variant or single-gene level: for example, if a gene is severely impacted in several patients exhibiting a common phenotype and never affected in control individuals, this gene will be identified as a good candidate. This strategy has proven effective and has allowed us to contribute to the identification of many disease-causing genes and variants. However, extending the approach beyond the single-gene level, by leveraging the guilt-by-association principle and taking advantage of interactome, transcriptome and other publicly available omics data, constitutes a promising and currently under-explored avenue of research.
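As an illustration of the guilt-by-association principle only, the sketch below scores each gene by the fraction of its direct interactors that are already known disease genes. It is a toy example and not the method developed in this IRP; the function name, the gene identifiers and the toy interactome are all hypothetical.

```python
def guilt_by_association_scores(interactome, known_disease_genes):
    """Score each gene by the fraction of its direct interactors
    that are known disease genes (a deliberately naive scheme)."""
    scores = {}
    for gene, neighbors in interactome.items():
        if not neighbors:
            scores[gene] = 0.0
            continue
        hits = sum(1 for n in neighbors if n in known_disease_genes)
        scores[gene] = hits / len(neighbors)
    return scores


# Hypothetical undirected interactome stored as adjacency sets.
toy_interactome = {
    "G1": {"G2", "G3"},
    "G2": {"G1", "G3", "G4"},
    "G3": {"G1", "G2"},
    "G4": {"G2", "G5"},
    "G5": {"G4"},
}
known = {"G1", "G2"}  # assumed set of already-established disease genes
scores = guilt_by_association_scores(toy_interactome, known)
```

Here the candidate G3, whose interactors are all known disease genes, receives the highest score, while G5 receives zero. A realistic approach would presumably also weight interactions and combine other omics layers.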

Objectives

The goal of this IRP is to develop a multi-omics integrative analysis method to score exome-seq-derived genomic variants based on the likelihood that they are causal for a patient’s phenotype/disease.

Key-words

next generation sequencing, interactome, multi-omics, bioinformatics, biostatistics, machine learning, integrative data analysis

Download the detailed IRP profile –>

Nicolas Thierry-Mieg

CNRS / TIMC

Grenoble

HOST LAB

Sébastien Déjean

UT3 / IMT

Toulouse

SECONDMENT LAB

Elliot Butz

PhD student

Start 01/09/2024

Title: Probing the regulatory impact of genetic variants

Background

Genomic medicine is currently changing scale, with genome sequencing now accessible in patient care through diverse national sequencing initiatives. Still, most patients are waiting for a diagnosis. Data-sharing initiatives and the adoption of FAIR principles (Findability, Accessibility, Interoperability, and Reuse of digital assets) give the scientific community an unprecedented opportunity to explore and enhance the interpretation of the human genome.

In France, the Plan France Médecine Génomique 2025, led by AVIESAN, aims through the Collecteur Analyseur de Données (CAD) to provide access to well-genotyped and phenotyped datasets in order to decipher mechanisms of pathogenicity. Other nationwide initiatives, such as UK Biobank or FinnGen, provide access to pan-genomic sequencing datasets associated with deep phenotyping. There is a pressing need for tools that help clinical biologists translate these data into clinical practice, such as in silico prediction tools for genomic variants. Current clinical genetic testing focuses almost exclusively on regions of the genome that directly encode proteins. The important role of variants in non-coding regions is, however, increasingly being demonstrated, and the use of genome sequencing in clinical diagnostic settings is rising across a large range of genetic disorders.

Objectives

This research project aims at building machine learning-based models capable of providing accessible information for clinical interpretation of genetic variations in non-coding regions of the genome.

Key-words

machine learning, regulatory genomics, genetics, transcription

Download the detailed IRP profile –>

Laurent Bréhélin / Charles Lecellier / Kevin Yauy

LIRMM / IGMM / IMAG / CHU

Montpellier

HOST LAB

Cécile Capponi / Thierry Artière

LIS

Marseille

SECONDMENT LAB

Hugo Barbot

PhD student

Start 01/10/2023

Title: Statistical inference of cellular heterogeneity using multi-omic prior biological knowledge

Background

Cellular heterogeneity in biological samples is a key factor that determines disease progression, but also influences biomedical analysis of samples and patient classification. At the molecular level, the cellular composition of tissues is difficult to assess and quantify, as it is hidden within the bulk molecular profiles of samples (average profile of millions of cells), with all cells present in the tissue contributing to the recorded signal. Despite great promise, conventional computational approaches to quantifying cellular heterogeneity from mixtures of cells have encountered difficulties in providing robust and biologically relevant estimates.

Objectives

So far, most statistical methods used for cell deconvolution ignore the biological relationships between the molecular features used in the models. Our goal is to provide a statistical framework for deconvolution including (i) the stochastic dependence across molecular features induced by the mutual regulation mechanisms; (ii) the a priori knowledge of the topology of multilayer interaction networks; and (iii) the similarity between samples that may be induced by controlled experimental conditions.
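For context, a minimal sketch of the kind of baseline the paragraph above contrasts with: a reference-based deconvolution that treats every molecular feature independently and fits non-negative cell-type proportions by least squares. It is not the framework proposed in this IRP. The function name, the toy signature matrix and the choice of projected gradient descent as the solver are assumptions made for illustration.

```python
def deconvolve(signature, bulk, n_iter=500):
    """Estimate proportions p >= 0 minimizing ||signature @ p - bulk||^2.

    signature: list of rows (one per molecular feature), one column per cell type.
    bulk: list of bulk measurements, one per feature.
    Returns the proportions rescaled to sum to 1.
    """
    n_features = len(signature)
    n_types = len(signature[0])
    # Conservative step size: 1 / (2 * squared Frobenius norm of the signature).
    step = 1.0 / (2.0 * sum(v * v for row in signature for v in row))
    p = [1.0 / n_types] * n_types
    for _ in range(n_iter):
        residual = [
            sum(signature[i][k] * p[k] for k in range(n_types)) - bulk[i]
            for i in range(n_features)
        ]
        grad = [
            2.0 * sum(signature[i][k] * residual[i] for i in range(n_features))
            for k in range(n_types)
        ]
        # Gradient step, then projection onto the non-negative orthant.
        p = [max(0.0, p[k] - step * grad[k]) for k in range(n_types)]
    total = sum(p)
    return [v / total for v in p] if total > 0 else p


# Hypothetical signature: 3 features x 2 cell types, mixed at 30% / 70%.
toy_signature = [[10.0, 1.0], [1.0, 10.0], [5.0, 5.0]]
toy_bulk = [3.7, 7.3, 5.0]
estimated = deconvolve(toy_signature, toy_bulk)
```

On this noise-free toy mixture the estimate recovers the 30% / 70% proportions. The sketch uses no dependence between features, no network topology and no similarity between samples, which are the three ingredients listed in the objectives.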

Key-words

cell deconvolution, gene network modeling, high-dimensional inference, multi-omic data integration

Download the detailed IRP profile –>

David Causeur

IRMAR

Rennes

HOST LAB

Yuna Blum & Magali Richard

IGDR / TIMC

Rennes / Grenoble

SECONDMENT LAB

Océane Carpentier

PhD student

Start 01/09/2024

Title: Integrating prior knowledge for better patient representation

Background

One of the current challenges of precision medicine is to integrate heterogeneous data for the most adequate description of the patient. Today, biomedical data come from multiple sources (biomic data, imaging, microbiota, clinical notes, drug prescriptions, claim databases…), each data type being structured in a specific way. These available data can also be enriched with a priori information. For example, it is possible to link the biomic data to interaction graphs, the imaging data to features known to be relevant for diagnosis, the microbiota to functional annotations, or prescriptions to drug knowledge bases.

Objectives

The co-supervised PhD project aims to develop methods for analyzing patient data that harness prior knowledge to improve performance. We will focus on enhancing i) biomic and ii) medical and administrative data available for patients using prior knowledge. Knowledge integration will be based on semantic web technologies or more broadly on knowledge graphs, widely used by the community to structure information. The aim is to quantify the contribution of this a priori information to classical risk analysis models as well as to more complex dimension reduction models, such as auto-encoders.

Key-words

multimodal data integration, machine learning, knowledge models

Download the detailed IRP profile –>

Emmanuelle Becker / Yann Le Cunff

IRISA

Rennes

HOST LAB

Nicolas Jay / Aurélie Bannay

LORIA

Nancy

SECONDMENT LAB

IRP 5

Defining patients’ phenotype from a mixed approach based on data and expert knowledge

Task forces involved

IRP 6

Extracting genotype-phenotype associations from network integration of deep and multiscale phenotype data

Task forces involved

IRP 7

Predictive properties of models/algorithms according to their contexts of use (contextualized predictions)

Task forces involved

IRP 8

Towards better quality, reliability and fairness of large-scale distributed heterogeneous healthcare data

Task forces involved

Renée Le Clech

PhD student

Start 01/09/2024

Title: Defining patients' phenotype from a mixed approach based on data and expert knowledge

Background

One of the challenges in using data from health databases is the characterization of phenotypes and the grouping of patients sharing the same phenotype using variables of different modalities and scales. For example, patients suffering from diabetes can be identified from the medication they take, examinations they have undergone, or a diagnosis of diabetes during hospitalization. This grouping can be done through an expert approach, using the expert metadata associated with the variables. In addition, the classification of the metadata, for instance in ontologies, can be exploited. This grouping can also be done in a data-driven way, using correlations within the data. Indeed, variables concerning the same phenotype are correlated with each other: for example, when a headache occurs, the management will involve several simultaneous actions. This is in line with the work that we already performed to identify patient subgroups from longitudinal data. Importantly, the identification of a group of patients who have the same phenotype using both data-driven variable correlations and expert metadata is still missing in the field. Such an approach is a task involving unsupervised learning methods with the challenge of calculating distances between patients incorporating all this information.

Objectives

The aim of this thesis project is to develop a generic method for identifying subgroups of patients with the same phenotype from health databases, jointly using variable correlations and expert data, and to implement it in a software package.

Key-words

clustering, medical knowledge representation, rare diseases

Download the detailed IRP profile –>

Anne-Sophie Jannot / Nicolas Garcelon

HeKA team

Paris

HOST LAB

Nicolas Jay / Aurélie Bannay

LORIA

Nancy

SECONDMENT LAB

Florence Ghestem

PhD student

Start 01/09/2024

Title: Extracting genotype-phenotype associations from network integration of deep and multiscale phenotype data

Background

Deep phenotypic data are available in health databases through medical questionnaires and electronic health records (EHR), which can include drug prescriptions, lab results, information extracted from notes using natural language processing, or billing codes. Informative phenotypes are however often not directly available. There are two categories of methods to define disease phenotypes from health databases. First, methods relying on a predefined validated algorithm specifically created for a given phenotype to identify cases and controls (expert approach using metadata). Second, data-driven methods relying on automated approaches. Network-based approaches belong to this second category and rely on similarity measures between individuals to identify homogeneous subgroups (phenotypic clusters).
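To make the network-based idea concrete, here is a minimal sketch under simple assumptions: each patient is described by a set of items (for instance prescribed drugs), two patients are linked when their Jaccard similarity reaches a threshold, and the connected components of the resulting graph are read as phenotypic clusters. The function names, the threshold and the toy profiles are hypothetical, and real approaches generally use richer similarity measures and community detection.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def patient_clusters(profiles, threshold=0.5):
    """Link patients whose Jaccard similarity is >= threshold and
    return the connected components as sorted lists of patient ids."""
    ids = sorted(profiles)
    adjacency = {pid: set() for pid in ids}
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if jaccard(profiles[a], profiles[b]) >= threshold:
                adjacency[a].add(b)
                adjacency[b].add(a)
    clusters, seen = [], set()
    for pid in ids:
        if pid in seen:
            continue
        # Depth-first traversal to collect one connected component.
        stack, component = [pid], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters


# Hypothetical patient profiles (sets of prescribed drugs).
toy_profiles = {
    "P1": {"metformin", "insulin"},
    "P2": {"metformin", "insulin", "statin"},
    "P3": {"salbutamol", "budesonide"},
    "P4": {"salbutamol"},
}
clusters = patient_clusters(toy_profiles)
```

With the default threshold, the toy data split into two groups, P1 with P2 and P3 with P4.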

Objectives

We recently developed an unsupervised network approach to analyze drug prescriptions from the EGB, a French medico-administrative database, and identified clinically relevant patient subgroups. We aim to extend this work by integrating a broader range of multiscale phenotypic and genomic data, in order to extract clinically meaningful information, identify patient subgroups and correlate these subgroups with genomic variations.

Key-words

multilayer network, clustering, genomic data, electronic health records (EHR)

Download the detailed IRP profile –>

Anne-Louise Leutenegger / Anne-Sophie Jannot / Andrée Delahaye-Duriez

NeuroDiderot / HeKa Team

Paris

HOST LAB

Anaïs Baudot

MMG

Marseille

SECONDMENT LAB

Joanna Pautonnier

PhD student

Start 03/02/2025

Title: Predictive properties of models/algorithms according to their contexts of use (contextualized predictions)

Background

Diagnostic, prognostic or theranostic (clinical response) models will gradually be considered as medical devices. In the context of precision medicine, the performance of machine learning or deep learning models, as calculated on available data, does not meet the expectations of clinicians who need support in medical decision making. It is necessary to estimate the individual predictive properties of these models in terms of accuracy and possible bias, as a function of the density of training data in the vicinity of new individual profiles.
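One crude way to quantify "density of training data in the vicinity" of a new profile is the mean distance to its k nearest training points: the larger the distance, the sparser the neighborhood and the less support a prediction has. The sketch below is only an assumption about how such a proxy could look, not the estimator studied in this IRP; the function name and the toy points are hypothetical.

```python
import math


def knn_mean_distance(train_points, new_point, k=3):
    """Mean Euclidean distance from new_point to its k nearest
    training points; smaller values indicate a denser neighborhood."""
    distances = sorted(math.dist(p, new_point) for p in train_points)
    nearest = distances[:k]
    return sum(nearest) / len(nearest)


# Hypothetical two-dimensional training profiles.
train = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
dense = knn_mean_distance(train, (0.5, 0.5))   # surrounded by training data
sparse = knn_mean_distance(train, (5.0, 5.0))  # far from most training data
```

A prediction for the second profile would then be flagged as less well supported than one for the first.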

Objectives

This PhD project is in line with the current trend towards predictive, preventive, personalized and participatory medicine, closer to the patient. (1) The objective is to evaluate the performance of machine learning and deep learning models by estimating the quality of their predictive properties in terms of bias and uncertainty according to the context of their applications (contextualized predictions). In addition, statistical methods will be implemented to improve these predictive properties. (2) The individual and combined (synergistic) effects of variables of different types will also be studied by extending the approaches developed in the context of machine learning (Roy 1998, Charvat 2013).

Key-words

synergy, contextualized prediction, multimodal data

Download the detailed IRP profile –>

Pascal Roy / Mathieu Fauvernier

LBBE

Lyon

HOST LAB

David Causeur

IRMAR

Rennes

SECONDMENT LAB

Silvia Grosso

PhD student

Start 08/11/2023

Title: Towards better quality, reliability and fairness of large-scale distributed heterogeneous healthcare data

Background

Federated Learning (FL) is a promising paradigm that is gaining traction in the context of privacy-preserving Machine Learning (ML). Thanks to FL, several data owners (e.g., health organizations) can collaboratively train a model on their private and decentralized data, without having to send their raw data to external service providers, thus keeping their data private. FL helps make 5P medicine a reality, i.e., allowing personalized, predictive, preventive, participatory and populational medicine. In this context, the integration of data, without a priori knowledge of the available data and their semantics, is a major challenge (60). Furthermore, although FL has improved the privacy of ML by decentralizing the data and the learning process, a line of recent literature shows that FL can exacerbate the problem of bias and unfairness (61), in which models produce unfair decisions due to the use of incomplete, faulty, or prejudicial datasets and models. This risk stems from the decentralized nature of FL, where data distribution and size are particularly heterogeneous across participants.
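The collaborative training described above can be sketched with a simplified federated averaging loop: each client takes one gradient step on its own data, and the server only sees and averages the resulting parameters, weighted by client sample size. This is a toy sketch, not the protocol developed in this IRP. The one-parameter model, the function names and the client datasets are assumptions, and mechanisms such as secure aggregation are left out.

```python
def local_update(weight, data, lr=0.1):
    """One gradient step of least squares for y ~ weight * x,
    computed on a single client's private (x, y) pairs."""
    grad = sum(2 * (weight * x - y) * x for x, y in data) / len(data)
    return weight - lr * grad


def federated_round(global_weight, client_datasets, lr=0.1):
    """Each client updates locally; the server averages the returned
    weights in proportion to client sample size. Raw data stay local."""
    updates = [(local_update(global_weight, d, lr), len(d)) for d in client_datasets]
    total = sum(n for _, n in updates)
    return sum(w * n for w, n in updates) / total


# Hypothetical clients of unequal size, all following y = 2 * x.
clients = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(1.0, 2.0), (3.0, 6.0), (2.0, 4.0)],
]
w = 0.0
for _ in range(200):
    w = federated_round(w, clients, lr=0.05)
```

After these rounds the shared weight converges to the common slope of 2. Because the average is weighted by sample size, larger clients dominate the aggregate, which hints at how heterogeneous data sizes and distributions can feed the bias issues mentioned above.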

Objectives

This IRP aims to provide FL data management methods and distributed protocols for handling heterogeneous and large-scale healthcare data, through: (i) Multimodal FL for diagnosis, prognosis and therapeutic response; (ii) Contextualized FL prediction for adapting FL models to their context of use; (iii) Appropriate representation of metadata to better estimate data quality and reliability; (iv) Fairness and bias mitigation in healthcare FL models; and (v) A multi-objective approach to take into account privacy, fairness and model accuracy aspects, these objectives being usually antagonistic.

Key-words

bias, fairness, quality, healthcare data, multimodality, federated learning

Download the detailed IRP profile –>

Sara Bouchenak / Mohand-Saïd Hacid

LIRIS

Lyon

HOST LAB

Delphine Maucourt-Boulch / Pascal Roy

IRMAR

Rennes