Each IRP will develop a particular aspect of multimodal data integration, working on different types of data, different methods or algorithms, and different biomedical research questions. Different task forces are involved, to varying degrees, in each IRP:
Multi-omics data integration
Use of prior knowledge
Exploration of health databases
Biases & interpretation
Benchmarking of the methods
IRP 1
Integrative analysis of multi-omics data for the prioritization of genomic variants
Task forces involved
IRP 2
Probing the regulatory impact of genetic variants
Task forces involved
IRP 3
Statistical inference of cellular heterogeneity using multi-omic prior biological knowledge
Task forces involved
IRP 4
Integrating prior knowledge for better patient representation
Task forces involved
Jedrzej Kubica
PhD student
Start 01/09/2024
Title: Integrative analysis of multi-omics data for the prioritization of genomic variants
Background
Exome and genome sequencing is a promising approach to discover new variants and genes involved in the etiology of diseases with a genetic component. Current methods to analyze such data work at the single-variant or single-gene levels: for example, if a gene is severely impacted in several patients exhibiting a common phenotype and never affected in control individuals, this gene will be identified as a good candidate. This strategy has proven effective and has allowed us to contribute to the identification of many disease-causing genes and variants. However, extending the approach beyond the single-gene level, by leveraging the guilt-by-association principle and taking advantage of interactome, transcriptome and other publicly available omics data, constitutes a promising and currently under-explored avenue of research.
Objectives
The goal of this IRP is to develop a multi-omics integrative analysis method to score exome-seq-derived genomic variants based on the likelihood that they are causal for a patient’s phenotype/disease.
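To make the guilt-by-association idea concrete, here is a minimal sketch that scores candidate genes by random walk with restart on a toy interactome: genes close to already-implicated seed genes receive higher scores, which could then be combined with variant-level evidence from exome-seq. The gene names, network and parameters are illustrative assumptions, not the method this IRP will develop.

```python
# Minimal guilt-by-association sketch (illustrative only): random walk with
# restart (RWR) on a toy protein-protein interaction network.
import numpy as np

genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E"]  # hypothetical genes
# Symmetric adjacency matrix of a small interactome.
A = np.array([
    [0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0],
    [1, 1, 0, 0, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 1, 0],
], dtype=float)

# Column-normalize to obtain a transition matrix.
W = A / A.sum(axis=0)

# Restart vector: probability mass on genes already implicated in the phenotype.
seeds = {"GENE_A"}
p0 = np.array([1.0 if g in seeds else 0.0 for g in genes])
p0 /= p0.sum()

# Iterate p <- (1 - r) * W p + r * p0 until convergence.
r, p = 0.3, p0.copy()
for _ in range(1000):
    p_next = (1 - r) * W @ p + r * p0
    if np.abs(p_next - p).sum() < 1e-10:
        break
    p = p_next

# Higher stationary probability = closer to the seed genes in the interactome;
# such network scores could be combined with per-variant impact scores.
for g, score in sorted(zip(genes, p), key=lambda x: -x[1]):
    print(f"{g}\t{score:.3f}")
```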
Key-words
next generation sequencing, interactome, multi-omics, bioinformatics, biostatistics, machine learning, integrative data analysis
Download the detailed IRP profile →
Nicolas Thierry-Mieg
CNRS / TIMC
Grenoble
HOST LAB
Sébastien Déjean
UT3 / IMT
Toulouse
SECONDMENT LAB
Elliot Butz
PhD student
Start 01/09/2024
Title: Probing the regulatory impact of genetic variants
Background
Genomic medicine is changing scale, with genome sequencing now accessible in patient care through various national sequencing initiatives. Still, most patients are waiting for a diagnosis. Data-sharing initiatives and the adoption of the FAIR principles (Findability, Accessibility, Interoperability, and Reusability of digital assets) give the scientific community an unprecedented opportunity to explore and enhance the interpretation of the human genome.
In France, the Plan France Médecine Génomique 2025, led by AVIESAN, aims through its Collecteur Analyseur de Données (CAD) to provide access to well-genotyped and well-phenotyped datasets in order to decipher mechanisms of pathogenicity. Other nationwide initiatives, such as the UK Biobank or FinnGen, provide access to pan-genomic sequencing datasets associated with deep phenotyping. There is a pressing need for tools that help clinical biologists translate these data into clinical practice, including in silico prediction tools for genomic variants. Current clinical genetic testing focuses almost exclusively on regions of the genome that directly encode proteins. However, the important role of variants in non-coding regions is increasingly being demonstrated, and the use of genome sequencing in clinical diagnostic settings is rising across a large range of genetic disorders.
Objectives
This research project aims to build machine learning-based models capable of providing accessible information for the clinical interpretation of genetic variants in non-coding regions of the genome.
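As a rough, hedged illustration of what such a model could look like, the sketch below trains a baseline classifier on synthetic per-variant annotation features; the feature names (conservation, chromatin accessibility, distance to the nearest TSS), the labels and the model choice are placeholders, not the project's design.

```python
# Illustrative baseline only: scoring non-coding variants from synthetic
# annotation features with a gradient-boosted classifier.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_variants = 500
# Synthetic feature matrix standing in for per-variant regulatory annotations:
# [conservation, chromatin accessibility, distance to nearest TSS] (assumed names).
X = rng.normal(size=(n_variants, 3))
# Synthetic pathogenicity labels driven by the first two features plus noise.
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n_variants) > 0).astype(int)

clf = GradientBoostingClassifier(random_state=0)
print("CV AUC:", cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean())

# In practice such a score would be reported together with the evidence that
# drove it, to remain interpretable for clinical biologists.
```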
Title: Statistical inference of cellular heterogeneity using multi-omic prior biological knowledge
Background
Cellular heterogeneity in biological samples is a key factor that determines disease progression, but also influences biomedical analysis of samples and patient classification. At the molecular level, the cellular composition of tissues is difficult to assess and quantify, as it is hidden within the bulk molecular profiles of samples (average profile of millions of cells), with all cells present in the tissue contributing to the recorded signal. Despite great promise, conventional computational approaches to quantifying cellular heterogeneity from mixtures of cells have encountered difficulties in providing robust and biologically relevant estimates.
Objectives
So far, most statistical methods used for cell deconvolution ignore the biological relationships between the molecular features used in the models. Our goal is to provide a statistical framework for deconvolution that includes (i) the stochastic dependence across molecular features induced by mutual regulation mechanisms; (ii) a priori knowledge of the topology of multilayer interaction networks; and (iii) the similarity between samples that may be induced by controlled experimental conditions.
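For context, a minimal reference-based deconvolution sketch is shown below: a bulk profile is modelled as a non-negative mixture of cell-type signatures and the proportions are recovered by non-negative least squares. The signature matrix and mixture are synthetic, and none of the priors (i)-(iii) above are represented; it only illustrates the baseline this IRP aims to go beyond.

```python
# Minimal reference-based deconvolution sketch (synthetic data, no priors).
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(1)
n_genes, n_cell_types = 200, 4

# Reference signature matrix: expected expression of each gene per cell type.
S = rng.gamma(shape=2.0, scale=1.0, size=(n_genes, n_cell_types))

# Simulate a bulk profile as a mixture of cell types plus noise.
true_props = np.array([0.5, 0.3, 0.15, 0.05])
bulk = S @ true_props + rng.normal(scale=0.1, size=n_genes)

# Estimate proportions: minimize ||S p - bulk|| subject to p >= 0, then rescale.
p_hat, _ = nnls(S, bulk)
p_hat /= p_hat.sum()
print("true:", true_props, "estimated:", np.round(p_hat, 3))
```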
Key-words
cell deconvolution, gene network modeling, high-dimensional inference, multi-omic data integration
Download the detailed IRP profile →
David Causeur
IRMAR
Rennes
HOST LAB
Yuna Blum & Magali Richard
IGDR / TIMC
Rennes / Grenoble
SECONDMENT LAB
Océane Carpentier
PhD student
Start 01/09/2024
Title: Integrating prior knowledge for better patient representation
Background
One of the current challenges of precision medicine is to integrate heterogeneous data in order to describe each patient as adequately as possible. Today, biomedical data come from multiple sources (biomic data, imaging, microbiota, clinical notes, drug prescriptions, claims databases…), each data type being structured in a specific way. These available data can also be enriched with a priori information. For example, it is possible to link biomic data to interaction graphs, imaging data to features known to be relevant for diagnosis, microbiota data to functional annotations, or prescriptions to drug knowledge bases.
Objectives
The co-supervised PhD project aims to develop methods for the analysis of patients' data that harness prior knowledge for better performance. We will focus on enhancing (i) biomic and (ii) medical and administrative data available for patients using prior knowledge. Knowledge integration will be based on semantic web technologies or, more broadly, on knowledge graphs, which are widely used by the community to structure information. The aim is to quantify the contribution of this a priori information to classical risk analysis models as well as to more complex dimension reduction models, such as auto-encoders.
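As a small, hypothetical illustration of quantifying the contribution of a priori information, the sketch below augments gene-level patient features with pathway-level aggregates derived from a gene-to-pathway membership matrix (standing in for a knowledge graph) and compares a classical risk model with and without them. Data, knowledge matrix and model are synthetic placeholders, not the project's method.

```python
# Illustrative sketch: does knowledge-derived feature aggregation help a
# classical risk model? All data and the "knowledge" matrix are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
n_patients, n_genes, n_pathways = 300, 50, 5

X_genes = rng.normal(size=(n_patients, n_genes))              # patient-level omics features
M = (rng.random((n_genes, n_pathways)) < 0.2).astype(float)   # prior knowledge: gene -> pathway

# Outcome driven by one pathway's aggregate signal (synthetic ground truth).
risk = X_genes @ M[:, 0]
y = (risk + rng.normal(scale=1.0, size=n_patients) > 0).astype(int)

# Knowledge-derived features: mean signal per pathway.
X_pathways = X_genes @ M / np.maximum(M.sum(axis=0), 1)

for name, X in [("genes only", X_genes),
                ("genes + pathway aggregates", np.hstack([X_genes, X_pathways]))]:
    auc = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                          cv=5, scoring="roc_auc").mean()
    print(f"{name}: AUC = {auc:.3f}")
```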
Key-words
multimodal data integration, machine learning, knowledge models
Download the detailed IRP profile →
Emmanuelle Becker / Yann Le Cunff
IRISA
Rennes
HOST LAB
Nicolas Jay / Aurélie Bannay
LORIA
Nancy
SECONDMENT LAB
IRP 5
Defining patients’ phenotype from a mixed approach based on data and expert knowledge
Task forces involved
IRP 6
Extracting genotype-phenotype associations from network integration of deep and multiscale phenotype data
Task forces involved
IRP 7
Predictive properties of models/algorithms according to their contexts of use (contextualized predictions)
Task forces involved
IRP 8
Towards better quality, reliability and fairness of large-scale distributed heterogeneous healthcare data
Task forces involved
Renée Le Clech
PhD student
Start 01/09/2024
Title: Defining patients' phenotype from a mixed approach based on data and expert knowledge
Background
One of the challenges in using data from health databases is the characterization of phenotypes and the grouping of patients sharing the same phenotype using variables of different modalities and scales. For example, patients suffering from diabetes can be identified from the medication they take, the examinations they have undergone, or a diagnosis of diabetes during hospitalization. This grouping can be done through an expert approach, using the expert metadata associated with the variables. In addition, the classification of the metadata, for instance in ontologies, can be exploited. This grouping can also be done in a data-driven way, using correlations within the data. Indeed, variables concerning the same phenotype are correlated with each other: for example, when a headache occurs, its management will involve several simultaneous actions. This is in line with the work that we have already performed to identify patient subgroups from longitudinal data. Importantly, the identification of groups of patients sharing the same phenotype using both data-driven variable correlations and expert metadata is still missing in the field. Such an approach involves unsupervised learning methods, with the challenge of calculating distances between patients that incorporate all this information.
Objectives
The aim of this thesis project is to develop a generic method for identifying subgroups of patients with the same phenotype from health databases, jointly using variable correlations and expert metadata, and to implement it as a software package.
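One simple way to combine the two sources of information, shown below purely as a hedged sketch, is to mix a data-driven distance between patients with an expert-derived distance before clustering; the distances, the weighting and the clustering algorithm are illustrative choices, not the generic method this thesis will develop.

```python
# Illustrative sketch: weighted combination of a data-driven distance and an
# expert/ontology-derived distance, followed by hierarchical clustering.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(3)
n_patients, n_vars = 40, 10
X = rng.normal(size=(n_patients, n_vars))      # e.g., counts of care events (synthetic)

# Data-driven distance between patients, rescaled to [0, 1].
D_data = squareform(pdist(X, metric="euclidean"))
D_data /= D_data.max()

# Expert distance: patients sharing the same expert-defined group (e.g., derived
# from ontology annotations of their variables) are considered closer.
expert_group = rng.integers(0, 3, size=n_patients)
D_expert = (expert_group[:, None] != expert_group[None, :]).astype(float)

# Weighted combination of the two sources of information.
alpha = 0.5
D = alpha * D_data + (1 - alpha) * D_expert

Z = linkage(squareform(D, checks=False), method="average")
clusters = fcluster(Z, t=3, criterion="maxclust")
print(clusters)
```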
Key-words
clustering, medical knowledge representation, rare diseases
Download the detailed IRP profile →
Anne-Sophie Jannot / Nicolas Garcelon
HeKA team
Paris
HOST LAB
Nicolas Jay / Aurélie Bannay
LORIA
Nancy
SECONDMENT LAB
Florence Ghestem
PhD student
Start 01/09/2024
Title: Extracting genotype-phenotype associations from network integration of deep and multiscale phenotype data
Background
Deep phenotypic data are available in health databases through medical questionnaires and electronic health records (EHR), which can include drug prescriptions, lab results, information extracted from notes using natural language processing, or billing codes. Informative phenotypes are, however, often not directly available. There are two categories of methods to define disease phenotypes from health databases: first, methods relying on a predefined, validated algorithm specifically created for a given phenotype to identify cases and controls (an expert approach using metadata); second, data-driven methods relying on automated approaches. Network-based approaches belong to this second category and rely on similarity measures between individuals to identify homogeneous subgroups (phenotypic clusters).
Objectives
We recently developed an unsupervised network approach to analyze drug prescriptions from the EGB, a French medico-administrative database, and identified clinically relevant patient subgroups. We aim to extend this work by integrating a broader range of multiscale phenotypic and genomic data, in order to extract clinically meaningful information, identify patient subgroups, and correlate these subgroups with genomic variations.
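A minimal sketch of the network-based idea, on toy data and a single phenotypic layer, is given below: patients are connected according to the similarity of their prescription profiles and subgroups are obtained by community detection. The similarity measure, neighbourhood size and algorithm are assumptions for illustration only, not the project's pipeline.

```python
# Illustrative patient similarity network from synthetic prescription profiles,
# clustered by modularity-based community detection (single layer only).
import numpy as np
import networkx as nx

rng = np.random.default_rng(4)
n_patients, n_drugs = 30, 15
prescriptions = (rng.random((n_patients, n_drugs)) < 0.3).astype(float)  # 1 = drug prescribed

# Jaccard similarity between patients' prescription profiles.
inter = prescriptions @ prescriptions.T
union = prescriptions.sum(1)[:, None] + prescriptions.sum(1)[None, :] - inter
sim = np.divide(inter, union, out=np.zeros_like(inter), where=union > 0)
np.fill_diagonal(sim, 0.0)

# Patient similarity network: connect each patient to its 3 most similar peers.
G = nx.Graph()
for i in range(n_patients):
    for j in np.argsort(sim[i])[-3:]:
        if sim[i, j] > 0:
            G.add_edge(i, int(j), weight=sim[i, j])

# Subgroups = communities of the similarity network.
communities = nx.community.greedy_modularity_communities(G, weight="weight")
for k, com in enumerate(communities):
    print(f"subgroup {k}: patients {sorted(com)}")
```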
Key-words
multilayer network, clustering, genomic data, electronic health records (EHR)
Title: Predictive properties of models/algorithms according to their contexts of use (contextualized predictions)
Background
Diagnostic, prognostic and theranostic (clinical response) models will gradually be considered as medical devices. In the context of precision medicine, the performance of machine learning and deep learning models trained on available data does not yet meet clinicians' expectations for supporting medical decision-making. It is necessary to estimate the individual predictive properties of these models, in terms of accuracy and possible bias, as a function of the density of training data in the vicinity of new individual profiles.
Objectives
This PhD project is in line with the current trend towards predictive, preventive, personalized and participatory medicine, closer to the patient. (1) The objective is to evaluate the performance of machine learning and deep learning models by estimating the quality of their predictive properties, in terms of bias and uncertainty, according to the context of their application (contextualized predictions). In addition, statistical methods will be implemented to improve these predictive properties. (2) The individual and combined (synergistic) effects of variables of different types will also be studied by extending approaches developed in the context of machine learning (Roy 1998, Charvat 2013).
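To illustrate the notion of contextualized prediction in its simplest form, the sketch below uses the mean distance to the k nearest training profiles as a crude proxy for local training-data density and flags predictions made far from the training data. The data, threshold and density proxy are illustrative assumptions, not the methods that will be developed here.

```python
# Illustrative sketch: flag predictions with low support in the training data,
# using mean k-nearest-neighbour distance as a local-density proxy.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(5)
X_train = rng.normal(size=(500, 4))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

model = LogisticRegression().fit(X_train, y_train)
density = NearestNeighbors(n_neighbors=10).fit(X_train)

# New patient profiles: one typical, one far outside the training distribution.
X_new = np.array([[0.1, -0.2, 0.0, 0.3],
                  [5.0, 5.0, 5.0, 5.0]])
dist, _ = density.kneighbors(X_new)
local_sparsity = dist.mean(axis=1)

for prob, sparsity in zip(model.predict_proba(X_new)[:, 1], local_sparsity):
    caveat = "low support in training data" if sparsity > 1.5 else "well supported"
    print(f"predicted risk = {prob:.2f} ({caveat}, mean kNN distance = {sparsity:.2f})")
```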
Key-words
synergy, contextualized prediction, multimodal data
Download the detailed IRP profile →
Pascal Roy / Mathieu Fauvernier
LBBE
Lyon
HOST LAB
David Causeur
IRMAR
Rennes
SECONDMENT LAB
Silvia Grosso
PhD student
Start 08/11/2023
Title: Towards better quality, reliability and fairness of large-scale distributed heterogeneous healthcare data
Background
Federated Learning (FL) is a promising paradigm that is gaining traction in the context of privacy-preserving Machine Learning (ML). Thanks to FL, several data owners (e.g., health organizations) can collaboratively train a model on their private and decentralized data without sending their raw data to external service providers, thus keeping their data private. FL helps make 5P medicine a reality, i.e., personalized, predictive, preventive, participatory and populational medicine. In this context, the integration of data without a priori knowledge of the available data and their semantics is a major challenge (60). Furthermore, although FL has improved the privacy of ML by decentralizing the data and the learning process, recent literature shows that FL can exacerbate the problem of bias and unfairness (61), where models produce unfair decisions due to the use of incomplete, faulty, or prejudicial datasets and models. FL may exacerbate bias because of its decentralized nature, in which data distributions and sizes are particularly heterogeneous across participants.
Objectives
This IRP aims to provide FL data management methods and distributed protocols for handling heterogeneous and large-scale healthcare data, through: (i) multimodal FL for diagnosis, prognosis and therapeutic response; (ii) contextualized FL prediction for adapting FL models to their context of use; (iii) appropriate representation of metadata to better estimate data quality and reliability; (iv) fairness and bias mitigation in healthcare FL models; and (v) a multi-objective approach to account for privacy, fairness and model accuracy, as these objectives are usually antagonistic.
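For readers unfamiliar with the paradigm, a minimal federated averaging (FedAvg) sketch on synthetic data from three sites of unequal size is given below. It only illustrates how local training and server-side aggregation alternate; the multimodality, fairness and privacy aspects targeted by this IRP are deliberately ignored, and the model, data and hyperparameters are assumptions.

```python
# Minimal FedAvg sketch on synthetic data from three "hospitals" of unequal size.
import numpy as np

rng = np.random.default_rng(6)
true_w = np.array([1.5, -2.0, 0.5])

# Each site holds private data of a different size (heterogeneous, as in FL).
sites = []
for n in (200, 80, 30):
    X = rng.normal(size=(n, 3))
    y = X @ true_w + rng.normal(scale=0.1, size=n)
    sites.append((X, y))

w_global = np.zeros(3)
for round_ in range(50):
    local_weights, sizes = [], []
    for X, y in sites:
        w = w_global.copy()
        for _ in range(5):                          # a few local gradient steps
            grad = 2 * X.T @ (X @ w - y) / len(y)
            w -= 0.05 * grad
        local_weights.append(w)
        sizes.append(len(y))
    # Server aggregates: weighted average of local models; raw data never leave the sites.
    w_global = np.average(local_weights, axis=0, weights=sizes)

print("estimated:", np.round(w_global, 3), "true:", true_w)
```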