Data analysis and Benchmarking

Benchmarking atlas-level data integration in single-cell genomics

Cell atlases often include samples that span locations, labs, and conditions, leading to complex, nested batch effects in the data. Joint analysis of atlas datasets therefore requires reliable data integration. Choosing a data integration method is challenging because integration success is difficult to define. To address this, we developed kBET (k-nearest neighbour batch effect test) to quantify any type of batch effect present in single-cell RNA-seq data. Using this tool, we studied common normalisation and batch-correction approaches for their ability to remove batch effects while preserving biological signal.
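The idea behind a k-nearest-neighbour batch test can be illustrated with a minimal Python sketch. It is not the kBET implementation: it assumes Euclidean distance, a brute-force neighbour search, and a plain Pearson chi-square test that compares each cell's local batch composition with the global one. The function names, the default k, and the toy data are hypothetical.

```python
import math
import random
from collections import Counter


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum over i < df/2 of (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: normal-tail term plus a finite series.
    term, total = 1.0, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    tail = math.erfc(math.sqrt(half))
    return tail + math.sqrt(2 * x / math.pi) * math.exp(-half) * total


def knn_batch_rejection_rate(points, batches, k=10, alpha=0.05):
    """Fraction of cells whose k nearest neighbours have a batch
    composition that differs significantly from the global composition."""
    n = len(points)
    global_freq = {b: c / n for b, c in Counter(batches).items()}
    df = len(global_freq) - 1
    if df < 1 or k >= n:
        raise ValueError("need at least two batches and k < number of cells")
    rejected = 0
    for i, p in enumerate(points):
        # Brute-force neighbour search (fine for a toy example only).
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        local = Counter(batches[j] for _, j in dists[:k])
        # Pearson chi-square statistic: observed vs expected batch counts.
        stat = sum((local.get(b, 0) - k * f) ** 2 / (k * f)
                   for b, f in global_freq.items())
        if chi2_sf(stat, df) < alpha:
            rejected += 1
    return rejected / n


# Toy data (made up): two batches drawn from the same distribution,
# then the same cells with batch "B" shifted to mimic a batch effect.
random.seed(0)
labels = ["A" if i % 2 == 0 else "B" for i in range(200)]
mixed = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(200)]
shifted = [(x + 10.0, y) if b == "B" else (x, y) for (x, y), b in zip(mixed, labels)]

mixed_rate = knn_batch_rejection_rate(mixed, labels)
shifted_rate = knn_batch_rejection_rate(shifted, labels)
print(f"rejection rate, well mixed: {mixed_rate:.2f}")
print(f"rejection rate, shifted batch: {shifted_rate:.2f}")
```

In this sketch a low rejection rate indicates that batches are well mixed in local neighbourhoods, while a rate near 1 indicates a strong batch effect.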

Building upon the batch-effect correction results in low-to-medium complexity tasks, we benchmark 695 method and preprocessing combinations on 77 batches of gene expression, chromatin accessibility, and simulation data from 23 publications, altogether representing >1.2 million cells distributed across nine atlas-level integration tasks. Our integration tasks span several common sources of variation, such as individuals, species, and experimental labs. We evaluate methods using 14 metrics that cover scalability, usability, and the ability to remove batch effects while retaining biological variation.
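One way to combine several metrics into a single ranking can be sketched as follows. This is an illustration under stated assumptions, not the scoring scheme of the benchmark: each metric is min-max scaled across methods, scaled metrics are averaged within a batch-removal group and a bio-conservation group, and the two group means are combined with a weight. The metric names, the toy values, and the weight of 0.6 are all hypothetical.

```python
def min_max_scale(values):
    """Scale values to [0, 1]; a constant column maps to 0.5."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def overall_scores(metrics, batch_metrics, bio_metrics, bio_weight=0.6):
    """Combine per-method metrics into one score per method.

    metrics: {method: {metric_name: raw value}}, higher assumed better.
    bio_weight: assumed weight on the bio-conservation group.
    """
    methods = list(metrics)
    scaled = {m: {} for m in methods}
    for name in batch_metrics + bio_metrics:
        column = min_max_scale([metrics[m][name] for m in methods])
        for m, v in zip(methods, column):
            scaled[m][name] = v
    result = {}
    for m in methods:
        batch = sum(scaled[m][n] for n in batch_metrics) / len(batch_metrics)
        bio = sum(scaled[m][n] for n in bio_metrics) / len(bio_metrics)
        result[m] = (1 - bio_weight) * batch + bio_weight * bio
    return result


# Toy values (made up for illustration, not benchmark results).
toy = {
    "method_a": {"batch_mixing": 0.9, "graph_connectivity": 0.8,
                 "label_purity": 0.5, "cluster_agreement": 0.4},
    "method_b": {"batch_mixing": 0.6, "graph_connectivity": 0.7,
                 "label_purity": 0.9, "cluster_agreement": 0.8},
    "method_c": {"batch_mixing": 0.3, "graph_connectivity": 0.6,
                 "label_purity": 0.7, "cluster_agreement": 0.6},
}
scores = overall_scores(
    toy,
    batch_metrics=["batch_mixing", "graph_connectivity"],
    bio_metrics=["label_purity", "cluster_agreement"],
)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Scaling each metric before averaging keeps a metric with a wide numeric range from dominating the combined score; the weight makes the trade-off between batch removal and conservation of biological variation explicit.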

Not all of the tested methods were applicable to the largest scenario, which contains almost 1 million cells. Given the large collaborative efforts to build atlases, iteratively integrating new datasets into a reference atlas is an attractive alternative to integrating all datasets at once. We developed a transfer learning approach called scArches that allows reference building, updating, and model sharing without the need to share all datasets.

Generating and utilizing single-cell atlases

We regularly contribute to the generation and interpretation of tissue atlases using machine-learning-based data analysis. The Human Cell Atlas consortium aims to characterize every cell type in the healthy human body and includes lung tissue as one of its flagship projects. To this end, we have combined single-cell multi-omics and artificial intelligence with algorithms developed in the lab to better understand the heterogeneity and function of healthy lungs as well as cellular remodeling in aged lungs.

Specifically, driven by the recent global SARS-CoV-2 pandemic, we performed a single-cell meta-analysis of SARS-CoV-2 entry genes across tissues and demographics. ACE2 and accessory proteases (TMPRSS2, CTSL) are needed for SARS-CoV-2 cellular entry, and their expression may shed light on viral tropism and impact across the body. We assessed the cell type-specific expression of ACE2, TMPRSS2, and CTSL across 107 single-cell RNA-seq studies from different tissues. We also performed a meta-analysis of 31 lung scRNA-seq studies with 1,320,896 cells from 377 nasal, airway, and lung parenchyma samples from 228 individuals. This revealed cell type-specific associations of age, sex, and smoking with expression levels of ACE2, TMPRSS2, and CTSL. Cell type-specific expression patterns may contribute to COVID-19 pathogenesis, and our work highlights putative molecular pathways for therapeutic intervention.