Computational RNA Biology

Modeling protein-RNA interactions in human health and diseases

The observation of widespread RNA-protein interactions in distinct biological processes, together with the contribution of mis-regulated interactions to human diseases, opens up the possibility of novel approaches to therapeutic intervention and points to the importance of developing new analytical methods to dissect these interactions.

In the past few years, the iCLIP and eCLIP techniques have facilitated the detection of protein–RNA interaction sites at high resolution. We previously developed PureCLIP, a non-homogeneous hidden Markov model-based approach that detects protein-RNA interaction sites while explicitly incorporating specific biases into the model. This enables the identification of bona fide RNA-protein interactions with unprecedented precision. In addition, to characterize the sequence-structure affinity patterns of RNA-binding proteins (RBPs) for RNA molecules, we have developed several machine learning methods for RBP prediction, de novo RNA motif finding and binding site prediction, using string kernel support vector machines (Bressin et al., NAR 2019), hidden Markov models (Heller et al., NAR 2017) and convolutional neural networks to learn more complicated RBP-RNA interaction patterns (Budach et al., Bioinformatics 2018).
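To make the idea of a string kernel on RNA sequences concrete, here is a minimal sketch of one simple variant, the k-mer spectrum kernel. It is an illustration under stated assumptions, not the kernel or code used in the cited publications; the function names and the choice k=3 are hypothetical.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq, k):
    """Count every overlapping k-mer in a sequence (DNA 'T' is read as RNA 'U')."""
    seq = seq.upper().replace("T", "U")
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a, b, k=3):
    """Dot product of the k-mer count vectors of two sequences."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    return sum(n * cb[kmer] for kmer, n in ca.items())


def normalized_kernel(a, b, k=3):
    """Cosine-normalized spectrum kernel; 0.0 if either sequence has no k-mers."""
    denom = sqrt(spectrum_kernel(a, a, k) * spectrum_kernel(b, b, k))
    return spectrum_kernel(a, b, k) / denom if denom else 0.0


def gram_matrix(seqs, k=3):
    """Pairwise normalized kernel values for a list of sequences."""
    return [[normalized_kernel(s, t, k) for t in seqs] for s in seqs]


if __name__ == "__main__":
    # Toy sequences, invented for illustration only.
    toy = ["UGUAUGUA", "UGUA", "CCCCCC"]
    for row in gram_matrix(toy):
        print([round(v, 3) for v in row])
```

A Gram matrix of this kind can be passed to any SVM implementation that accepts precomputed kernels. Sequences that share many k-mers receive a value close to 1, and sequences with no k-mer in common receive 0.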

In the future we aim to extend our machine learning models to clinical RNA-seq and CLIP-seq data in order to detect differential RBP binding across patients in the context of cancer and diabetes (Figure 2), as well as the effect of disease-associated genetic variation on RBP-RNA interactions. In collaboration with several experimental groups at the Helmholtz and MPI Berlin we also aim at 1) understanding how newly characterized post-transcriptional RNA modifications affect RNA binding and 2) building the post-transcriptional network that regulates retrotransposon activity in health and diseases.


Figure 2: Detection of differential RNA Binding


Dissecting microRNA functions in human health and diseases

MicroRNAs are small RNAs that post-transcriptionally regulate gene expression. It is estimated that more than 60% of human transcripts harbor microRNA binding sites and are potentially regulated by these molecules during early development, health and diseases. But how can we computationally pinpoint the crucial microRNAs involved in a certain process from high-throughput data and reconstruct miRNA-mediated gene regulatory networks?

In our previous group in Berlin we developed several tools and established computational analyses to address this question. For example, we developed a semi-supervised machine learning method for microRNA promoter recognition called PROmiRNA (Marsico et al. Genome Biol 2013), which allowed us for the first time to study the characteristics of the regulatory elements of different microRNAs as well as their genetic variation across human populations (Budach et al. Genetics 2016).

As part of the SFB-TR48 consortium we are collaborating with the experimental group of Prof. Bernd Schmeck at the Institute for Lung Research at the Philipps University of Marburg to characterize the regulation of microRNAs in L. pneumophila lung infection. By integrating different types of high-throughput data we aim to reconstruct the small RNA-mediated regulatory network describing the host-pathogen cross-talk during the inflammation process.

Together with the group of Aydan Bulut-Karslioglu at MPIMG Berlin we are using next-generation sequencing data to discover new microRNAs that regulate the poorly understood process of embryo developmental pausing, and to predict their function computationally.


Supervised and unsupervised machine learning methods to predict long non-coding RNA functions and disease associations

The discovery that a considerable portion of eukaryotic genomes is transcribed and generates long non-coding RNAs (lncRNAs) raises questions about the centrality of these lncRNAs in gene regulatory processes and diseases. The rapidly increasing number of mechanistically investigated lncRNAs has provided evidence for different functional classes, such as enhancer-like lncRNAs, which modulate gene expression via chromatin looping (Ntini et al. Nat Commun 2018). However, despite great progress in recent years, the majority of lncRNAs remain functionally uncharacterized.

In our group we have started using network analysis approaches and clustering-on-graph techniques to investigate the chromatin interaction network involving lncRNAs, protein-coding genes and other DNA regulatory elements, in order to pinpoint the important lncRNA-mediated gene regulatory modules in a given biological process (Figure 3).

Focusing on enhancer-like lncRNAs, we are currently using experimental techniques and machine learning models to perform a first in silico classification of lncRNAs, with a special effort to develop a framework for the meaningful integration of different omics data. In addition, focusing on the well-known Xist lncRNA involved in X-chromosome inactivation, we developed, in collaboration with the Edith Heard lab at EMBL and the Edda Schulz lab at MPIMG, a Random Forest-based model to predict the susceptibility of genes to silencing from hundreds of genomic and epigenetic features.

Figure 3: Module detection and functional annotation on the lncRNA chromatin network


Deep learning methods for systems biomedicine

Despite the vast increase in high-throughput molecular and patient data, the prediction of important disease genes and of the mechanisms underlying multi-factorial and heterogeneous diseases remains a challenging task. In our lab we are currently developing machine learning classifiers based on graph convolutional networks for disease gene prediction and patient stratification. While we currently focus on pan-cancer analysis or data from specific cancer patients, we aim in the future to extend our method to integrate heterogeneous data from other complex diseases, such as type 2 diabetes.
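As a rough illustration of what a graph convolutional network computes, the sketch below implements a single propagation step of the widely used form ReLU(D^-1/2 (A + I) D^-1/2 H W) in plain Python. It is not our classifier: the toy gene network, the single feature per gene and the weight matrix are all invented for the example, and the function names are hypothetical.

```python
from math import sqrt


def normalize_adjacency(adj):
    """Return D^-1/2 (A + I) D^-1/2 for a symmetric 0/1 adjacency matrix
    without self-loops (self-loops are added here)."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]  # every degree is >= 1 due to self-loops
    return [[a_hat[i][j] / sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
            for row in x]


def gcn_layer(adj, features, weights):
    """One graph convolution step: ReLU(A_norm @ H @ W)."""
    a_norm = normalize_adjacency(adj)
    z = matmul(matmul(a_norm, features), weights)
    return [[max(0.0, v) for v in row] for row in z]


if __name__ == "__main__":
    # Toy network of three genes connected in a chain: 0 - 1 - 2.
    adj = [[0, 1, 0],
           [1, 0, 1],
           [0, 1, 0]]
    features = [[1.0], [0.0], [0.0]]  # one made-up feature per gene
    weights = [[1.0]]                 # made-up 1x1 weight matrix
    print(gcn_layer(adj, features, weights))
```

After one step, the signal placed on gene 0 has spread to its direct neighbor gene 1 but not yet to gene 2. Stacking several such layers lets each gene's representation depend on a larger network neighborhood, and in a trained model the weights are learned rather than fixed by hand.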

In addition, in cooperation with Prof. Erika von Mutius at the Children's Hospital of LMU Munich, we have started developing machine learning methods for asthma patient stratification in an unprecedentedly large cohort, integrating both clinical and genomic data.

In the future, our models could bring considerable advances in precision medicine and contribute to the development of personalized diagnostic and therapeutic approaches.