
Our most recent publication is about inferring protein abundances from transcript abundances using convolutional neural networks.

INET,

Schwehn, P.M., Falter-Braun, P. Inferring protein from transcript abundances using convolutional neural networks. BioData Mining 18, 18 (2025). doi.org/10.1186/s13040-025-00434-z

Background

Although transcript abundance is often used as a proxy for protein abundance, it is an unreliable predictor. As proteins execute biological functions and their expression levels influence phenotypic outcomes, we developed a convolutional neural network (CNN) to predict protein abundances from mRNA abundances, protein sequence, and mRNA sequence in Homo sapiens (H. sapiens) and the reference plant Arabidopsis thaliana (A. thaliana).
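
To illustrate how a convolutional layer can read sequence input, the sketch below one-hot encodes an mRNA sequence and slides a single filter along it. It is not the published architecture: the alphabet `ACGU`, the helper names `one_hot` and `conv1d_scan`, and the hand-written start-codon filter are assumptions chosen for illustration. In a trained CNN the filter weights are learned rather than set by hand.

```python
# Illustrative only: a single hand-written convolution filter acting as a
# motif scanner. Names and alphabet are assumptions, not the paper's code.
NUCLEOTIDES = "ACGU"


def one_hot(seq):
    """Encode a sequence as a list of 4-element rows, one row per base."""
    return [[1.0 if base == n else 0.0 for n in NUCLEOTIDES] for base in seq]


def conv1d_scan(encoded, kernel):
    """Slide `kernel` (width x 4) along `encoded` (length x 4), no padding.

    Returns one score per window: the sum of elementwise products.
    """
    width = len(kernel)
    scores = []
    for start in range(len(encoded) - width + 1):
        score = 0.0
        for offset in range(width):
            for channel in range(len(NUCLEOTIDES)):
                score += encoded[start + offset][channel] * kernel[offset][channel]
        scores.append(score)
    return scores


# A filter whose weights equal the one-hot encoding of "AUG" scores highest
# where the sequence matches that motif exactly.
aug_filter = one_hot("AUG")
scores = conv1d_scan(one_hot("GCAUGGC"), aug_filter)
print(scores)  # [0.0, 0.0, 3.0, 1.0, 0.0]
```

The peak score at the third window marks the position of the motif. Inspecting learned filters in this way is one route by which sequence features can be read back out of a trained network.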

Results

After hyperparameter optimization and initial data exploration, we implemented distinct training modules for value-based and sequence-based data. By analyzing the learned weights, we revealed common and organism-specific sequence features that influence protein-to-mRNA ratios (PTRs), including known and putative sequence motifs. Adding condition-specific protein interaction information identified genes correlated with many PTRs but did not improve predictions, likely due to insufficient data. The integrated model predicted protein abundance on unseen genes with a coefficient of determination (r²) of 0.30 in H. sapiens and 0.32 in A. thaliana.
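
The two quantities reported above can be sketched in a few lines. This is a minimal sketch under stated assumptions: the coefficient of determination is computed as 1 - SS_res / SS_tot, which is one common definition (the paper may instead use the squared Pearson correlation), and the PTR is taken on a log10 scale, which is a frequent convention but is not stated in the abstract. The function names `r_squared` and `log_ptr` are hypothetical.

```python
import math


def r_squared(observed, predicted):
    """Coefficient of determination, assumed here to be 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values are constant; score is undefined")
    return 1.0 - ss_res / ss_tot


def log_ptr(protein_abundance, mrna_abundance):
    """Protein-to-mRNA ratio on a log10 scale (scale is an assumption)."""
    return math.log10(protein_abundance / mrna_abundance)


# Toy values for held-out genes, invented purely to exercise the functions.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 1.5, 3.5, 3.5]
print(r_squared(observed, predicted))
print(log_ptr(100.0, 10.0))
```

Under this definition, a model that always predicts the mean observed abundance scores 0 and a perfect model scores 1, so values of 0.30 and 0.32 mean that roughly a third of the variance in protein abundance on unseen genes is explained.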

Conclusions

For H. sapiens, our model improves prediction performance by nearly 50% compared to previous sequence-based approaches, and for A. thaliana it represents the first model of its kind. The model’s learned motifs recapitulate known regulatory elements, supporting its utility in systems-level and hypothesis-driven research approaches related to protein regulation.