Using computational molecular evolution and phylogeny to study pathogenic proteins


Background: Studying bacterial physiology, adaptation, and pathogenicity through the lens of evolution requires delineating the phylogenetic history of bacterial proteins and genomes. This history is best traced at all three levels (sequence, structure, and function) and complemented by comparative pangenomics. However, there is currently no unified framework that experimentalists can use to seamlessly perform all these analyses for their proteins of interest. Using the existing software and tools demands numerous skills and resources: installing individual tools; downloading and preprocessing massive datasets; matching disparate input-output data formats across tools; securing computational resources for efficient analyses; having the computational know-how to run command-line tools; and summarizing and visualizing results intuitively.
Approach: To address this challenge, we have developed a computational framework for comprehensive evolutionary analysis that systematically integrates multiple data sources to glean sequence-structure-function relationships. Our framework goes beyond simple sequence comparisons by delving into constituent domains, domain architectures, and phyletic spreads, and by tracing evolution across the tree of life. By adding a critical step of comparative pangenomics to these analyses, we can pinpoint molecular and genomic features that are unique to bacterial groups of interest (e.g., specific pathogens), which can then help prioritize candidate molecular targets, even in poorly characterized bacterial genomes.
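As a rough illustration of the comparative pangenomics idea, the sketch below finds features (e.g., domains or gene families) that occur in every genome of a group of interest and in none of the remaining genomes. It is a minimal sketch, not the MolEvolvR implementation: the function name `group_specific_features`, the presence/absence dictionary layout, and the genome and feature labels are all hypothetical.

```python
from typing import Dict, Set


def group_specific_features(
    presence: Dict[str, Set[str]], target: Set[str]
) -> Set[str]:
    """Return features found in every target genome and in no other genome.

    `presence` maps a genome label to the set of features detected in it
    (a hypothetical input format chosen for this example).
    """
    target_genomes = [g for g in presence if g in target]
    other_genomes = [g for g in presence if g not in target]
    if not target_genomes:
        return set()
    # Features shared by all genomes in the group of interest.
    core = set.intersection(*(presence[g] for g in target_genomes))
    # Every feature seen in at least one genome outside the group.
    background = set().union(*(presence[g] for g in other_genomes))
    return core - background


# Toy data with made-up labels.
presence = {
    "pathogen_A": {"dom1", "dom2", "dom3"},
    "pathogen_B": {"dom1", "dom3", "dom4"},
    "nonpathogen_C": {"dom1", "dom4"},
}
unique = group_specific_features(presence, {"pathogen_A", "pathogen_B"})
print(sorted(unique))
```

Requiring presence in every target genome is a deliberately strict choice; a real analysis would likely tolerate missing annotations by using a frequency threshold instead.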
Results: We have developed a computational approach for molecular evolution and phylogeny with which researchers can perform homology searches across the tree of life and reconstruct domain architectures. It characterizes the input proteins and each of their homologs by combining: sequence alignment and clustering algorithms for domain detection; profile matching against protein domain/orthology databases; and prediction algorithms for signal peptides, transmembrane regions, cellular localization, and secondary/tertiary structures. We are adding a layer of comparative pangenomics to identify features unique to bacterial groups of interest (e.g., pathogens vs. nonpathogens). An instance of the web app, applied to study a large number of Psp stress response proteins (present across the tree of life), is available online. The web app described here is a general web server that researchers can use to study any protein(s) of interest. To demonstrate the versatility of this framework, we are currently applying it to zoonotic pathogens causing severe and chronic pathologies in humans and animals, e.g., host-specific factors in nontuberculous mycobacteria, nutrient acquisition systems in Staphylococcus aureus, and surface layer proteins in Bacillus anthracis. We have implemented this computational approach for molecular evolution and phylogeny (MolEvolvR) as open-source software (an R package) and a streamlined, easy-to-use web application that will enable us and other researchers to prioritize candidate genetic factors in their application of interest for experimental validation.
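To make the notions of domain architecture and phyletic spread concrete, here is a minimal Python sketch under stated assumptions. It is not MolEvolvR code (which is an R package): the `DomainHit` record, the greedy best-score overlap resolution, the `+` separator, and all protein, domain, and lineage labels are illustrative choices.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Set


class DomainHit(NamedTuple):
    name: str     # domain label from a profile search (hypothetical)
    start: int    # 1-based start position on the protein
    end: int      # inclusive end position
    score: float  # higher means a more confident hit


def domain_architecture(hits: List[DomainHit]) -> str:
    """Keep the best-scoring non-overlapping hits, ordered N- to C-terminus."""
    kept: List[DomainHit] = []
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        # Accept a hit only if it overlaps none of the hits kept so far.
        if all(hit.end < k.start or hit.start > k.end for k in kept):
            kept.append(hit)
    return "+".join(h.name for h in sorted(kept, key=lambda h: h.start))


def phyletic_spread(
    architectures: Dict[str, str], lineages: Dict[str, str]
) -> Dict[str, Set[str]]:
    """Map each architecture to the set of lineages in which it occurs."""
    spread: Dict[str, Set[str]] = defaultdict(set)
    for protein, arch in architectures.items():
        spread[arch].add(lineages[protein])
    return dict(spread)


# Toy example with made-up hits and lineages.
hits = [
    DomainHit("domA", 1, 200, 90.0),
    DomainHit("domX", 150, 250, 10.0),  # overlaps domA with a lower score
    DomainHit("domB", 260, 280, 30.0),
]
print(domain_architecture(hits))
```

Grouping homologs by architecture string and then counting lineages per group gives a simple view of how widely each architecture is distributed across the tree of life.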

May 13, 2021 8:00 PM — 9:00 PM

Related: Samuel Z Chen GLBio poster (video)