Online Datasets for Microbial Comparative Genomics

You are a student researcher and have identified an interesting protein with a fabulous function. You’d like to know what other organisms have this protein, and whether or not they are fabulous as well. So far, the only tool you know of is the NCBI’s BLAST web server. It is very helpful and a good first stop, but the BLAST web server:

Uses an algorithm that is bad at identifying proteins from the same clade that are highly dissimilar;
Can return pages of highly similar results if the gene has been sequenced in many organisms and you do not exclude the appropriate groups;
Does not provide an easy way, in my experience, to download genomes containing BLAST results.

Fortunately, you have other options.

Below are the databases that I tend to use, in order of their approachability to a researcher who is comfortable using command line tools and searching for proteins with HMMs (or command line BLASTP). I encourage you to learn how to use the command line (some resources here) and to build and use protein HMMs in your work. I detail some of this in a tutorial about searching large datasets. If you have not developed those skills yet, you may rank the approachability of these databases differently.

Databases

A specific protein family

https://pfam.xfam.org/family/PF00384 (example family: DMSO reductases)

BLAST can help you identify the family your protein (or its domains) belongs to. From there, you can download various sets of proteins in the family and search for similar proteins on your computer. It is particularly useful to see if your protein groups with other proteins in a phylogenetic tree; you can then use a collection of proteins from a group to create a model (HMM) for your protein, aiding future searches.
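A minimal sketch of that local workflow, assuming HMMER 3 is installed and you have already aligned the proteins from your group. All file names and the E-value cutoff are hypothetical placeholders. It builds the hmmbuild and hmmsearch command lines and reads the table written by hmmsearch's --tblout option.

```python
import subprocess


def hmmer_commands(alignment, hmm_path, proteins, table_path, evalue=1e-5):
    """Return the hmmbuild and hmmsearch command lines as argument lists.

    The file names passed in are placeholders chosen by the caller.
    """
    build = ["hmmbuild", hmm_path, alignment]
    search = ["hmmsearch", "--tblout", table_path, "-E", str(evalue),
              hmm_path, proteins]
    return build, search


def run_commands(commands):
    """Run each command in order; raises if HMMER is missing or a step fails."""
    for command in commands:
        subprocess.run(command, check=True)


def parse_tblout(lines):
    """Yield (target name, full-sequence E-value, bit score) per hit.

    In the --tblout table, comment lines start with '#', and the first
    whitespace-separated fields are: target name, target accession,
    query name, query accession, E-value, score, bias.
    """
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        yield fields[0], float(fields[4]), float(fields[5])
```

Calling run_commands(hmmer_commands("group.aln", "group.hmm", "family.faa", "hits.tbl")) would build the model and search the downloaded family. Afterwards, parse_tblout(open("hits.tbl")) gives you the hits to place in a tree.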

An online web service using HMMs

https://www.ebi.ac.uk/Tools/hmmer/search/hmmsearch

EMBL-EBI’s hmmsearch allows you to search many databases, such as UniProtKB, using an HMM that you upload. (If you only have one protein and don’t want to build an HMM, phmmer is the tool for you.) By adjusting the reporting thresholds, you can limit the results to significant hits, then use the convenient download options to collect similar proteins and their metadata.

A collection of proteomes representative of genetic diversity

https://proteininformationresource.org/rps/

Using “representative proteomes” [ref] is great if you want to understand the distribution of your protein across the tree of life. The service grouped genomes by similarity at different levels, then selected a representative proteome (the collection of proteins from a genome) for each group. For example, RP55 provides about one proteome per genus. The RPG file provides information about the proteomes, like strain name, that can be helpful in guiding your analysis. Also, since you have each entire proteome, you can search for other proteins in the same set of organisms. Personally, I think this dataset is underutilized.
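Once you have searched a representative proteome file, a natural next step is to count hits per organism. The sketch below assumes the FASTA headers follow the UniProt style, with the taxon ID given as OX=<number>. That is an assumption on my part, so check a few headers in your download first. The function name and the example IDs are made up for illustration.

```python
import re
from collections import Counter

# Assumption: headers carry a UniProt-style "OX=<taxon id>" field.
OX_PATTERN = re.compile(r"\bOX=(\d+)")


def hits_per_taxon(fasta_lines, hit_ids):
    """Count how many hit proteins each taxon ID contains.

    fasta_lines: iterable of lines from the proteome FASTA file.
    hit_ids: set of sequence IDs (first word of the header, without '>').
    """
    counts = Counter()
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # skip sequence lines
        seq_id = line[1:].split()[0]
        match = OX_PATTERN.search(line)
        if seq_id in hit_ids and match:
            counts[match.group(1)] += 1
    return counts
```

Taxa with zero hits simply do not appear in the counter, so compare its keys against the full list of proteomes to see where the protein is absent.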

Note: if you want to visualize this distribution, the online tool AnnoTree is smartly designed to do just that.

Specific taxa from RefSeq and GenBank

https://github.com/kblin/ncbi-genome-download

Kai Blin’s tool for downloading genomes is not a database itself but a convenient and well-documented way to access RefSeq or GenBank. You can, for example, painlessly download all ~200 genomes in the genus Bradyrhizobium, then search those for, say, a chlorite:O2 lyase. Just be careful about data volume, particularly in highly sequenced lineages such as enteric pathogens. If you want, you can download all viral, bacterial, and archaeal genomes in RefSeq or GenBank.
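As a rough sketch, here is how the Bradyrhizobium example might be scripted from Python. The flags are the ones I understand the tool to accept (--section, --formats, --genera, --dry-run); confirm them with ncbi-genome-download --help for your installed version. The function names are my own.

```python
import subprocess


def genome_download_command(genus, group="bacteria", section="refseq",
                            file_format="protein-fasta", dry_run=True):
    """Build an ncbi-genome-download command line as an argument list.

    With dry_run=True the tool only lists what it would download, which is
    a cheap way to check the data volume before committing to it.
    """
    command = ["ncbi-genome-download",
               "--section", section,
               "--formats", file_format,
               "--genera", genus]
    if dry_run:
        command.append("--dry-run")
    command.append(group)  # taxonomic group is the positional argument
    return command


def download(genus, **options):
    """Run the command; raises if the tool is not installed or it fails."""
    subprocess.run(genome_download_command(genus, **options), check=True)
```

Running download("Bradyrhizobium") first, then download("Bradyrhizobium", dry_run=False) once the list looks reasonable, keeps you from accidentally pulling thousands of genomes.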

All JGI IMG genomes

https://img.jgi.doe.gov/

IMG is like a great neighborhood bakery that only lets you buy one croissant at a time. I’m referring to the inability to search through >500 (meta)genomes at once (after you’ve created a Genome Set for those genomes), which I’m guessing was a conscious trade-off between better service for regular users and worse service for jerks trying to download the entire tree of life (“Hurry up with my damn croissants!”). For our purposes, a clever workaround would be searching for and downloading proteins annotated with a particular Pfam, but many proteins belonging to a Pfam lack the appropriate annotation. That is not good if you want to identify new proteins.

Fortunately, you can use JGI’s instructions to download sequences (genomes, proteins, etc.) in bulk. The bulk download provides several files for each genome, including proteins (.faa), and is rather helpful. Note that according to a 2013 forum post, only genomes sequenced at JGI are available for download. Many of these genomes are not also uploaded into RefSeq or GenBank, so it can pay to search both if you want a full catalogue of available genomes.

What next?

Your choice of database and your analysis of it will depend on the scientific question you are trying to answer.

Often, I want to know where a group of new proteins fits into a preexisting tree, so I simply use a Pfam. For one project, I wanted to know the full extent of the gene’s diversity, so I searched pretty much every database. I also wanted to know what functions an accessory gene was involved in, so I used Python to parse genome files to get nearby genes (followed by many further analyses). For another project, I wanted to know if genes for a pathway were found together across the tree of life, so I used representative proteomes. If you are hesitant to learn programming (please don’t be!), some of this can be done on online web services for comparative genomics like IMG.
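To give a flavor of the "nearby genes" step, here is a minimal sketch (not my actual scripts) that reads CDS features from GFF3 lines and returns the genes flanking a gene of interest. It assumes each CDS has an ID attribute and appears on one line, and it ignores circular contigs. The function names are hypothetical.

```python
from collections import defaultdict


def read_cds(gff_lines):
    """Group CDS features by contig as (start, end, strand, gene_id) tuples."""
    contigs = defaultdict(list)
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments, and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        # Column 9 holds semicolon-separated key=value attributes.
        attrs = dict(item.split("=", 1)
                     for item in cols[8].split(";") if "=" in item)
        contigs[cols[0]].append(
            (int(cols[3]), int(cols[4]), cols[6], attrs.get("ID", "")))
    for genes in contigs.values():
        genes.sort()  # order genes by start coordinate
    return contigs


def neighbors(contigs, gene_id, flank=2):
    """Return IDs of up to `flank` genes on each side of gene_id."""
    for genes in contigs.values():
        ids = [gene[3] for gene in genes]
        if gene_id in ids:
            i = ids.index(gene_id)
            return ids[max(0, i - flank):i] + ids[i + 1:i + 1 + flank]
    return []
```

Counting neighbors by gene order rather than by base pairs is a design choice; a distance cutoff may suit genomes with large intergenic regions better.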

In my opinion, you’re likely looking for a protein because of its function, so when you consider its phylogenetic diversity you should also consider (1) the conservation of nearby genes that could be related in function and (2) the presence or absence of key residues, if they are known.

Get nearby genes (some Python helps) and see whether they are homologous, using something like all-v-all comparisons or clustering. I use in-house scripts that I am still developing, but one published example of how this is done is here: https://github.com/ryanmelnyk/PyParanoid.
Create a protein alignment and compare positions to a characterized protein. Again, using Python can help, for example by using regular expressions to find the key positions.
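For the key-residue check, one simple alternative to regular expressions is to map positions in the characterized protein onto alignment columns. The sketch below assumes you already have two aligned sequences of equal length from the same alignment, with gaps written as "-" or "."; the function name is hypothetical.

```python
def residues_at(reference_aln, query_aln, positions):
    """Report the query residue aligned to each reference position.

    reference_aln, query_aln: gapped sequences from the same alignment.
    positions: 1-based positions in the ungapped reference protein.
    Returns {position: query residue}, where '-' means the query has a gap.
    """
    if len(reference_aln) != len(query_aln):
        raise ValueError("aligned sequences must have the same length")
    column_of = {}
    ref_pos = 0
    for column, residue in enumerate(reference_aln):
        if residue not in "-.":
            ref_pos += 1  # count only real residues in the reference
            column_of[ref_pos] = column
    return {pos: query_aln[column_of[pos]] for pos in positions}
```

Comparing the returned residues to the known catalytic or ligand-binding residues of the characterized protein quickly flags hits that are unlikely to share its function.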

Whatever your analysis is, you should validate your hits. Using a second technique to identify related proteins, such as building a phylogenetic tree, helps you remove the false positives that any single technique produces.

Comparative genomics projects can be incredibly fun and can provide important insights, especially when you’re starting from a new protein. These projects have also been scary reminders of how diverse the microbial world is, and of how much effort is needed to fully characterize it.