The loose evolutionary relationships between transcription factors and other gene products across prokaryotes
© del Grande and Moreno-Hagelsieb; licensee BioMed Central. 2014
Received: 25 November 2014
Accepted: 10 December 2014
Published: 17 December 2014
Tests for the evolutionary conservation of associations between genes coding for transcription factors (TFs) and other genes have been limited to a few model organisms due to the lack of experimental information of functional associations in other organisms. We aimed at surmounting this limitation by using the most co-occurring gene pairs as proxies for the most conserved functional interactions available for each gene in a genome. We then used genes predicted to code for TFs to compare their most conserved interactions against the most conserved interactions for the rest of the genes within each prokaryotic genome available.
We plotted profiles of phylogenetic profiles, p-cubic, to compare the maximally scoring interactions of TFs against those of other genes. In most prokaryotes, genes coding for TFs showed lower co-occurrences when compared to other genes. We also show that genes coding for TFs tend to have lower Codon Adaptation Indexes compared to other genes.
The co-occurrence tests suggest that transcriptional regulation evolves quickly in most, if not all, prokaryotes. The Codon Adaptation Index analyses suggest quick gene exchange and rewiring of transcriptional regulation across prokaryotes.
Using data derived from literature on experimentally determined molecular interactions, previous work suggested that gene and gene product relationships brought about through transcriptional co-regulation in Escherichia coli K12 MG1655, have loose evolutionary conservation [1–3]. Such results suggest that transcriptional regulation might evolve quickly, an idea that gains support from other results, such as those suggesting that at least half of the transcription factors (TFs) present in E. coli might come from horizontal gene transfer .
Profiles of phylogenetic profiles, p-cubic, can provide information about the quality of functional interaction datasets , and about the evolutionary conservation of known functional interactions . As mentioned in the paragraph above, the work on evolutionary conservation has relied on experimentally determined interactions, such as those gathered in knowledge databases like RegulonDB  and EcoCyc . Knowledge databases are not readily available for most other genomes. It is therefore not possible to further test previous results in other genomes in the same way. While TFs might be determined by the presence of DNA-binding domains in encoded proteins, their target genes, for example, would not be known. However, the co-occurrence across genomes of genes coding for TFs with other genes can be measured, and we reasoned that maximally co-occurring genes might still reflect the evolutionary stability of interactions between TFs and other genes in any given genome (see further explanations under Results and discussion).
In this work we show evidence suggesting that the most evolutionarily conserved interactions for the sets of predicted TFs across a wide sample of publicly available prokaryotic genomes are less conserved than the most conserved interactions among all other gene products.
Genomes and phylogenetic profiles
Using a web-based tool , we selected a non-redundant genome dataset filtered using a genomic similarity score [3, 8, 9] chosen to keep the equivalent of one genome per represented species (G S S a=0.90) out of the 2733 prokaryotic genomes available at the RefSeq database  (ftp://ftp.ncbi.nih.gov/genomes/Bacteria/) by the end of December 2013. We further filtered this non-redundant genome dataset to keep genomes longer than 2.5 Mbp, with at least 80 genes coding for transcription factors other than sigma factors.
Profiles of phylogenetic profiles (p-cubic) [3, 5], are graphs representing the proportion of genes left at different thresholds of MI. Briefly, these graphs are anti-cumulative plots showing the decline in the proportion of pairs of genes left at increasing MI thresholds. Pairs of genes whose interactions are highly conserved should have higher MI than those with poorly conserved ones. Therefore, if a group contains a higher proportion of highly conserved interactions, their p-cubic line should tend to drop at a slower rate than the p-cubic of a group with less conserved interactions [see Figure one in ]. These graphs are very similar in concept to those presented previously by Date and Marcotte .
Experimentally determined TF datasets were obtained as follows: for Escherichia coli strain K12-MG1655 we downloaded the lists of predicted and manually curated TFs from RegulonDB , for Bacillus subtilis strain 168 we downloaded the lists of TFs from the DBTBS .
To identify TFs in the rest of the genomes in our study, we downloaded the lists of TF-related Pfam and Superfamily identifiers from the DNA-Binding Domains database (DBD) . We then ran hmmer (version 3.1b1)  comparisons of all the annotated proteins for each genome against the Hidden Markov Models of these domain families. The domain families were extracted with the hmmfetch program from the Pfam-A.hmm file from the Pfam database (version 27, http://pfam.janelia.org/) , and from the Superfamily hmmlib file (version 1.75; http://0-supfam.cs.bris.ac.uk.brum.beds.ac.uk/SUPERFAMILY/downloads.html) . For Pfam families we used the hmmscan --cut_ga option which uses the “gathering threshold” set for each family in Pfam. For Superfamily we downloaded as many pre-annotated genomes as available at the database. For other genomes we set an hmmscan maximum domain e-value of 0.0001 (--domE 1e-4), then filtered out the hmmer results using the ass3.pl script available from the Superfamily download site. We ensured that this procedure was adequate by running a couple of the pre-annotated genomes and verifying that we had the very same results.
Evaluation of p-cubic differences
Values of Δ P 3 > 0 indicate that the p-cubic curve for TF-coding genes fell below the curve for non TF-coding genes, indicative of TFs forming looser associations than other gene products. Values of Δ P 3 ≤ 0 indicate either equal association strength (Δ P 3 = 0), or TF-coding genes forming stronger associations (Δ P 3 < 0). After filtering, this analysis was performed on 790 prokaryotes, including some Archaea.
Codon Adaptation Index
We determined the Codon Adaptation Index (CAI)  for each gene within each of the non-redundant genomes chosen above (0.90 GSSa). The CAI compares the codon usage of a protein-coding gene against the codon usage of highly expressed genes (HXGs). The best examples of HXGs are those coding for ribosomal proteins. To find ribosomal proteins we used the COG and arCOG ribosomal protein families described by Yutin et al.. These COGs and arCOGs were matched to their corresponding bacterial and archaeal genomes and gene identifiers using the files provided by the authors (ftp://ftp.ncbi.nih.gov/pub/wolf/COGs/). If a genome in our database was not in those files, we checked the COG annotations provided with the genomes as downloaded from NCBI and/or compared, using the rpsblast program (part of NCBI’s BLAST+ suite) , the encoded proteins of each genome to the profiles for the appropriate COGs found at the Conserved Domains Database database . The rpsblast program was run with soft-masking (-seg yes -soft_masking true), a Smith-Waterman final alignment (-use_sw_tback), and a maximum e-value threshold of 1×10-6 (-evalue 1e-6).
To calculate the codon usage tables of the HXGs (the ribosomal protein-coding genes chosen above) of each genome, we used the program cusp from the EMBOSS software suite . We then used these HXGs codon usage tables to calculate the CAI for each protein-coding gene within the appropriate genome using the cai program also from the EMBOSS software suite.
Results and discussion
Top-scoring interactions in model organisms suggest that both experimentally-known and predicted TFs have less conserved interactions than other genes
Ideally, we would analyze the p-cubic of TFs and the genes in their target transcription units. However, databases containing enough literature-based data exist only for a few model organisms. Therefore, we developed computational strategies for gathering data of sufficient quality to perform these analyses across available genomes. We needed two kinds of data: (a) TFs and (b) their target genes.
Our strategy towards finding TFs consisted of downloading manually curated datasets [6, 16], predictions produced by other authors [6, 16, 20], as well as comparing the annotated proteins of each genome against previously described DNA-binding Pfam and Superfamily domains as described in the DBD database . We kept only the genomes containing at least 80 predicted TFs, where predicted TFs were genes coding for proteins matching the Pfam and Superfamily domains listed at the DBD database. This procedure reduced our prokaryotic non-redundant set from 950 to 857 (we provide tables of predicted TFs across the full set of 950 genomes used in this study as Additional files 1 and 2).
Once we found putative TFs in the genomes of interest we still needed target genes (TGs). Properly finding TGs can be quite a demanding task. However, we thought that we could still compare the conservation of more generic TF associations, not necessarily a TF gene to TG association, against the conservation of other gene associations. To this end we used mutual information (MI) as a measure of co-occurrence. We calculated the MI for every pair of genes in each of the 750 genomes selected above. For each gene, we selected its five top-scoring pairs as representatives of its most conserved interactions. For example, in E. coli K12, the gene lptD (gi|16128048) shows MI values against the other 4138 protein-coding genes in this genome ranging from 0 to 0.51, with the five top-scoring ones being 0.43, 0.44, 0.46, 0.48, and 0.51. We used these top-scoring values as representatives for the most evolutionarily conserved interactions of this gene.
The MI for both predicted and experimentally validated TFs from B. subtilis does not show as strong a difference to other genes as they do in E. coli. We do not have a full explanation for this difference in results. It could be that the interactions between TFs and target genes in B. subtilis is closer in evolutionary conservation to those of other genes. It could also be that the genomes in the database do not represent enough information to show the difference with enough emphasis. Finally, gene over-annotations in B. subtilis, as estimated by the SwissProt method proposed by Skovgaard et al.[25, 26], is higher for B. subtilis (16.6%) than for E. coli (5.32%). False genes would necessarily have no orthologs in other genomes, and their MI with other genes would necessarily be zero. Thus, false genes might lower the p-cubic curve of genes other than those coding for TFs (genes coding for TFs would most probably be true genes because their products match true protein domains).
Top-scoring interactions suggest that TFs have less conserved interactions than other genes among prokaryotes
The results above show p-cubic comparisons suggesting that TFs have less co-occurring, and therefore less conserved, interactions than other genes in model organisms (Figure 1). Since predicted TFs produced similar results to those obtained with manually-annotated TFs, we concluded that genes coding for predicted TFs in other prokaryotes would yield appropriate results to evaluate if TF-coding genes also show a tendency towards less evolutionarily conserved interactions than other genes in other prokaryotes.
To test for the conservation of functional associations between TF-coding genes and other genes in prokaryotes other than model organisms, we selected genomes at least 2.5 Mbp in length from NCBI’s RefSeq database (see Methods). We filtered small genomes because it is well known that prokaryotes with reduced genomes tend to lack TF-coding genes [27–29]. We also filtered out redundant genomes using a previously published method to cluster similar genomes and keeping only one as a representative , and rejected genomes with less than 100 genes coding for predicted TFs (see Methods).
Low CAIs suggest that genes coding for TFs tend to be horizontally transferred
In this work we presented data across several prokaryotic genomes suggesting that genes coding for TFs have evolutionarily loose relationships with other genes, and that genes coding for TFs have a tendency towards having low Codon Adaptation Indexes compared to other gene sets, suggesting that TF-coding genes are frequently horizontally transferred. Overall, these results suggest that transcriptional regulation evolves quickly among prokaryotes, and that the evolution of transcriptional regulation might be strongly tied to elements specializing in horizontal gene transfer, like pathogenicity and other genomic islands. It is therefore tempting to hypothesize that genomic islands might be of main importance in the evolution of transcriptional regulation.
Availability of supporting data
We provide predicted transcription factors across prokaryotic genomes used in this study at: http://microbiome.wlu.ca/TFs/
We thank The Shared Hierarchical Academic Research Computing Network (SHARCNET) for computing facilities. Work supported with a Discovery Grant to GM-H from the Natural Sciences and Engineering Research Council of Canada (NSERC). We thank Ernesto Pérez-Rueda for valuable comments and discussions.
- Babu M, Teichmann SA, Aravind L: Evolutionary dynamics of prokaryotic transcriptional regulatory networks. J Mol Biol. 2006, 358 (2): 614-633. 10.1016/j.jmb.2006.02.019.View ArticleGoogle Scholar
- Lozada-Chavez I, Janga SC, Collado-Vides J: Bacterial regulatory networks are extremely flexible in evolution. Nucleic Acids Res. 2006, 34 (12): 3434-45. 10.1093/nar/gkl423.PubMedPubMed CentralView ArticleGoogle Scholar
- Moreno-Hagelsieb G, Jokic P: The evolutionary dynamics of functional modules and the extraordinary plasticity of regulons: the escherichia coli perspective. Nucleic Acids Res. 2012, 40 (15): 7104-12. 10.1093/nar/gks443.PubMedPubMed CentralView ArticleGoogle Scholar
- Price MN, Dehal PS, Arkin AP: Horizontal gene transfer and the evolution of transcriptional regulation in escherichia coli. Genome Biol. 2008, 9 (1): 4-10.1186/gb-2008-9-1-r4.View ArticleGoogle Scholar
- Hu P, Janga SC, Babu M, Diaz-Mejia JJ, Butland G, Yang W, Pogoutse O, Guo X, Phanse S, Wong P, Chandran S, Christopoulos C, Nazarians-Armavil A, Nasseri NK, Musso G, Ali M, Nazemof N, Eroukova V, Golshani A, Paccanaro A, Greenblatt JF, Moreno-Hagelsieb G, Emili A: Global functional atlas of escherichia coli encompassing previously uncharacterized proteins. PLoS Biol. 2009, 7 (4): 96-10.1371/journal.pbio.1000096.View ArticleGoogle Scholar
- Gama-Castro S, Salgado H, Peralta-Gil M, Santos-Zavaleta A, Muniz-Rascado L, Solano-Lira H, Jimenez-Jacinto V, Weiss V, Garcia-Sotelo JS, Lopez-Fuentes A, Porron-Sotelo L, Alquicira-Hernandez S, Medina-Rivera A, Martinez-Flores I, Alquicira-Hernandez K, Martinez-Adame R, Bonavides-Martinez C, Miranda-Rios J, Huerta AM, Mendoza-Vargas A, Collado-Torres L, Taboada B, Vega-Alvarado L, Olvera M, Olvera L, Grande R, Morett E, Collado-Vides J: Regulondb version 7.0: transcriptional regulation of escherichia coli k-12 integrated within genetic sensory response units (gensor units). Nucleic Acids Res. 2011, 39 (Database issue): 98-105.View ArticleGoogle Scholar
- Keseler IM, Collado-Vides J, Santos-Zavaleta A, Peralta-Gil M, Gama-Castro S, Muniz-Rascado L, Bonavides-Martinez C, Paley S, Krummenacker M, Altman T, Kaipa P, Spaulding A, Pacheco J, Latendresse M, Fulcher C, Sarker M, Shearer AG, Mackie A, Paulsen I, Gunsalus RP, Karp PD: Ecocyc: a comprehensive database of escherichia coli biology. Nucleic Acids Res. 2011, 39 (Database issue): 583-90.View ArticleGoogle Scholar
- Moreno-Hagelsieb G, Wang Z, Walsh S, ElSherbiny A: Phylogenomic clustering for selecting non-redundant genomes for comparative genomics. Bioinformatics. 2013, 29 (7): 947-949. 10.1093/bioinformatics/btt064.PubMedView ArticleGoogle Scholar
- Moreno-Hagelsieb G, Janga SC: Operons and the effect of genome redundancy in deciphering functional relationships using phylogenetic profiles. Proteins. 2008, 70 (2): 344-52.PubMedView ArticleGoogle Scholar
- Pruitt KD, Tatusova T, Maglott DR: NCBI reference sequences (RefSeq,: a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2007, 35 (Database issue): 61-5.View ArticleGoogle Scholar
- Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL: BLAST+: architecture and applications. BMC Bioinformatics. 2009, 10: 421-10.1186/1471-2105-10-421.PubMedPubMed CentralView ArticleGoogle Scholar
- Moreno-Hagelsieb G, Latimer K: Choosing BLAST options for better detection of orthologs as reciprocal best hits. Bioinformatics. 2008, 24 (3): 319-24. 10.1093/bioinformatics/btm585.PubMedView ArticleGoogle Scholar
- Ward N, Moreno-Hagelsieb G: Quickly Finding Orthologs as Reciprocal Best Hits with BLAT, LAST, and UBLAST: How Much Do We Miss?. PLoS ONE. 2014, 9 (7): 101850-10.1371/journal.pone.0101850.View ArticleGoogle Scholar
- Huynen M, Snel B, Lathe W, Bork P: Predicting protein function by genomic context: quantitative evaluation and qualitative inferences. Genome Res. 2000, 10 (8): 1204-1210. 10.1101/gr.10.8.1204.PubMedPubMed CentralView ArticleGoogle Scholar
- Date SV, Marcotte EM: Discovery of uncharacterized cellular systems by genome-wide analysis of functional linkages. Nat Biotechnol. 2003, 21 (9): 1055-1062. 10.1038/nbt861.PubMedView ArticleGoogle Scholar
- Sierro N, Makita Y, de Hoon M, Nakai K: DBTBS: a database of transcriptional regulation in Bacillus subtilis containing upstream intergenic conservation information. Nucleic Acids Res. 2008, 36 (Database issue): 93-6.Google Scholar
- Wilson D, Charoensawan V, Kummerfeld SK, Teichmann SA: DBD–taxonomically broad transcription factor predictions: new content and functionality. Nucleic Acids Res. 2008, 36 (Database issue): 88-92.Google Scholar
- Eddy SR: Accelerated Profile HMM Searches. PLoS Comput Biol. 2011, 7 (10): 1002195-10.1371/journal.pcbi.1002195.View ArticleGoogle Scholar
- Punta M, Coggill PC, Eberhardt RY, Mistry J, Tate J, Boursnell C, Pang N, Forslund K, Ceric G, Clements J, Heger A, Holm L, Sonnhammer ELL, Eddy SR, Bateman A, Finn RD: The Pfam protein families database. Nucleic Acids Res. 2012, 40 (Database issue): 290-301.View ArticleGoogle Scholar
- de Lima Morais DA, Fang H, Rackham OJL, Wilson D, Pethica R, Chothia C, Gough J: SUPERFAMILY 1.75 including a domain-centric gene ontology method. Nucleic Acids Res. 2011, 39 (Database issue): 427-34.View ArticleGoogle Scholar
- Sharp PM, Li WH: The codon Adaptation Index–a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res. 1987, 15 (3): 1281-1295. 10.1093/nar/15.3.1281.PubMedPubMed CentralView ArticleGoogle Scholar
- Yutin N, Puigbò P, Koonin EV, Wolf YI: Phylogenomics of prokaryotic ribosomal proteins. PLoS ONE. 2012, 7 (5): 36972-10.1371/journal.pone.0036972.View ArticleGoogle Scholar
- Marchler-Bauer A, Zheng C, Chitsaz F, Derbyshire MK, Geer LY, Geer RC, Gonzales NR, Gwadz M, Hurwitz DI, Lanczycki CJ, Lu F, Lu S, Marchler GH, Song JS, Thanki N, Yamashita RA, Zhang D, Bryant SH: CDD: conserved domains and protein three-dimensional structure. Nucleic Acids Res. 2013, 41 (Database issue): 348-52.View ArticleGoogle Scholar
- Rice P, Longden I, Bleasby A: EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet. 2000, 16 (6): 276-277. 10.1016/S0168-9525(00)02024-2.PubMedView ArticleGoogle Scholar
- Skovgaard M, Jensen LJ, Brunak S, Ussery D, Krogh A: On the total number of genes and their length distribution in complete microbial genomes. Trends Genet. 2001, 17 (8): 425-428. 10.1016/S0168-9525(01)02372-1.PubMedView ArticleGoogle Scholar
- Moreno-Hagelsieb G, Hudy-Yuffa B: Estimating overannotation across prokaryotic genomes using BLAST+, UBLAST, LAST and BLAT. BMC Res Notes. 2014, 7 (1): 651-10.1186/1756-0500-7-651.PubMedPubMed CentralView ArticleGoogle Scholar
- van Nimwegen E: Scaling laws in the functional content of genomes. Trends Genet. 2003, 19 (9): 479-484. 10.1016/S0168-9525(03)00203-8.PubMedView ArticleGoogle Scholar
- Molina N, van Nimwegen E: Scaling laws in functional genome content across prokaryotic clades and lifestyles. Trends Genet. 2009, 25 (6): 243-247. 10.1016/j.tig.2009.04.004.PubMedView ArticleGoogle Scholar
- Cordero OX, Hogeweg P: Regulome size in Prokaryotes: universality and lineage-specific variations. Trends Genet. 2009, 25 (7): 285-286. 10.1016/j.tig.2009.05.001.PubMedView ArticleGoogle Scholar
- Schneider TD: Evolution of biological information. Nucleic Acids Res. 2000, 28 (14): 2794-2799. 10.1093/nar/28.14.2794.PubMedPubMed CentralView ArticleGoogle Scholar
- Nakamura Y, Itoh T, Matsuda H, Gojobori T: Biased biological functions of horizontally transferred genes in prokaryotic genomes. Nature Genet. 2004, 36 (7): 760-766. 10.1038/ng1381.PubMedView ArticleGoogle Scholar
- Koski LB, Morton RA, Golding GB: Codon bias and base composition are poor indicators of horizontally transferred genes. Molecular Biol Evol. 2001, 18 (3): 404-412. 10.1093/oxfordjournals.molbev.a003816.View ArticleGoogle Scholar
- Ragan MA: Detection of lateral gene transfer among microbial genomes. Curr Opin Genet Dev. 2001, 11 (6): 620-626. 10.1016/S0959-437X(00)00244-6.PubMedView ArticleGoogle Scholar
- Azad RK, Lawrence JG: Towards more robust methods of alien gene detection. Nucleic Acids Res. 2011, 39 (9): 56-10.1093/nar/gkr059.View ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.