More About This Database

"in specialibus generalia quaerimus"

- To seek the general in the specifics

Pre_GI is a novel, web-accessible, dynamic and comprehensive repository of prokaryotic genomic islands. Its easy-to-use interface lets users query the database by a variety of fields, parameters and associations. Pre_GI is built as a web resource for analysing ontological links between genomic islands and for stratigraphic analysis of the global fluxes of mobile genetic elements across bacterial taxonomic borders. Comparing newly identified genomic islands against the existing database allows users to determine their ontology, origin and relative time of acquisition. Existing genomic island databases each address specific horizontal gene transfer questions tied to their own research focus. Pre_GI instead aims to support research on horizontal gene transfer and genomic islands by investigating the entirety of the fluxes of genetic information through ecological niches and across taxonomic boundaries.

Genomic islands prediction

The standalone SeqWord Genomic Island Sniffer program was used for a large-scale analysis of prokaryotic genomes, identifying genomic islands in sequences obtained from ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/. This automated computational tool identifies horizontally transferred genomic elements in bacterial and plasmid DNA and is freely available for download from the SeqWord Project.

Genomic islands compositional comparison

Oligonucleotide usage pattern calculations and comparison algorithms introduced in our previously published work have shown that oligonucleotide usage pattern similarity above 75% suggests a common ancestry of genomic islands, or at least their involvement in common reticulation events. Oligonucleotide usage pattern similarity between genomic islands is determined by comparing, for each genomic island, the list of constituent k-mers (4-mers in this study) ordered by oligonucleotide frequency. One hundred percent similarity means that both genomic islands have the same ordered oligonucleotide frequency list; zero percent similarity means that the list of one genomic island is in the reverse order of the other. Oligonucleotide usage pattern similarity between genomic islands was used as the measure of compositional similarity.
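
The rank-based comparison described above can be illustrated with a minimal sketch. It is not the SeqWord implementation: the function names, the alphabetical tie-breaking and the normalisation (a rank distance divided by the distance between a list and its exact reverse) are assumptions chosen only so that identical orderings score 100% and reversed orderings score 0%.

```python
from collections import Counter
from itertools import product


def kmer_ranks(seq, k=4):
    """Rank every possible k-mer by its frequency in seq (rank 0 = most frequent).

    Ties are broken alphabetically so that the ordering is deterministic
    (an assumption of this sketch).
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    ordered = sorted(words, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ordered)}


def rank_similarity(ranks_a, ranks_b):
    """Percent similarity between two rankings of the same set of words."""
    n = len(ranks_a)
    distance = sum(abs(ranks_a[w] - ranks_b[w]) for w in ranks_a)
    # floor(n*n/2) is the summed rank distance between a list and its reverse.
    max_distance = n * n // 2
    return 100.0 * (1 - distance / max_distance)


def oup_similarity(seq_a, seq_b, k=4):
    """Hypothetical helper: compare two sequences by their ordered k-mer lists."""
    return rank_similarity(kmer_ranks(seq_a, k), kmer_ranks(seq_b, k))
```

With this normalisation, two islands with the same ordered 4-mer list give 100.0 and two islands with exactly reversed lists give 0.0, matching the two boundary cases described in the text.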

Genomic island sequence comparison

Sequence similarity for the initial set of genomic islands was obtained with an all-against-all BLASTN search over the entire nucleotide sequences of the genomic islands. An all-against-all BLASTP search was performed for all coding sequences contained within the genomic islands.
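
As a rough illustration of an all-against-all search, the sketch below builds the two command lines and parses the tabular results. It assumes the NCBI BLAST+ command-line tools (makeblastdb and blastn) and their 12-column tabular output (-outfmt 6); the text above does not state which BLAST version or settings were used, and the function names and file arguments are hypothetical.

```python
def blast_commands(fasta, db_name, out_path):
    """Build (but do not run) the commands for an all-against-all BLASTN.

    The same FASTA file is used as both database and query.
    """
    make_db = ["makeblastdb", "-in", fasta, "-dbtype", "nucl", "-out", db_name]
    search = ["blastn", "-query", fasta, "-db", db_name,
              "-outfmt", "6", "-out", out_path]
    return make_db, search


def parse_hits(lines):
    """Parse tabular BLAST output, skipping hits of a sequence to itself.

    Default columns: qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore.
    """
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        if query == subject:
            continue  # self-hits carry no information in an all-vs-all run
        hits.append((query, subject, float(fields[2]), int(fields[3]),
                     float(fields[10]), float(fields[11])))
    return hits
```

The commands can be passed to subprocess.run, and the output file can then be read line by line into parse_hits. For the protein comparison, the analogous assumption would be -dbtype prot with blastp.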

Genomic island clustering

Clustering was performed with the Markov clustering algorithm, using oligonucleotide usage pattern similarity values between genomic islands as relational scores. Initially, a lower cut-off of 75% oligonucleotide usage pattern similarity was applied, and links below it were discarded to ensure removal of random links. The influence of biased over-representation of genomic islands from closely related species, e.g. Escherichia coli, was reduced by implementing a novel upper threshold of 85% oligonucleotide usage pattern similarity to ensure the removal of duplicate genomic islands. Clusters with more than 50 genomic islands were subclustered to produce distinct and significant subgroups.
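
The thresholding and clustering steps can be sketched as follows. This is a small pure-Python illustration of Markov clustering (expansion followed by inflation), not the program used to build the database. The inflation value, iteration count, self-loop weight and the way clusters are read from the final matrix are assumptions, as are the function names.

```python
def filter_links(similarities, lower=75.0, upper=85.0):
    """Keep only island pairs whose similarity lies within [lower, upper]."""
    return {pair: s for pair, s in similarities.items() if lower <= s <= upper}


def mcl(nodes, links, inflation=2.0, iterations=50):
    """Minimal Markov clustering on an undirected weighted graph."""
    n = len(nodes)
    index = {name: i for i, name in enumerate(nodes)}
    # Self-loops of weight 1.0; similarities are rescaled from percent to 0..1.
    m = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for (a, b), s in links.items():
        m[index[a]][index[b]] = m[index[b]][index[a]] = s / 100.0

    def normalise(mat):
        # Make every column sum to 1 (column-stochastic matrix).
        for j in range(n):
            total = sum(mat[i][j] for i in range(n))
            for i in range(n):
                mat[i][j] /= total
        return mat

    m = normalise(m)
    for _ in range(iterations):
        # Expansion: square the matrix (two-step random walks).
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: strengthen strong flows and weaken weak ones.
        m = normalise([[v ** inflation for v in row] for row in m])

    # Rows that still carry flow are attractors; their non-zero columns
    # are the members of the attractor's cluster.
    clusters = set()
    for row in m:
        members = frozenset(nodes[j] for j, v in enumerate(row) if v > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)
```

Clusters with more than 50 members could then be passed through the same function again, for example with a higher inflation value, to obtain subclusters; how the subclustering was actually parameterised is not stated in the text above.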

Genomic island cluster/subcluster representatives

Genomic island representatives for each cluster/subcluster were needed to facilitate the management and amendment of the database. Representatives were designated as the nodes with the highest number of edges in the specific cluster/subcluster, i.e. the genomic island with the maximum number of oligonucleotide usage pattern similarity values between 75% and 85% to other genomic islands in that cluster/subcluster. Large clusters/subclusters required multiple representatives, as diverse members of a cluster/subcluster may not share any significant oligonucleotide usage pattern similarity. To ensure a complete set of representatives, each cluster/subcluster was checked for genomic islands not linked to its representative by an oligonucleotide usage pattern similarity of at least 75%. These unlinked genomic islands were then compared within their group in an all-against-all oligonucleotide usage pattern similarity search and re-clustered with the Markov clustering algorithm to identify an alternate representative. Certain clusters/subclusters may thus contain more than one representative, so that every member of the grouping is accounted for by a representative.
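
The selection of representatives can be sketched as below. The first representative is the member with the most links, as described above. For the members it does not cover, this sketch swaps in a simpler greedy step (again taking the best-connected remaining member) in place of the re-clustering described in the text. The function name and the alphabetical tie-breaking are assumptions.

```python
def pick_representatives(members, links):
    """Choose representatives so that every member is a representative
    or is directly linked to one.

    members: iterable of genomic island identifiers in one cluster
    links:   iterable of (a, b) pairs that passed the similarity thresholds
    """
    neighbours = {m: set() for m in members}
    for a, b in links:
        if a in neighbours and b in neighbours:
            neighbours[a].add(b)
            neighbours[b].add(a)

    uncovered = set(neighbours)
    representatives = []
    while uncovered:
        # Best-connected member among those not yet covered;
        # sorting first makes ties resolve alphabetically.
        rep = max(sorted(uncovered),
                  key=lambda m: len(neighbours[m] & uncovered))
        representatives.append(rep)
        uncovered -= neighbours[rep] | {rep}
    return representatives
```

For a cluster in which one island is linked to every other member, the function returns a single representative. Additional representatives appear only when some members share no qualifying link with those already chosen.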

Proposed genomic islands movement

The identification of donor-recipient movement rests on the assumption that amelioration gradually alters the nucleotide composition of a genomic island after insertion so that it comes to reflect that of its host, yet for an extended time after insertion the genomic island preserves compositional homomorphism with the donor and may therefore be traced back to its origin. GOHTAM implemented a similar approach to ascertain the origin of mobile genetic elements. In Pre_GI, this approach was applied by comparing oligonucleotide usage pattern similarity values calculated for homologous genomic islands hosted by different organisms in order to predict donor-recipient relationships. A significant difference between the oligonucleotide usage patterns of homologous genomic islands and those of their hosts indicates flux. High oligonucleotide usage pattern similarity to one host combined with lower similarity to another host indicates a likelihood that the latter host is the recipient of a genomic island from the former host.
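
The decision rule in the last sentence can be written as a short sketch. The function name and the min_gap parameter (the smallest difference in similarity treated as meaningful) are hypothetical; the text above does not give a numeric criterion for a significant difference.

```python
def predict_direction(host_a, host_b, sim_to_a, sim_to_b, min_gap=5.0):
    """Guess the donor and recipient for a pair of homologous genomic islands.

    sim_to_a, sim_to_b: percent similarity of the island composition to
    the genomes of host_a and host_b.
    Returns (donor, recipient), or None when the difference is below
    min_gap and no direction can be proposed.
    """
    if abs(sim_to_a - sim_to_b) < min_gap:
        return None
    # The host whose composition the island resembles more is the likely donor.
    if sim_to_a > sim_to_b:
        return (host_a, host_b)
    return (host_b, host_a)
```

Returning None for near-equal similarities avoids proposing a direction when the island is similarly close to both hosts in composition.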