Elizabeth A. Greene & Steven Henikoff
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. North
Seattle, Washington 98109-1024, USA
(eagreene@fhcrc.org and steveh@muller.fhcrc.org)
Reprinted from: Nature Genetics Online
The World-Wide Web has literally changed the way we look at genes and proteins. In the past, we ran retrievals, searches and analyses on our local computers, but this required constant updating of databases and upgrading of programs. Now, with a link to the Internet, each of us has at our disposal up-to-date databases and search tools. Web browser pages provide a consistent, easily mastered interface to many search tools; hyperlinks allow rapid navigation through layers of information; and databases and programs, interrogated at their source, are guaranteed to be current.
Entrez is the primary retrieval tool for molecular biologists. Just as general web search engines index the web, the Entrez browser indexes databases important to biologists: the sequence and structure databanks and the PubMed collection of citations and abstracts. A major strength of Entrez is integration among databank entries. In addition, neighbouring search algorithms provide ready access to sequences and publications related to a query.
When the retrieved sequence is a protein, Swiss-Prot provides a rich source of hyperlinks to a variety of external sources of knowledge. For example, the entry for pituitary-specific transcription factor, POU1F1, includes links to GeneCards and OMIM, which provide genetic and physical map positions, phenotypes and expression data. Other links to PROSITE lead to descriptions of POU and homeobox modules present in the POU1F1 protein, and a link to ProDom provides a graphic that illustrates how these POU1F1 modules align with homologous modules in other proteins. A link to Swiss-Model leads to a predicted 3D structure for POU1F1 that was obtained by modelling the sequence on the known crystal structure of POU2F1. Many specialized databases are available via links from Swiss-Prot entries, such as the TRANSFAC database, which summarizes the DNA-binding specificity of the POU1F1 transcription factor.
Retrievals of either multiple sequences or entries from multiple databases can be accomplished over the web. Both Entrez and Expasy have batch retrieval modes, and by filling out a Sequence Retrieval System (SRS) form, a user can probe multiple selected databases using simple or complex queries.
Similarity searches of the general sequence databanks, such as those performed by BLAST and FASTA, are so familiar to biologists that no description is necessary. Nevertheless, the rapid growth of sequence databanks means that searches are often better limited to subsets. The fragmentary nature of ESTs and their disproportionate increase in number led to the separation of dbEST from GenBank, and the organism-specific subsets of dbEST have themselves been condensed into TGI.
Sharpening the focus of a search both reduces the size of the output and lowers the background of chance hits. This might be accomplished by searching a specific division of a general sequence database or limiting a search to just the latest updates. In some cases, organism-specific databases can be searched, and this may be desirable for biological reasons as well. An added advantage of organism-specific searches is that large-scale projects dedicated to genomic sequencing often make the latest data available for public access on their own servers prior to GenBank submission. However, the location of these resources is often unknown to potential users or is hard to find. To address this problem, we have collected direct links to organism-specific search engines. We also make available the DIYDb (Do It Yourself) BLAST service for comparing a sequence of interest to a user's own database. To our knowledge this is the first database searching service that provides no data.
Sequence databanks are increasing in size and redundancy, but the number of protein families has been levelling off. This situation increases the value of family-specific databases, both for searching sensitivity and for predicting structure and function. Protein family features in a sequence of interest can be efficiently identified by searching it against any of several family-specific databases. Blocks and Prints detect local regions of similarity, and ProDom, Pfam and ProfileScan detect end-to-end similarity. Conversely, regions of sequence similarity characteristic of a family can be used to detect more distant homologues in the sequence databanks. These regions are identified from multiple sequence alignments, as performed by ClustalW for sequences that align from end to end and by BlockMaker, MEME and Match-Box, which use motif-based methods. These tools are all accessible from a single form at the BCM launcher, which also provides forms for numerous useful sequence analysis procedures. To detect distant homologues, the MAST searcher uses output directly from MEME and BlockMaker to query current protein sequence databanks. In contrast, PSI-BLAST starts with a single sequence and combines multiple alignment with databank searching by taking the best hits in a conventional BLAST search and using the regions of alignment in subsequent iterative BLAST searches.
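The idea of reusing aligned hits in later search rounds can be illustrated with a toy position-specific scoring matrix. The sketch below is not the PSI-BLAST implementation: it assumes ungapped hits of equal length, a uniform background frequency of 1/20, and a fixed pseudocount, whereas real profile methods weight sequences and use more careful pseudocount schemes. The function names `build_pssm` and `score_segment` are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(aligned_hits, pseudocount=1.0):
    """Build log-odds scores per column from equal-length aligned segments.

    Assumption: uniform background (1/20) and a flat pseudocount per residue.
    """
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in zip(*aligned_hits):
        # Count only standard residues; gaps and unknowns are ignored.
        counts = Counter(r for r in column if r in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        scores = {
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        }
        pssm.append(scores)
    return pssm

def score_segment(pssm, segment):
    """Sum the column scores for a candidate segment (unknown residues score 0)."""
    return sum(column.get(aa, 0.0) for column, aa in zip(pssm, segment))
```

In an iterative search, segments scoring well against such a matrix would be added to the alignment and the matrix rebuilt, so that each round reflects the family rather than the single starting sequence.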
To get the most from an alignment, informative displays are essential. LANLview depicts a pairwise alignment as a cartoon that is colour-coded to illustrate the degree of conservation. Boxshade highlights conserved residues within multiple sequence alignments. Logos display positions of multiple alignments as stacks of residue letters whose heights are indicative of the degree of conservation. Evolutionary information is traditionally displayed as trees derived from multiple alignments, which are valuable for discerning subfamily relationships.
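How letter heights in a logo relate to conservation can be sketched as follows. This is a minimal illustration assuming the usual information-content measure (maximum entropy minus observed column entropy, in bits) with no small-sample correction; the function names are hypothetical and the output is numeric heights, not a drawing.

```python
import math
from collections import Counter

def column_information(column, alphabet_size=20):
    """Information content in bits of one alignment column, ignoring gaps."""
    residues = [r for r in column if r != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy

def logo_heights(alignment, alphabet_size=20):
    """For each column, return {residue: height}, height = frequency * information."""
    heights = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        info = column_information(column, alphabet_size)
        counts = Counter(residues)
        total = len(residues)
        heights.append({r: (n / total) * info for r, n in counts.items()})
    return heights
```

A fully conserved protein column reaches the maximum stack height of log2(20) bits, while a column with many different residues yields a short stack, which is why conserved positions stand out visually.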
A variety of structural features can be predicted by analysing single sequences. Some unstructured regions are detected as compositionally biased segments by the SEG filter that removes them from BLAST queries. It is also useful to filter out coiled-coil regions, which can now be confidently predicted from protein sequence. Internal repeats can be identified simply by self-alignment. Transmembrane spanning regions are detected by measuring hydrophobicity. Secondary structural elements within a protein are predicted with just over 70% accuracy when multiple sequence alignments are available, and these predictions might aid in evaluating structural or functional inferences.
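Hydrophobicity-based detection of membrane-spanning regions can be sketched with a sliding-window average. The example below assumes the Kyte-Doolittle hydropathy scale with a 19-residue window and a cutoff of 1.6, a commonly quoted choice; it is an illustration of the principle, not the method used by any particular server, and the function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(sequence, window=19):
    """Mean hydropathy of every full window along the sequence.

    Raises KeyError for residues outside the 20 standard amino acids.
    """
    values = [KYTE_DOOLITTLE[residue] for residue in sequence.upper()]
    if len(values) < window:
        return []
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

def candidate_tm_windows(sequence, window=19, threshold=1.6):
    """0-based start positions of windows whose mean hydropathy meets the threshold."""
    profile = hydropathy_profile(sequence, window)
    return [i for i, mean in enumerate(profile) if mean >= threshold]
```

Runs of consecutive flagged windows suggest a candidate membrane-spanning segment; the window length and threshold are adjustable assumptions and strongly affect what is reported.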
For proteins that have structures available, CATH and SCOP give fold classifications. Direct comparisons between related structures are carried out most vividly using VAST, which allows display and manipulation of 3D structural superpositions. Only a few years ago, the ability to easily navigate 3D structures required an expensive workstation with specialized software and support. It is remarkable that a web browser now provides sophisticated 3D structural displays to each of us at no cost.
Databases are constantly being updated, so how can you keep up? Register your sequence with an alerting service, and it will inform you by e-mail when something of interest has occurred. It might be that your sequence of interest has a new homologue, or that it has been mapped. The Sequence Alerting System performs daily searches of new protein sequence databank entries and alerts users to new homologues. When a homologous sequence is entered into the fully annotated and strictly non-redundant Swiss-Prot database, Swiss-Shop sends you an alert. Xref sends an alert if your sequence matches one that has been mapped.
The Internet, and particularly the web browser, freed us from the burdens of installing, maintaining and upgrading special software and databases for research. The burden of thinking remains, even with this selective list of current tools, so caveat emptor. Fortunately, tools continue to improve, and sites become better integrated. As a result, time that we used to spend struggling with software for sequence analysis, and later surfing the web, we can devote to thinking about what the analysis is telling us.