PARSESNP Introduction

PARSESNP (Project Aligned Related Sequences and Evaluate SNPs) is a web-based tool for the analysis of polymorphisms in genes. It determines the translated amino acid sequence from a reference DNA sequence (genomic or cDNA) and a gene model, and the effects of the supplied polymorphisms on the expressed gene product. If a homology model is provided, predictions can be made as to the severity of missense changes.

Variants can be read in from a number of databases, including HGMD, SwissProt, and dbSNP. When using variants from these databases, be sure to check that the reference DNA sequence being used corresponds exactly to the variants being processed; numbering, especially in protein sequences, can often be inconsistent. When they are included in a GenBank record, variants will also be read in from an NCBI URL or GenBank file (even if entered through the input preprocessor). In addition, users can enter their own variants manually, both through the variant file upload option and through the web form that is presented after the first PARSESNP form submission. If more than five variants are to be entered manually in this way, change the "No. of variants to enter by hand" field to the appropriate value before submitting the initial form.

Variants can be entered in any of the following formats:

Nucleotide change at known position
Entered as reference nucleotide, position in reference DNA sequence, variant nucleotide, for example A105T.
Ambiguous nucleotides may be used for the variant nucleotides. A change back to the reference nucleotide will be ignored, so the variant A105R is the same as A105G (except that it will be marked as heterozygous in the zygosity column).
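
The handling of ambiguous nucleotides described above can be illustrated with a short Python sketch. It is not PARSESNP's own code: the function name expand_point_change and the returned tuple layout are hypothetical, and only the standard IUPAC ambiguity codes are assumed.

```python
import re

# Standard IUPAC ambiguity codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def expand_point_change(variant):
    """Return (ref, pos, alt, zygosity) tuples for an entry such as 'A105R'.

    Hypothetical helper: the name and return layout are illustrative only.
    """
    m = re.fullmatch(r"([ACGT])(\d+)([A-Z])", variant.upper())
    if m is None or m.group(3) not in IUPAC:
        raise ValueError(f"not a single-nucleotide change: {variant!r}")
    ref, pos, code = m.group(1), int(m.group(2)), m.group(3)
    # A change back to the reference nucleotide is ignored.
    alts = [base for base in IUPAC[code] if base != ref]
    # Entries made with an ambiguous nucleotide are treated as heterozygous.
    zygosity = "heterozygous" if len(IUPAC[code]) > 1 else "homozygous"
    return [(ref, pos, alt, zygosity) for alt in alts]
```

With this sketch, A105R expands to the single change A105G marked heterozygous, while A105T stays a single homozygous change.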

Amino acid change at known position
Entered as reference amino acid, position in reference amino acid sequence, variant amino acid, for example M17V.
Be sure to check the protein variant box in the rightmost column of the form, or to put a printing character in the third column of your file, if using this format. The use of ambiguous amino acid codes is not supported. If the amino acid change indicated cannot be caused by a single nucleotide change, it will be ignored; if it can be caused by multiple single nucleotide changes, each change will be displayed separately.
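
As a rough illustration of the rule above, the following Python sketch lists every single-nucleotide substitution in a reference codon that produces a requested amino acid. It assumes the standard genetic code and that the caller has already looked up the reference codon; the function name single_base_routes is hypothetical and does not come from PARSESNP.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO)}

def single_base_routes(ref_codon, new_aa):
    """List (offset, old_base, new_base) substitutions that yield new_aa.

    Hypothetical helper. An empty list means the amino acid change cannot
    be caused by a single nucleotide change; several entries mean that
    several different nucleotide changes give the same amino acid change.
    """
    routes = []
    for offset, old in enumerate(ref_codon):
        for new in "ACGT":
            if new == old:
                continue
            codon = ref_codon[:offset] + new + ref_codon[offset + 1:]
            if CODON_TABLE[codon] == new_aa:
                routes.append((offset, old, new))
    return routes
```

For a reference codon ATG (M), a change to V has one route (A to G at the first base), a change to I has three routes at the third base, and a change to W has none.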

DNA with exactly one change relative to the reference sequence
Entered as that string of DNA, for example ATGATGATG.
If the string doesn't match the reference sequence with just one change, or if it matches in multiple places, an error will be reported. An ambiguous nucleotide can be used to indicate the change; one of the nucleotides it codes for must be the one in the reference sequence, and the others will be introduced as separate variants.
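
The matching rule above can be sketched in Python as follows. This is an illustration only, not the tool's implementation: the function name locate_single_change is hypothetical, ambiguous nucleotides are not handled, and positions are assumed to be 1-based.

```python
def locate_single_change(reference, query):
    """Find the one place where query differs from reference by one base.

    Hypothetical helper. Returns (ref_base, position, variant_base) with a
    1-based position, and raises ValueError if there is no such place or
    more than one.
    """
    hits = []
    for start in range(len(reference) - len(query) + 1):
        window = reference[start:start + len(query)]
        diffs = [i for i, (r, q) in enumerate(zip(window, query)) if r != q]
        if len(diffs) == 1:
            hits.append((start, diffs[0]))
    if len(hits) != 1:
        raise ValueError(f"expected exactly one match, found {len(hits)}")
    start, offset = hits[0]
    return reference[start + offset], start + offset + 1, query[offset]
```

For a made-up reference GGCATGATCATGTT, the string ATGATGATG matches at one place with one difference and is reported as C, position 9, G.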

DNA sequence with explicit change
Entered as a section of sequence prior to the change, a [, the reference sequence for the changed region, a /, the variant sequence, a ], and finally a section of sequence following the change, for example ATG[A/G]TGTAA.

Insertions
Insertions can be indicated using several of the formats above: simply put more than one nucleotide in the variant sequence section, for example A105ATT or ATG[A/ATT]TGTAA.

Deletions
Deletions are supported using several of the formats above. Use a : to indicate a deletion in the 'nucleotide change at known position' format, for example A105:; deletions of more than one nucleotide can be indicated by putting a number indicating the number of bp to be deleted following the colon, for example A105:3. That could also be represented by putting the entire region to be deleted before the position, for example ATG105:.
Deletions can also be entered using the 'DNA sequence with explicit change' format. Here a - (to be consistent with dbSNP) should be used to indicate a deletion, for example ATGATG[A/-]TGTAA. Deletions of multiple nucleotides can be indicated by putting the entire region to be deleted before the slash, as in ATGATG[ATG/-]TAA.
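
The equivalence of the positional deletion notations can be shown with a small Python sketch. The function name parse_positional_deletion is hypothetical, and letting an explicit count after the colon take precedence over the written reference bases is an assumption, not documented behavior.

```python
import re

def parse_positional_deletion(variant):
    """Return (position, bases_deleted) for entries such as 'A105:3'.

    Hypothetical helper covering only the colon notation.
    """
    m = re.fullmatch(r"([ACGT]+)(\d+):(\d*)", variant.upper())
    if m is None:
        raise ValueError(f"not a positional deletion: {variant!r}")
    ref, pos, count = m.group(1), int(m.group(2)), m.group(3)
    # Assumption: an explicit count after the colon takes precedence;
    # otherwise every reference base written before the position is deleted.
    deleted = int(count) if count else len(ref)
    return pos, deleted
```

Under these assumptions A105:3 and ATG105: both describe a deletion of 3 bp starting at position 105, and A105: describes a deletion of 1 bp.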

Multiple nucleotide changes
Multiple nucleotide changes can be entered using the same formats as the previous insertions and deletions; both the reference sequence and the variant sequence sections can be more than one nucleotide in length, and there is no requirement that they be the same length. For example, both ATG105TAG and ATG105TAGG are valid. Similarly, both ATG[ATG/TAG]TAA and ATG[ATG/TA]TAA are valid.
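
A minimal Python sketch of reading the 'DNA sequence with explicit change' format follows. The class and function names (BracketChange, parse_bracket_change) are hypothetical, and the regular expression only covers the forms shown in the examples on this page.

```python
import re
from typing import NamedTuple

class BracketChange(NamedTuple):
    """Hypothetical container for the four parts of a bracketed entry."""
    left: str   # sequence before the change
    ref: str    # reference sequence for the changed region
    alt: str    # variant sequence ('' for a deletion)
    right: str  # sequence after the change

    def reference_text(self):
        return self.left + self.ref + self.right

    def variant_text(self):
        return self.left + self.alt + self.right

def parse_bracket_change(entry):
    m = re.fullmatch(r"([ACGT]*)\[([ACGT]+)/([ACGT]+|-)\]([ACGT]*)",
                     entry.upper())
    if m is None:
        raise ValueError(f"not a bracketed change: {entry!r}")
    left, ref, alt, right = m.groups()
    # A '-' after the slash marks a deletion, as in dbSNP.
    return BracketChange(left, ref, "" if alt == "-" else alt, right)
```

For ATG[ATG/TA]TAA this gives the reference text ATGATGTAA and the variant text ATGTATAA; the reference and variant sections may differ in length.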

Combinations of simpler changes
Combinations of simpler changes, separated by commas, are also allowed; they will be considered together when determining restriction enzyme polymorphisms, but each codon will be considered separately when determining missense changes. For example, ATG105TAG above could also have been entered as A105T,T106A.
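
The equivalence between a multiple nucleotide change and a comma-separated combination can be sketched in Python. The function name decompose is hypothetical, and the sketch only handles changes in which the reference and variant sections have the same length.

```python
def decompose(ref, pos, alt):
    """Rewrite an equal-length change as comma-separated single changes.

    Hypothetical helper: positions that do not actually change are dropped.
    """
    if len(ref) != len(alt):
        raise ValueError("reference and variant must have the same length")
    parts = [f"{r}{pos + i}{a}"
             for i, (r, a) in enumerate(zip(ref, alt)) if r != a]
    return ",".join(parts)
```

Here decompose("ATG", 105, "TAG") returns "A105T,T106A"; the third base is unchanged, so it does not appear in the combination.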
The second box on the web form used to enter the variants (and the second column in a text file) is the description of each change. Users are encouraged to enter a description here, since variants are reported in the order in which they appear on the sequence, not the order in which they are entered.

The PARSESNP output can seem a bit intimidating at first glance, but it's really quite harmless. At the top of the page is the gene name entered by the user, followed by a list of the Blocks families that were used as a homology model. This is followed by one or more images showing the locations of polymorphisms on the genomic and coding sequence of the gene.

The images of variants positioned on the gene read from left, the start of the gene, to right, the end of the gene. The first line of the image, the green boxes with a line through the middle, shows the locations of Blocks on the gene. If a block spans an intron, the middle line continues through the intron, but the top and bottom lines are only present in exonic sequence. The second line, the orange boxes connected by lines, shows the location of coding exons on the gene. The boxes are the exons, and the thin lines represent the introns. In a graph of the coding sequence, the locations of introns are represented by vertical orange lines, but the introns themselves are not shown. The third section of the graph shows the location of the polymorphisms. A polymorphism in an exon is represented by an upward-pointing triangle, while a polymorphism in an intron is represented by a downward-pointing triangle. The first row of triangles, colored red, shows the location of nonsense and splice junction changes. The second row, colored black, shows the location of missense changes, and the third row, colored purple, shows the location of silent changes. The total length of the sequence displayed on the graph is shown at the end of the sequence.

This is followed by a table of the variants, in the order they are found on the sequence. For each variant, there is a link to the location of the change on the genomic sequence (the "G" link) and, if the variant is in a coding region, on the cDNA sequence (the "C" link). Each variant also shows the change in nucleotide sequence, the effect on translation or splicing, and a list of restriction enzyme polymorphisms caused by the change. If a Blocks family was provided as a homology model, the PSSM difference score is shown for a missense change that falls within a Block; similarly, if a protein sequence alignment containing the reference sequence was provided, SIFT scores are provided for each missense change. This is followed by the user-supplied description of the change, or a statement of how the variant was entered if that isn't immediately obvious from the nucleotide change or effect columns. The final column lists the zygosity of the change; changes entered using an ambiguous nucleotide are considered heterozygous, others are homozygous.

The table is followed by a link to download the information from the table in a tab-separated-value (TSV) text file. If a Blocks model was provided and variants were found in a region covered by a Block, an option to search 3D Blocks to view the variants on a 3D protein structure is provided. If a Blocks model was not supplied, the user has the option of submitting one and reprocessing the submitted polymorphisms.

The last element of the PARSESNP output is the detailed display of the reference sequence with the polymorphisms shown. Both the genomic and coding sequences are shown; introns are represented by a continuous run of lower case nucleotides in the genomic sequence, and intron locations are represented by a vertical bar in the coding sequence. The main portion of the display is a series of lower case letters, the DNA sequence, with a series of upper case letters, the amino acid sequence, above it. The DNA sequence is broken up into codons in coding regions. At the end of each line of the amino acid sequence is the position of the last amino acid shown, and at the end of each line of the DNA sequence is the position of the last nucleotide shown.

Block hits are shown on the amino acid sequence. The name of the Block, the MAST p-value for the match between the Block and the reference sequence, and the information content of the Block are all shown on the line above the amino acid sequence. The amino acids in the block are shown as underlined amino acids; note that Blocks may be interrupted by an intron, in which case the underlining stops for the duration of the intron. How well the reference sequence matches the block is shown by the coloring of the amino acids in the reference sequence; those colored green are most similar to the corresponding column in the aligned Block and have a PSSM score greater than 2, those colored red have a PSSM score less than 0, and those colored black have an intermediate score.
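
The residue coloring thresholds above can be written as a small Python sketch. The function name residue_color is hypothetical; only the thresholds come from the description.

```python
def residue_color(pssm_score):
    """Color of a reference residue inside a Block, by its PSSM score."""
    if pssm_score > 2:
        return "green"   # most similar to the aligned Block column
    if pssm_score < 0:
        return "red"
    return "black"       # intermediate score, 0 through 2 inclusive
```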

Variants are shown below the DNA sequence. The nucleotide change is shown directly under the affected nucleotide. This is followed, for changes in coding regions, by an indication of the effect of the change, in the form original amino acid, amino acid position, and new amino acid (* for stop codon). This is followed by a number identifying the variant; the same identifier is used in the table. The variant display is colored according to the severity of its effect. Changes to a stop codon and splice junction changes are colored red, and silent changes are colored black. If a missense change is in a Block, it is colored according to its PSSM difference score: if the score is less than 0, indicating that the variant residue is more similar to the corresponding column of the Blocks alignment than the reference residue, then the change is colored green; if it's greater than 10, it's colored red; otherwise it is colored black. Missense changes outside of a Block are also colored black.
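
The variant coloring rules above can be summarized in a short Python sketch. The function name variant_color and the effect labels ("nonsense", "splice", "missense", "silent") are hypothetical; a PSSM difference of None stands for a missense change outside of a Block.

```python
def variant_color(effect, pssm_difference=None):
    """Display color for a variant, following the rules described above.

    effect: hypothetical label, one of "nonsense", "splice", "missense",
    "silent". pssm_difference: PSSM difference score, or None when the
    missense change does not fall within a Block.
    """
    if effect in ("nonsense", "splice"):
        return "red"
    if effect == "missense" and pssm_difference is not None:
        if pssm_difference < 0:
            return "green"  # variant residue fits the Block better
        if pssm_difference > 10:
            return "red"
    # Silent changes, missense changes outside a Block, and missense
    # changes with a score from 0 through 10 are all shown in black.
    return "black"
```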

For more help filling out the PARSESNP web form, please visit the PARSESNP glossary.


Created 14 January 2003, last modified 1 April 2003

© 2003 The proWeb Project.