CODDLE and PARSESNP Glossary
For an introduction to CODDLE, please see the CODDLE Introduction. For an introduction to PARSESNP and a description of the formats in which PARSESNP variants can be entered, please see the PARSESNP Introduction.
CODDLE and PARSESNP Input Fields
- Protein/Gene Name
- An optional identifier which is included in the output.
- Exon positions [a.k.a. CDS statement or gene model]
- The positions of the start and end of each protein coding exon in the submitted genomic sequence. These exon position statements, or CDS statements, are found in many genomic sequence entries in GenBank [for example, look in this 100+ Kb fragment of genomic sequence and follow any of the hyperlinks labeled gene].
Entries should be in GenBank format (white space, '<' and '>' will be ignored)
join(start1..end1,start2..end2,start3..end3)
complement(join(start1..end1,start2..end2,start3..end3))
start1..end1
complement(start1..end1)
where start1 and end1 are nucleotide positions in the submitted genomic sequence.
A request with overlapping exons cannot be processed and will return an error message. A request with more than one CDS cannot be processed and will return an error message.
The start of the first CDS exon is assumed to be the first position in a codon, but does not have to be an ATG. Thus a user may need to 'pad' the input sequence with the nucleotide 'n' when working with a partial genomic sequence. If stop codons are predicted in the middle of the CDS [as a result of translating in the wrong frame, for example], an error message is returned. The end of the last CDS exon must be the third position in a codon. It can be, but does not have to be, a stop codon.
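The CDS statement rules above can be sketched in a few lines of Python. This is only an illustration, not the parser CODDLE or PARSESNP actually uses; the function names `parse_cds_statement` and `coding_length` are hypothetical. It accepts the four forms listed above, ignores white space, '<' and '>', and rejects overlapping exons.

```python
import re

def parse_cds_statement(statement):
    """Return (strand, exons) for a GenBank-style CDS statement.

    strand is '+' or '-' ('-' when the statement is wrapped in
    complement(...)); exons is a list of (start, end) pairs as given.
    """
    # White space, '<' and '>' are ignored.
    text = re.sub(r"[\s<>]", "", statement)
    strand = "+"
    match = re.fullmatch(r"complement\((.*)\)", text)
    if match:
        strand, text = "-", match.group(1)
    match = re.fullmatch(r"join\((.*)\)", text)
    if match:
        text = match.group(1)
    exons = []
    for part in text.split(","):
        match = re.fullmatch(r"(\d+)\.\.(\d+)", part)
        if match is None:
            raise ValueError(f"unrecognized exon range: {part!r}")
        start, end = int(match.group(1)), int(match.group(2))
        if start > end:
            raise ValueError(f"exon start after end: {part!r}")
        exons.append((start, end))
    # A request with overlapping exons cannot be processed.
    ordered = sorted(exons)
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start <= prev_end:
            raise ValueError("overlapping exons")
    return strand, exons

def coding_length(exons):
    """Total number of coding nucleotides covered by the exons."""
    return sum(end - start + 1 for start, end in exons)
```

For a complete CDS that starts at the first position of a codon and ends at the third, `coding_length(exons) % 3` should be 0; a non-zero remainder is a quick hint that the gene model or the padding needs another look.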
Whether working with CODDLE or PARSESNP, good results can only be had with an accurate gene model. When the positions of the CDS exons are not predicted or are mispredicted in the GenBank entry, the user must create the correct CDS statement. To do this a user needs, in addition to the genomic sequence, a protein sequence, or a nucleotide sequence based on a full length cDNA or on a reliable contig of EST sequences, or both.
Sequin is the most comprehensive tool available for extracting and reformatting gene models. Unlike the web-based methods, it takes a (minimal) effort to install and use the application on a user's computer, but it delivers properly formatted CDS statements as part of its GenBank format output.
To deduce the positions of the exons, BLAST2 will create a local alignment between a nucleotide sequence, the cDNA, and the genomic sequence. DIY Sequence Comparison uses WU-BLAST to compare nucleotide or protein sequence to the genomic sequence.
- Genomic sequence
- The sequence from which exons of the CDS will be parsed. Fasta format or raw sequence will be accepted. Spaces and numbers will be ignored.
Many browser configurations limit the information [thus the number of nucleotides] that can be pasted in an input box. When inputting a genomic sequence >20Kb, we recommend that you first save the genomic sequence in fasta format to your hard drive then use the "Browse" button to upload the file.
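As a rough illustration of the accepted input, the sketch below reduces pasted text to a bare sequence. It assumes that FASTA header lines begin with '>' and that anything other than letters (spaces, position numbers, line breaks) can be dropped; the name `clean_sequence` and the upper-casing are choices made for this example, not documented behavior of the tools.

```python
def clean_sequence(text):
    """Strip FASTA headers, spaces and numbers from a pasted sequence."""
    # Drop FASTA header lines (assumed to start with '>').
    body = [line for line in text.splitlines() if not line.startswith(">")]
    # Keep letters only, so spaces and position numbers are ignored.
    return "".join(ch for ch in "".join(body) if ch.isalpha()).upper()
```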
- URL from NCBI
- For some genes the genomic sequence and the exon positions of the gene are conveniently [and correctly!] provided through GenBank at NCBI. For example, from within the hyperlink for the Arabidopsis BAC F13C5 find the gene hyperlink for AT4g19020, aka the CMT3 gene from Arabidopsis. When the URL is available, use it as a shortcut to avoid pasting in the exon positions and genomic sequence.
If using an NCBI entry with more than one CDS, please copy out the CDS statement you want to use and paste it into the Exon/Intron Position Statement box.
- GenBank formatted file
- CODDLe can extract the genomic sequence and the exon positions of the CDS from a GenBank file. Use Sequin to combine genomic sequence, protein sequence and cDNA information into a GenBank formatted file. Save the file to your hard drive and use the "Browse" button to upload the file.
If using a GenBank file with more than one CDS, please copy out the CDS statement you want to use and paste it into the Exon/Intron Position Statement box.
- Genetic Code
- A user should choose the correct genetic code for their gene from the list. If a GenBank entry or GenBank formatted file is being used, however, the genetic code will be set automatically using the information provided there.
- Initial Amino Acid Position
- If working with a fragment of a gene, a user may enter the number of the first amino acid coded for by the fragment so that numbering is consistent with other data sources. This is especially important if using PARSESNP to process variants in amino acid sequence. If the fragment begins with an incomplete codon, give the position of that incomplete amino acid in the protein.
- Initial Codon Position
- If working with a fragment of a gene, the user should use this field to specify the position of the first exonic nucleotide in its codon; 1 means the first nucleotide in a codon, 2 the second, and 3 the third. Failure to do this will cause a frameshift, though this will probably be detected due to the presence of a non-terminal stop codon in the translated product.
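One way to picture how the Initial Codon Position and Initial Amino Acid Position fields work together is the sketch below. It pads the fragment with 'n' so that the first exonic nucleotide falls at the stated codon position, then numbers the codons from the stated first amino acid. The function names are hypothetical and the padding approach is an assumption made for illustration; the tools may handle the frame differently internally.

```python
def pad_to_frame(fragment, codon_position):
    """Prefix 'n' so the fragment's first base sits at codon_position (1-3)."""
    if codon_position not in (1, 2, 3):
        raise ValueError("codon position must be 1, 2 or 3")
    return "n" * (codon_position - 1) + fragment

def number_codons(fragment, codon_position=1, first_amino_acid=1):
    """Return (amino_acid_number, codon) pairs for the padded fragment.

    The incomplete leading codon, if any, receives first_amino_acid.
    Trailing bases that do not fill a codon are left out.
    """
    padded = pad_to_frame(fragment, codon_position)
    usable = len(padded) - len(padded) % 3
    return [(first_amino_acid + i // 3, padded[i:i + 3])
            for i in range(0, usable, 3)]
```

For example, `number_codons("TGGCA", codon_position=2, first_amino_acid=50)` treats the leading `TG` as the second and third bases of residue 50 and `GCA` as residue 51.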
- Homology Model
- In order to evaluate the severity of missense changes, some kind of homology model is required; this can be any combination of Blocks Family, User Defined Blocks, and a Sequence Alignment.
- Block Family Identifier
- In order to evaluate missense changes, the CDS can be compared to one or more Blocks families. A user can enter multiple Block families by separating the identifiers with commas. The Blocks identifiers will be of the form IPB001234 or PR00123. To search for Blocks families for your gene, go to the Blocks web site and search by keyword, or go through the CODDLE and PARSESNP Input Preprocessor to search the Blocks database using Reverse PSI-BLAST.
- User Defined Blocks
- The user may want to create a properly formatted Blocks entry from a collection of protein sequences related to the query. BlockMaker will correctly format the user blocks.
Use this option when the protein is a member of a subfamily and blocks extracted from the subfamily better represent information about conservation of residues than do blocks from the family entry in the Blocks+ database. Blocks Tree Viewer is a useful tool for navigating the hierarchy of the Blocks tree and will provide the collection of sequences in the subfamily from which a new set of more specific Blocks can be extracted.
- Sequence Alignment
- Users may also submit a sequence alignment, in ClustalX, Fasta, or MSF format. In CODDLE, this alignment will be used to make Blocks; in PARSESNP, it can also be used to make Blocks. However, if the alignment contains the translated protein sequence of the gene being examined, PARSESNP will also calculate SIFT scores for each polymorphism entered that results in a missense change.
- Windows Selected
- Users may choose a single window that has the maximum scoring value for the entire coding region or choose multiple windows that cover all of the exonic coding sequence.
Graphics Format
By default the graphics output is created in PNG, or Portable Network Graphics, format. Some users of older web browsers may have problems with this, so we have provided the option to use JPEG format instead; pictures created this way tend to be larger in size and somewhat blurry, however, so PNG is the preferred format.
CODDLE Input Fields
- Sliding Window Size
- Choice of a sliding window size corresponds to the choice of amplicon size and will depend on the method of SNP discovery. For denaturing HPLC the suggested window size might be 300bp. For enzymatic mismatch cleavage the suggested window size might be 600bp.
- Mutation Method
- CODDLe predicts the outcome of each mutation method, by summing the [not necessarily equally weighted] contribution of each of these types of changes to each DNA nucleotide: transitions - G:C -> A:T and A:T -> G:C, transversions - A:T -> T:A, G:C -> C:G, G:C -> T:A, A:T -> C:G, and transitions at CpG dinucleotides. TILLING users should select one of the TILLING options from the bottom of the menu.
Additional mutation methods could easily be added, if there were an interest in CODDLInG after ENU mutagenesis in another organism, for example. Contact us at proWeb
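The change types listed under Mutation Method can be expressed compactly in code. The sketch below labels any single-base substitution with its strand-independent class (for example, C to T on one strand is the same event as G to A on the other, so both are `G:C->A:T`) and then looks up a weight for it. The weight table `EMS_LIKE` is purely hypothetical, chosen only because G:C to A:T transitions predominate in EMS mutagenesis; the real weights used by each method are not given here, and CpG context is not modeled.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
PURINES = {"A", "G"}

def change_type(ref, alt):
    """'transition' if purine<->purine or pyrimidine<->pyrimidine."""
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"

def change_class(ref, alt):
    """Strand-independent label such as 'G:C->A:T'."""
    # Express the change on the strand where the reference base is a purine.
    if ref not in PURINES:
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}:{COMPLEMENT[ref]}->{alt}:{COMPLEMENT[alt]}"

# Hypothetical weights: only G:C->A:T transitions are expected.
EMS_LIKE = {"G:C->A:T": 1.0}

def possible_changes(base, weights):
    """List (alt_base, weight) for substitutions with a non-zero weight."""
    changes = []
    for alt in "ACGT":
        if alt == base:
            continue
        weight = weights.get(change_class(base, alt), 0.0)
        if weight > 0:
            changes.append((alt, weight))
    return changes
```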
- Scoring System
- By default each nonsense change and each splice junction change is given the score +1. The optimal window would have the greatest number of potential nonsense and splice junction changes.
A more elaborate scoring scheme 'Penalize Silent' may be the better choice when selecting amplicons for dHPLC. Each potential silent change is given the score -1 and nonsense and splice junction changes are given positive scores in proportion to their frequency of occurrence. The optimal window maximizes the opportunity to discover nonsense and splice junction changes, while minimizing the discovery of silent changes.
PARSESNP Input Fields
- Making Blocks from an Alignment
- In PARSESNP, users have the choice to create Blocks from a multiple alignment as part of their submission. The Blocks are created using the Blocks Multiple Alignment Processor, and have a minimum width of 10 and a maximum width of 55. If this box is not checked, the alignment will be used only in the determination of SIFT scores.
- Variants from HGMD URL
- This field allows users to process variants from an HGMD entry. Currently only mutations of the type "Nucleotide substitutions (missense / nonsense)" are supported; please submit the URL of a page containing that kind of mutation.
- Variants from SwissProt Entry
- This field allows users to process variants from a SwissProt entry. Variants are stored in SwissProt as changes in protein sequence, so it's important that residue numbering be consistent. Amino acid changes that cannot be caused by a single nucleotide change will be skipped; those that can be caused by more than one single nucleotide change will be displayed more than once. Enter either the ID of a SwissProt entry, e.g. HBB_HUMAN, or its AC, e.g. P02023.
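The rule for amino acid changes (skipped when no single nucleotide change can cause them, listed more than once when several can) can be illustrated as follows. The sketch assumes the standard genetic code, whereas the tools let the user select another one, and the function name `single_nucleotide_routes` is hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def single_nucleotide_routes(codon, target_aa):
    """List single-base changes that turn codon into a codon for target_aa.

    Each route is (position_in_codon, ref_base, alt_base, new_codon).
    An empty list means the change needs more than one substitution.
    """
    routes = []
    for i, ref in enumerate(codon):
        for alt in BASES:
            if alt == ref:
                continue
            mutated = codon[:i] + alt + codon[i + 1:]
            if CODON_TABLE[mutated] == target_aa:
                routes.append((i + 1, ref, alt, mutated))
    return routes
```

With the initiator codon ATG, a change to valine has exactly one route (A to G at the first position), a change to isoleucine has three (any other base at the third position), and a change to tryptophan has none.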
- Variants from dbSNP FASTA File
- With this field, users can upload a FASTA file of variants downloaded from dbSNP.
- Variants from Text File
- In this field, users can submit a TSV (tab-separated values) file of variants. All empty lines and lines beginning with a pound sign (#) are skipped. Each remaining line should contain three fields, separated by tabs:
- The variant, in one of the acceptable formats.
- A description of the variant.
- A mark in the third column if the change in the first column is an amino acid change (e.g. M1V). Any printing character except for 0 in this column indicates an amino acid change. For more information on the formats for entering variants, please see the PARSESNP Introduction.
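The three-column layout described above might be read like this. It is a minimal sketch, not PARSESNP's own reader: the function name and the dictionary keys are hypothetical, and the variant strings are passed through untouched because their formats are defined in the PARSESNP Introduction rather than here.

```python
def parse_variant_lines(text):
    """Parse tab-separated variant lines into a list of dictionaries."""
    variants = []
    for line in text.splitlines():
        # Empty lines and lines beginning with '#' are skipped.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        fields += [""] * (3 - len(fields))  # tolerate missing columns
        variant, description, mark = (field.strip() for field in fields[:3])
        variants.append({
            "variant": variant,
            "description": description,
            # Any printing character other than 0 marks an amino acid change.
            "amino_acid": mark not in ("", "0"),
        })
    return variants
```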
Input Preprocessor Fields
- Coding Sequence Position Information
- To work with your gene, the positions of the exons must be determined. If you already know them, you can enter this information directly into the Exon/Intron Position Statement field using the format described above. GenBank entries usually contain this information, as do GenBank formatted files output from Sequin; this information is read in automatically from these sources. If none of these is available, the Input Preprocessor can attempt to determine the gene model from amino acid sequence or from cDNA sequence.
- Amino Acid Sequence
- Use this field to submit the translated product of your gene, from the start codon, coded as an M, to the stop codon, coded as a *. It will be searched against the genomic sequence using a custom-written program in an attempt to determine the gene model. If ambiguities or errors are found, an error message will report the program's preliminary findings, allowing the user to correct them by hand.
- cDNA Sequence
- Use this field to submit the cDNA sequence of your gene, from start codon to stop codon. The Sim4 program will be used to attempt to determine the gene model. If it runs into problems, the Sim4 output will be reported, so that the user can correct them and resubmit.
CODDLE Output
- Subsequence extracted
- When the genomic sequence is longer than the length specified by the CDS statement, or when the CDS statement indicates the need to translate the reverse complement of the submitted genomic sequence, a subsequence is extracted. The position numbers on the output page and on the subsequent Primer3 pages are positions in the extracted subsequence.
- Scoring for Protein Truncation and mRNA Disruption
- At each nucleotide position, the potential nucleotide changes are calculated based on the mutation method. Those mutations which introduce a nonsense change or a change to the first two or last two positions in the intron contribute to the default scoring scheme: count truncations. Potential changes in the 'middle' of the intron do not contribute to this score [no attempt is made to identify branchpoint sequences, for example].
In the alternative scoring scheme: penalize silent, a base change resulting in a silent change is given the score -1. This is a fixed "cost" for having to identify an uninteresting base change. All other scoring is relative to the fixed silent change score of -1. The score assigned to nonsense changes is assessed by calculating how frequently nonsense changes could occur relative to silent changes. As an example, given the codon usage of Arabidopsis thaliana and the fact that G:C to A:T transitions predominate in EMS mutagenesis, silent changes in CDS should be discovered 30% of the time and nonsense changes 5%, thus silent changes are scored -1 and nonsense changes +6. Currently, the splice junction scores are based on frequency of splice sites in an average CDS; thus a splice site change is scored +4 in Arabidopsis under EMS mutagenesis.
In both scoring schemes potential missense changes receive the score 0, as a priori they should not affect mRNA stability or protein length. Their potential effect on gene product function is handled separately.
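The Arabidopsis/EMS figures quoted above are consistent with scoring a change by how much rarer it is than a silent change. The sketch below reproduces only that arithmetic; the formula is inferred from the worked numbers (30% silent, 5% nonsense, score +6) and is an assumption, not a documented part of CODDLe. The function name is hypothetical.

```python
def relative_score(silent_frequency, change_frequency):
    """Score a change relative to the fixed silent-change cost of -1.

    Assumed reading of the example: rarer changes earn proportionally
    larger positive scores.
    """
    return round(silent_frequency / change_frequency)

# Penalize-silent scores for the Arabidopsis/EMS example in the text.
PENALIZE_SILENT = {
    "silent": -1,                              # fixed cost
    "missense": 0,                             # handled separately
    "nonsense": relative_score(0.30, 0.05),    # evaluates to 6
    "splice": 4,                               # value quoted in the text
}
```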
- Plot of exons and sliding window totals
- The method of mutation and the protein/gene identifier are indicated at the top of the plot.
Exons are indicated as open boxes and introns as a single line positioned on the gene sequence. Nucleotide position numbers are indicated on the x-axis at the bottom of the plot.
If Block matches are found, they are indicated above the exon/intron plot. A single letter above corresponds to the letter of the Blocks match.
Potential to discover a base change that disrupts a gene is indicated by the score on the y-axis. The greater the score, the greater the potential of discovering a disruption. A sliding window sum is calculated from the scores assigned to each base [see Scoring for Protein Truncation and mRNA Disruption]. The width of the window is user defined [see Sliding Window Size]. The score for a point on the plot is for the window centered at that position on the genomic sequence.
- Plots can be compared among genes from the same organism. Codon usage will vary within a gene and among genes. By default the high-scoring region is analyzed for each window size. In the event of a tie for high-score, the region closest to the start of the CDS is chosen.
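The sliding window sum and the tie rule can be sketched as below. The code assumes a list with one score per nucleotide of the extracted subsequence, with index 0 at the start of the CDS side; the function names are hypothetical. Each window total is reported against the window's center, and when several windows share the high score the earliest one is kept.

```python
def window_totals(scores, window):
    """Return the sum of scores for every full window, by start index."""
    if window < 1 or window > len(scores):
        raise ValueError("window must fit inside the sequence")
    total = sum(scores[:window])
    totals = [total]
    for start in range(1, len(scores) - window + 1):
        # Slide by one base: add the entering score, drop the leaving one.
        total += scores[start + window - 1] - scores[start - 1]
        totals.append(total)
    return totals

def best_window(scores, window):
    """Return (start, center, total) of the high-scoring window.

    max() keeps the first maximum, so ties go to the earliest window.
    """
    totals = window_totals(scores, window)
    start = max(range(len(totals)), key=totals.__getitem__)
    return start, start + window // 2, totals[start]
```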
- Scoring of Missense Changes
- Only missense changes in segments that are similar to the user-selected blocks are scored. In this scoring scheme nonsense changes are uninteresting and are treated like silent changes.
For each possible missense change in the segment, the observed amino acid and the potential missense-changed amino acid are compared to the Position Specific Scoring Matrix (PSSM) for the Block at that position. The difference is calculated between the PSSM score of the observed amino acid and the PSSM score of the possible missense change. The score for the Block is the [weighted] sum of the score differences for each possible missense change over the length of the matched Block.
The greater the Block PSSM difference score, the greater the potential of finding a deleterious missense change in that region. Longer Blocks tend to have higher scores than shorter Blocks and this is consistent with a greater potential of discovering a deleterious missense change in a region with a longer string of conserved residues. As the similarity of the gene to the Block model increases the Block PSSM difference score tends to increase.
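A toy version of the Block PSSM difference score may help. In the sketch below a PSSM is a list of dictionaries, one per Block position, mapping amino acids to scores. The data layout, the function names, and the default weight of 1.0 are assumptions made for illustration; the actual weighting is not specified here.

```python
def pssm_difference(pssm_row, observed, changed):
    """PSSM score of the observed residue minus that of the changed one."""
    return pssm_row[observed] - pssm_row[changed]

def block_difference_score(pssm, segment, changes, weight=lambda pos, aa: 1.0):
    """Sum the (weighted) differences over a matched Block.

    pssm    -- list of {amino_acid: score}, one entry per Block position
    segment -- observed amino acids aligned to the Block
    changes -- {position: amino acids reachable by a missense change}
    """
    total = 0.0
    for pos, observed in enumerate(segment):
        for changed in changes.get(pos, ()):
            total += weight(pos, changed) * pssm_difference(
                pssm[pos], observed, changed)
    return total
```

A large positive difference means the observed residue fits the Block much better than its replacement, which is why such changes are flagged as likely to be deleterious.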
Note that in a PSSM each row of the scoring matrix corresponds to a position in the reference amino acid alignment and provides the scores associated with substituting each one of the twenty amino acids at that position in the alignment. The transformation from the actual protein sequence alignment into a PSSM is done using the odds ratios between the amino acid frequencies observed in the multiple alignment and the frequencies expected from protein databases. The contribution each sequence makes to the PSSM is modulated using position-based sequence weights so that multiple overly similar sequences contribute less to the PSSM, while diverged proteins in the alignment contribute more.
Note that changes at the initiator ATG codon and at the true stop codon of the coding sequence are likely to have profound effects on gene function. Changes at the initiator ATG are likely to interfere with efficient translation initiation while changes at the true stop codon may allow translation of a longer and less stable protein product. These potential changes are not explicitly evaluated by CODDLe. When the user discovers that the 'best' region in the query gene is near the start or the end of the CDS, the user should select an amplicon [using the examine a section option] which would allow discovery of changes at the initiator ATG or the true stop codon.
- Table of Block Hits
- The name of the user-determined Block family is indicated at the top of the table. The score for each Block is a sum of score differences explained in Scoring of Missense Changes. Width is the amino acid residue width of each block. Position is the position of each block in the genomic (sub)sequence. The start and end positions of the subregion which has the greatest potential for deleterious missense changes are also indicated.
- Examine a Section
- A user-selected region of any window size of the genomic sequence is displayed to extensively indicate the potential changes that might be discovered. The user-selected start and end positions of the region are indicated at the top of the analysis.
- Redo Analysis
- Use this feature to reevaluate a particular user-selected subregion. The subsequence is replotted and the high-scoring window in that subsequence is analyzed.
A high-scoring region of the genomic sequence is displayed to extensively indicate the potential changes that might be discovered. The start and end positions of the region are indicated at the top of the analysis.
- Analysis of a (High-Scoring) Region
- A region of the genomic sequence is displayed to extensively indicate the potential changes that might be discovered. The region may have been selected automatically as high-scoring or chosen by user. The start and end positions of the region are indicated at the top of the analysis.
The genomic sequence is indicated. Because codons are grouped as triplets followed by a space and are not split by line breaks, the number of nucleotides in each line varies. The number at the end of a line indicates the position of the last nucleotide in the genomic (sub)sequence.
In regions corresponding to exons, above the nucleotide sequence of a codon is the corresponding single letter amino acid code. The Block hits are underlined and those amino acid residues are colored to indicate their similarity to the Block. Each residue is compared to the PSSM for the Block. A residue colored green indicates a positive similarity score. A residue colored black indicates a score between 0 and -2. A residue colored red indicates a score of < -2. A Block hit with few green residues indicates a weak similarity to the Block.
Below each nucleotide is an indication of the changes that can be detected. A blank under a nucleotide indicates that it's not likely to be changed by the mutation method. In introns, splice junction changes are indicated by a pound sign (#) and silent changes by a caret (^). In exons, missense changes are indicated by single letter amino acid code, nonsense changes by an asterisk (*), and silent changes by an equals sign (=). In the region of a Block hit, the missense changes are also compared to the PSSM for that Block and the residues are colored by a similar scoring scheme: green indicates a negative PSSM difference score [and a change that is not likely to be disruptive], black indicates a score between 0 and 10, red indicates a score > 10 [and a change that is likely to be deleterious].
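The two color schemes used in this display can be summarized in code. The function names are hypothetical, and treating the boundary values (exactly 0, -2, or 10) as black is an assumption, since the text only gives the ranges.

```python
def residue_color(similarity_score):
    """Color of an observed residue inside a Block hit."""
    if similarity_score > 0:
        return "green"   # positive similarity to the Block
    if similarity_score >= -2:
        return "black"   # between 0 and -2
    return "red"         # below -2

def missense_color(difference_score):
    """Color of a potential missense change inside a Block hit."""
    if difference_score < 0:
        return "green"   # not likely to be disruptive
    if difference_score <= 10:
        return "black"   # between 0 and 10
    return "red"         # likely to be deleterious
```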
- Create primers for this window
- This button spawns a Primer3 input window where certain fields are automatically filled in: 1) the genomic (sub)sequence, 2) the sequence ID, 3) the product size, 4) the primer size, 5) the primer Tm and 6) the included regions.
By default, primer Tm is set to an optimum of 70°C and the optimal primer length to 24 nucleotides.
Once CODDLe has selected an optimal window, Primer3 is used to select forward and reverse primers at or near that window's edges. More precisely, the selected window size and the selected region determine the product size and the included region used in Primer3. Let w be the selected window size and s be the start position of the chosen region. The product size parameters for Primer3 will be set at Min:(w - w/10), Opt:(w), Max:(w + w/10). The included regions parameter will be set at Start,Width:(s-w/10),(w+w/5).
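The parameter arithmetic above is easy to check with a small function. In this sketch `w` and `s` are the window size and start position defined in the text; the function name and the returned dictionary keys are hypothetical, and integer division is assumed for w/10 and w/5. No clamping is applied, so a start position smaller than w/10 would yield a negative region start.

```python
def primer3_window_parameters(w, s):
    """Product size range and included region derived from a window.

    w -- selected window size
    s -- start position of the chosen region
    """
    tenth = w // 10
    return {
        # Min, Opt, Max product size
        "product_size": (w - tenth, w, w + tenth),
        # Start, Width of the included region
        "included_region": (s - tenth, w + w // 5),
    }
```

For a 600bp window starting at position 1000, this gives a product size of 540 to 660 (optimum 600) and an included region starting at 940 with width 720.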
- Translated sequence
- The deduced amino acid sequence. As a check, a user can compare it to the known sequence.
- Search for Similar BLOCKS Entries
- An opportunity to compare the deduced amino acid sequence to the Blocks+ database. The resulting Blocks+ hit(s) will be useful when analyzing missense changes in the query sequence.
- cDNA sequence
- The deduced cDNA sequence. As a check, a user can compare it to the known sequence and to the EST database.
Created 1 August 2000, last modified 14 January 2003
© 2000-2003 The proWeb Project.