MBT 599 Homework Assignment 8
Due Friday Mar. 9
- Write a program that does the following, for the same prokaryote genome sequence you have been using in earlier assignments:
- Compute a codon score matrix, as follows:
- Find the codon frequencies in the annotated coding
sequences for the organism (the .ffn file). (In assignment 2, you
found the codon counts; convert these to frequencies.)
- Find the frequencies of nucleotide triplets in the entire genome and its complement; these will be taken as the "background frequencies".
- The integer score of a codon (in half-bits) is defined to be
twice the log (base 2) of (codon frequency/triplet frequency), rounded
to the nearest integer. For the stop codons, use a score of -50.
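
As a hint, the score matrix computation might be sketched as follows. This is only a sketch under stated assumptions: the function names are illustrative, sequences are assumed to be uppercase, the background is taken over overlapping triplets at every position on both strands, the stop codons are assumed to be TAA, TAG and TGA, and a codon with zero frequency (for which the log is undefined) is arbitrarily given the stop score.

```python
import math
from collections import Counter
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard stop codons
STOP_SCORE = -50


def reverse_complement(seq):
    """Return the reverse complement of an uppercase ACGT sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def triplet_counts(seq, step):
    """Count ACGT-only triplets; step=3 gives in-frame codons, step=1 all triplets."""
    counts = Counter()
    for i in range(0, len(seq) - 2, step):
        triplet = seq[i:i + 3]
        if set(triplet) <= set("ACGT"):
            counts[triplet] += 1
    return counts


def to_frequencies(counts):
    """Convert a table of counts into a table of frequencies."""
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()}


def background_frequencies(genome):
    """Triplet frequencies over both strands (assumption: overlapping triplets)."""
    both = triplet_counts(genome, 1) + triplet_counts(reverse_complement(genome), 1)
    return to_frequencies(both)


def codon_scores(codon_freq, background_freq):
    """Half-bit scores: round(2 * log2(codon frequency / triplet frequency))."""
    scores = {}
    for codon in map("".join, product("TCAG", repeat=3)):
        f = codon_freq.get(codon, 0.0)
        b = background_freq.get(codon, 0.0)
        if codon in STOP_CODONS or f == 0.0 or b == 0.0:
            # Zero-frequency handling is an assumption, not part of the assignment.
            scores[codon] = STOP_SCORE
        else:
            scores[codon] = round(2 * math.log2(f / b))
    return scores
```

The codon frequencies would come from `to_frequencies(triplet_counts(cds, 3))` summed over the sequences in the .ffn file. A codon that is twice as frequent in coding sequences as in the background gets a score of 2.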
- Find high-scoring candidate coding segments in the genome as follows:
- For each of the 6 reading frames (3 reading frames on each strand x 2 strands), divide the sequence into non-overlapping triplets in that frame, and attach a score to each triplet using the score matrix you created above. Then find the highest-scoring segment, using the same type of algorithm you used in HW 4, except now applied to triplets rather than single nucleotides. After finding the highest-scoring segment in a given frame, mask it out and search the remainder of the sequence (in the same frame) to find the next highest-scoring segment, and so on. Stop the search in each frame when you have found all segments with a score of at least 25.
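
As a hint, the per-frame search might be sketched as follows, assuming the HW 4 algorithm was a single-pass maximal-scoring-segment scan. Masking the best segment and searching the remainder is done here by searching the stretches to its left and right separately, which yields the same set of segments. The names `frame_scores`, `best_segment` and `find_segments` are illustrative, and giving a score of 0 to triplets missing from the score matrix (e.g. ones containing N) is an assumption.

```python
def frame_scores(seq, frame, score_of):
    """Scores of the non-overlapping triplets of seq starting at offset frame (0, 1 or 2)."""
    # Assumption: triplets not in the score matrix get a score of 0.
    return [score_of.get(seq[i:i + 3], 0) for i in range(frame, len(seq) - 2, 3)]


def best_segment(scores, lo, hi):
    """Highest-scoring segment within scores[lo:hi] as (score, start, end), end exclusive."""
    best = (float("-inf"), lo, lo)
    running, start = 0, lo
    for i in range(lo, hi):
        if running <= 0:
            # A non-positive prefix can never help, so restart the segment here.
            running, start = 0, i
        running += scores[i]
        if running > best[0]:
            best = (running, start, i + 1)
    return best


def find_segments(scores, threshold=25):
    """All segments of score >= threshold, found by repeated best-segment search."""
    found = []
    pending = [(0, len(scores))]  # unmasked stretches still to be searched
    while pending:
        lo, hi = pending.pop()
        if lo >= hi:
            continue
        score, start, end = best_segment(scores, lo, hi)
        if score < threshold:
            continue
        found.append((score, start, end))
        # The found segment is now masked: search only to its left and right.
        pending.append((lo, start))
        pending.append((end, hi))
    return sorted(found, reverse=True)
```

For the three frames on the other strand, the same functions can be applied to the reverse complement of the genome. The segment positions returned here are triplet indices within one frame, not nucleotide coordinates.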
- Find a best set of non-overlapping "coding" segments, as follows (similar to HW 2, except that segments are ranked by score rather than by ORF length): Sort the list of all segments you found (in all 6 reading frames) in decreasing order of score. Work through the list, discarding any segment that overlaps a higher-scoring segment (in any frame) and keeping the remainder.
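
As a hint, the selection step might be sketched as follows. Segments from different frames and strands must first be put into one coordinate system; the hypothetical helper `to_forward_coords` maps a range of triplet indices in a given frame to half-open nucleotide coordinates on the forward strand. The sketch assumes that a segment is discarded only when it overlaps a higher-scoring segment that was itself kept; if overlap with any higher-scoring segment should count, the comparison would run against all earlier segments instead.

```python
def to_forward_coords(genome_len, strand, frame, first, last):
    """Map triplets [first, last) in a frame to forward-strand nucleotides [start, end)."""
    start, end = frame + 3 * first, frame + 3 * last
    if strand == "-":
        # Positions on the reverse complement are mirrored on the forward strand.
        start, end = genome_len - end, genome_len - start
    return start, end


def select_non_overlapping(segments):
    """Greedy choice over (score, start, end) tuples, highest score first."""
    kept = []
    for seg in sorted(segments, key=lambda s: s[0], reverse=True):
        _, start, end = seg
        # Assumption: only overlap with an already kept segment discards seg.
        if all(end <= k_start or start >= k_end for _, k_start, k_end in kept):
            kept.append(seg)
    return kept
```

For example, with segments `(30, 0, 10)`, `(20, 5, 15)` and `(10, 12, 20)`, the second is discarded because it overlaps the first, and the third is kept under the assumption above. The pairwise comparison is quadratic in the number of segments, which may be slow for a whole genome.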
- Generate a histogram of the scores of all of the remaining segments, and a histogram of their lengths (number of codons).
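
As a hint, a compact plain-text histogram might be sketched as follows. The function name, the bin width and the bar scaling are all illustrative choices, not requirements of the assignment.

```python
from collections import Counter


def text_histogram(values, bin_width, max_bar=50):
    """Plain-text histogram of integer values, one line per non-empty bin."""
    if not values:
        return ""
    # Each value falls into the bin starting at the multiple of bin_width below it.
    bins = Counter((v // bin_width) * bin_width for v in values)
    peak = max(bins.values())
    lines = []
    for lo in sorted(bins):
        # Scale bars so that the fullest bin is max_bar characters wide.
        bar = "*" * max(1, round(bins[lo] * max_bar / peak))
        lines.append(f"{lo:>6}-{lo + bin_width - 1:<6} {bins[lo]:>6} {bar}")
    return "\n".join(lines)
```

Calling `text_histogram(scores, 10)` on the segment scores and `text_histogram(lengths, 50)` on the segment lengths (in codons) would give the two histograms. Empty bins are omitted to keep the output short.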
- Your output should include the following:
- Name of organism.
- Codon frequency table (in the usual format for codon tables, i.e. with TTT in the upper left-hand corner).
- Nucleotide triplet frequency table (same format as the codon frequency table).
- Codon score table (same format as the codon frequency table).
- Histograms of segment scores, and of segment lengths.
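
As a hint, the three tables might be printed with a sketch like the following. It assumes that the "usual format" is the standard codon table layout with bases in the order T, C, A, G: the first base selects a block of four rows, the second base selects the column, and the third base selects the row within the block. The function name and number formatting are illustrative.

```python
def format_codon_table(values, fmt="{:.4f}"):
    """Lay out a codon -> value table in 16 rows of 4 columns, TTT at the upper left."""
    bases = "TCAG"
    lines = []
    for first in bases:
        for third in bases:
            cells = []
            for second in bases:
                codon = first + second + third
                cells.append(codon + " " + fmt.format(values[codon]))
            lines.append("  ".join(cells))
        lines.append("")  # blank line between blocks sharing a first base
    return "\n".join(lines).rstrip()
```

For the integer score table, a format such as `fmt="{:>4d}"` keeps the columns aligned. The `values` dictionary is assumed to contain all 64 triplets.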
- Email the output above to me and Joe. Please make it as compact
as possible. Do NOT send the code itself. Include the output in the
body of your email message (as plain text), NOT as an attachment.