Write a program that does the following, for the same genome sequence you used in assignment 5:
Compute a site weight matrix using the frequency table for the translation start sites you generated in HW5, together with the genome nucleotide frequencies you found in HW1. Each entry in the weight matrix should be the base-2 logarithm of the ratio of the site frequency of that nucleotide at that position to its genome-wide frequency. Use -99.0 as the weight for any cell that has frequency 0 in the translation start sites.
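As a rough illustration of the weight computation, here is a minimal Python sketch. The function name `weight_matrix`, the list-of-dicts layout for the site frequency table, and the toy frequencies are all assumptions made for this example. They are not taken from HW1 or HW5, so adapt them to whatever data structures you already have.

```python
import math

def weight_matrix(site_freqs, background, zero_weight=-99.0):
    """Return one dict of log2-ratio weights per site position.

    site_freqs: list of dicts (one per position), base -> site frequency.
    background: dict, base -> genome-wide frequency.
    Both layouts are hypothetical; they are chosen only for this sketch.
    """
    matrix = []
    for column in site_freqs:
        row = {}
        for base in "ACGT":
            f = column.get(base, 0.0)
            # log2(site frequency / background frequency), or the
            # sentinel weight when the base never occurs at this position
            row[base] = math.log2(f / background[base]) if f > 0 else zero_weight
        matrix.append(row)
    return matrix

# Toy input with made-up frequencies (two positions only)
site_freqs = [
    {"A": 0.5, "C": 0.25, "G": 0.25, "T": 0.0},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
wm = weight_matrix(site_freqs, background)
```

With these toy numbers, a base that is twice as common at a site position as in the genome gets weight 1, a base at its background frequency gets weight 0, and the base that never occurs gets -99.0.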
Simulate a random sequence that has the same nucleotide frequencies, and the same length, as the original genome sequence. It is OK to assume independence for this (i.e. each successive nucleotide in the sequence can be chosen independently of the preceding nucleotides). Compute the nucleotide counts for the simulated sequence to verify that the frequencies are what you expected.
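One possible way to do the simulation, assuming independence between positions, is sketched below in Python. The function name `simulate_sequence`, the fixed seed, and the example frequencies and length are placeholders for illustration; use your own genome's values.

```python
import random
from collections import Counter

def simulate_sequence(frequencies, length, seed=None):
    """Draw `length` bases independently according to `frequencies`."""
    rng = random.Random(seed)  # a seed makes the run reproducible
    bases = list(frequencies)
    weights = [frequencies[b] for b in bases]
    return "".join(rng.choices(bases, weights=weights, k=length))

# Made-up frequencies and length, for illustration only
background = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
simulated = simulate_sequence(background, 100_000, seed=1)

# Verify the nucleotide counts against the requested frequencies
counts = Counter(simulated)
for base in "ACGT":
    print(base, counts[base], round(counts[base] / len(simulated), 4))
```

The observed frequencies should be close to, but not exactly equal to, the requested ones, since each base is a random draw.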
Using the weight matrix computed above, generate three score histograms (using a bin size of 1 for the scores):
a histogram of the scores of all "true" translation start sites (i.e. the ones used in assignment 5 to construct the site frequency table)
a histogram of the scores of all positions in the actual genome sequence
a histogram of the scores of all positions in the simulated genome sequence
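The scoring and binning for these histograms might look like the Python sketch below. It assumes the weight matrix is stored as a list of dicts (one per position, base -> weight), which is a choice made for this example only. It also assumes the sequence contains only A, C, G and T, and it scores only the strand it is given; decide for yourself how to handle ambiguous bases and the reverse strand.

```python
import math
from collections import Counter

def score_window(window, matrix):
    """Sum the weights of the bases in a window as wide as the matrix."""
    return sum(matrix[i][base] for i, base in enumerate(window))

def score_all_positions(sequence, matrix):
    """Yield the score of every full-width window in the sequence."""
    width = len(matrix)
    for start in range(len(sequence) - width + 1):
        yield score_window(sequence[start:start + width], matrix)

def histogram(scores):
    """Bin scores with bin size 1: bin x counts scores with x <= s < x+1."""
    return Counter(math.floor(s) for s in scores)

# Toy two-position matrix with made-up weights
toy_matrix = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -99.0},
    {"A": -0.5, "C": 0.5, "G": 0.5, "T": -0.5},
]
hist = histogram(score_all_positions("AACGA", toy_matrix))
```

Using `math.floor` rather than `int` matters for negative scores: a score of -0.5 belongs in bin -1, whereas `int(-0.5)` would put it in bin 0.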
Your output should provide:
the name of the organism
the weight matrix (give values to three decimal places)
the nucleotide counts for the simulated sequence
the three histograms. Present these in the following form:
where each row gives the score value x, followed by the number of times a score >= x but < x+1 was observed.
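A compact plain-text rendering of the matrix and histograms could be produced along these lines. The exact layout here (a 0-based position index followed by the A, C, G, T weights, and one "score count" pair per histogram row) is only an assumption for this sketch, as are the helper names and the toy values.

```python
def format_matrix(matrix, bases="ACGT"):
    """One line per position: index, then weights to three decimal places."""
    lines = []
    for position, row in enumerate(matrix):
        values = " ".join(f"{row[b]:.3f}" for b in bases)
        lines.append(f"{position} {values}")
    return "\n".join(lines)

def format_histogram(hist):
    """One line per non-empty bin: score value x, then its count."""
    return "\n".join(f"{x} {hist[x]}" for x in sorted(hist))

# Toy values, made up for illustration
toy_matrix = [{"A": 1.0, "C": 0.0, "G": -0.41504, "T": -99.0}]
toy_hist = {-2: 3, 0: 5, 1: 2}

matrix_text = format_matrix(toy_matrix)
hist_text = format_histogram(toy_hist)
print(matrix_text)
print(hist_text)
```

Note that this sketch omits empty bins to keep the output short; print them with a count of 0 if you prefer an unbroken range of scores.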
Email the output above to me and Joe. Please make it as compact as possible. Do NOT send the code itself. Include the output in the body of your email message (as plain text), NOT as an attachment.