- In this assignment, you will create several randomized data sets
corresponding to the data set from HW3, and then run the
maximal D-segment algorithm on them.
- First, write a program that simulates a sequence of read
starts, at the same number of genomic positions as in the original
data. The distribution of read start counts (the number of
positions with 0, 1, 2, and >=3 read starts) should be, on
average, the same as in the original data. The following
pseudocode demonstrates one approach to creating this
sequence:
```
N = number of sites in original sequence
counts[r] = number of sites with r read starts in original sequence
for each site 1...N
    x = random number between 0 and 1 (uniform distribution)
    if x < counts[0] / N
        randomized_counts[site] = 0
    else if x < (counts[0] + counts[1]) / N
        randomized_counts[site] = 1
    else if x < (counts[0] + counts[1] + counts[2]) / N
        randomized_counts[site] = 2
    else
        randomized_counts[site] = 3
```
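
For concreteness, here is a minimal runnable sketch of the same sampling scheme in Python. The function name and the `counts` argument (a histogram mapping r = 0, 1, 2, 3 -- with 3 standing for >=3 -- to the number of sites with r read starts) are illustrative choices, not part of the assignment:

```python
import random

def simulate_read_starts(counts):
    """Sample a randomized per-site read-start sequence whose distribution
    of 0, 1, 2, and >=3 read starts matches the observed histogram on
    average. counts: dict mapping r (0, 1, 2, 3; 3 means >=3) -> # sites."""
    n = sum(counts.values())      # N, the number of sites
    c0 = counts[0] / n            # cumulative probability of 0 read starts
    c1 = c0 + counts[1] / n       # ... of 0 or 1
    c2 = c1 + counts[2] / n       # ... of 0, 1, or 2
    sim = []
    for _ in range(n):
        x = random.random()       # uniform in [0, 1)
        if x < c0:
            sim.append(0)
        elif x < c1:
            sim.append(1)
        elif x < c2:
            sim.append(2)
        else:
            sim.append(3)         # 3 stands in for >=3 read starts
    return sim
```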

This randomization tends to eliminate the clustering of read starts due to copy number variation, while still preserving the overall distribution of read start counts. As a result, this approach is expected to be more conservative than simply placing read starts at random locations across the sequence. The reason for doing things this way is to allow for the fact that factors other than CNVs, such as library amplification, can also cause read starts to cluster at a particular site.

- Run your maximal D-segment algorithm on 10 different
randomizations of the read start sequence with D =
-5. Identify segments using thresholds S = 5, 10, 15, 20, and
25. Use the following read-start scoring scheme (different from HW3!!); one formulation of the D-segment scan is sketched after this list:
- score for 0 reads: -0.96997
- score for 1 read: 0.12729
- score for 2 reads: 0.94169
- score for >=3 reads: 1.73806
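
For reference, the sketch below shows one common formulation of the maximal D-segment scan in Python; it is not necessarily identical to your HW3 implementation. Keep a running sum of the per-site scores and its running maximum; close the current candidate segment when the sum falls to 0, drops by more than |D| below that maximum, or the sequence ends; report the segment if its maximum reached S. The `SCORE` table encodes the scheme above, and `read_starts` is a made-up toy sequence:

```python
def max_d_segments(scores, D=-5.0, S=5.0):
    """Return maximal D-segments as (start, end, score) triples,
    using 0-based, half-open coordinates."""
    segments = []
    cumul = best = 0.0
    start = best_end = 0
    for i, s in enumerate(scores):
        cumul += s
        if cumul >= best:                 # new running maximum
            best, best_end = cumul, i + 1
        # Close the candidate when the sum hits 0, drops by more than
        # |D| below its maximum, or we reach the end of the sequence.
        if cumul <= 0 or cumul <= best + D or i == len(scores) - 1:
            if best >= S:
                segments.append((start, best_end, best))
            cumul = best = 0.0
            start = best_end = i + 1
    return segments

# Scoring scheme from this assignment; counts of 3 or more score as >=3.
SCORE = {0: -0.96997, 1: 0.12729, 2: 0.94169, 3: 1.73806}

read_starts = [0, 2, 3, 3, 1, 0, 0, 2, 3, 1]    # toy example sequence
scores = [SCORE[min(c, 3)] for c in read_starts]
print(max_d_segments(scores, D=-5.0, S=5.0))    # one segment, score ~5.41
```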

- Run your maximal D-segment algorithm on the 'real data' sequence of read starts used in assignment 3 with the same value of D, and the scoring scheme described above. Identify segments using the same set of thresholds.
- Generate two tables, one for the simulations and one for
the real data. The tables should have one row for each value of S, with the
following information:
- S-value
- N_seg(S), the average number of D-segments found for this threshold S (note that this is the number of segments with score *at least* S, **not** *exactly* S)
- the minimum score of all D-segments found
- the maximum score of all D-segments found
- the ratio of N_seg(S) / N_seg(previous S)

The last column is the ratio of the average number of segments found with the current row's S to the average number found with the previous row's S. Note that the averaging matters only for the simulation table, since you will run the algorithm just once on the real data. All values should be given to **two decimal places**. If there is no value to print (as in the last column of the first row, or when there are no D-segments and thus no minimum or maximum scores), print -1. See the template file for other formatting details.
- Karlin-Altschul theory predicts that the number of D-segments with score at least S should be proportional to exp(-lambda S), where lambda is the scaling factor that converts scores to log likelihood ratios. Since we defined the scores as (essentially) base-2 LLRs, lambda is the factor that converts base-2 logarithms to natural logarithms, i.e. lambda = ln(2). So if N_seg(S1) and N_seg(S2) are the numbers of D-segments found with score thresholds S1 and S2, then the ratio N_seg(S2)/N_seg(S1) should be approximately exp(-lambda (S2 - S1)) for this lambda; a quick numerical check follows below.
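
The snippet below is one illustrative way to assemble such a table row and to check the Karlin-Altschul prediction numerically; the `results` layout, the function names, and the tab-separated output are assumptions -- follow the template file for the exact format:

```python
import math

def fmt(x):
    # Two decimal places; None (no value to print) becomes -1.
    return "-1" if x is None else f"{x:.2f}"

def print_table(results):
    """results: {S: list of runs, each run a list of D-segment scores}.
    For the real data there is a single run per S."""
    prev = None
    for S in sorted(results):
        runs = results[S]
        n_seg = sum(len(run) for run in runs) / len(runs)  # average over runs
        scores = [s for run in runs for s in run]
        lo = min(scores) if scores else None
        hi = max(scores) if scores else None
        ratio = n_seg / prev if prev else None             # None on first row
        print(f"{S}\t{fmt(n_seg)}\t{fmt(lo)}\t{fmt(hi)}\t{fmt(ratio)}")
        prev = n_seg

# Predicted ratio for thresholds 5 apart: exp(-ln(2) * 5) = 2**-5 = 0.03125.
print(f"{math.exp(-math.log(2) * 5):.2f}")                 # prints 0.03
```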
- Consider the following questions:
- Does this relationship appear to be true for the simulated data?
- Is it true for the real data?
- Would you expect it to be true for the real data?
- What score threshold is a reasonable one to use for the real data, to ensure a very low false positive rate?

- You must turn in your results *and* your computer program. Please put everything into ONE plain text file - do not send an archive of files, a tar file, or a word processing document. Compress it (using either Unix compress or gzip -- if you don't have access to either of these programs, let us know), and send it as an attachment to both Phil and Serena.