- In this assignment, you will create several randomized data sets
corresponding to the data set from assignment 3, and then run the
maximal D-segment algorithm on them.
- First, write a program that simulates a sequence of read
starts, at the same number of genomic positions as in the original
data. The distribution of read start counts (the number of
positions with 0, 1, 2, and >=3 read starts) should be, on
average, the same as in the original data. The following
pseudocode demonstrates one approach to creating this
sequence:
    N = number of sites in original sequence
    counts[r] = number of sites with r read starts in original sequence
    for each site 1..N
        x = random number between 0 and 1 (uniform distribution)
        if x < counts[0] / N
            randomized_counts[site] = 0
        else if x < (counts[0] + counts[1]) / N
            randomized_counts[site] = 1
        else if x < (counts[0] + counts[1] + counts[2]) / N
            randomized_counts[site] = 2
        else
            randomized_counts[site] = 3
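The pseudocode above might be implemented in Python roughly as follows (the function and variable names here are illustrative, not part of the assignment):

```python
import random

def randomize_read_starts(original_counts):
    """Simulate a read-start sequence whose per-site count
    distribution matches the original data on average.

    original_counts: list of per-site read start counts
    (values of 3 or more are treated as the ">=3" category).
    """
    n = len(original_counts)
    # counts[r] = number of sites with r read starts (r = 0..3)
    counts = [0, 0, 0, 0]
    for c in original_counts:
        counts[min(c, 3)] += 1
    # Cumulative probabilities for drawing 0, 1, 2, or >=3 starts.
    p0 = counts[0] / n
    p1 = p0 + counts[1] / n
    p2 = p1 + counts[2] / n
    randomized = []
    for _ in range(n):
        x = random.random()  # uniform in [0, 1)
        if x < p0:
            randomized.append(0)
        elif x < p1:
            randomized.append(1)
        elif x < p2:
            randomized.append(2)
        else:
            randomized.append(3)
    return randomized
```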

This randomization tends to eliminate the clustering of read starts due to copy number variation, while still preserving the overall distribution of read start counts. As a result, this approach is expected to be more conservative than simply placing read starts at random positions across the sequence. The reason for doing things this way is to allow for the fact that factors other than CNVs, such as library amplification, can also cause read starts to cluster at a particular site.
- Run your maximal D-segment algorithm on 10 different randomizations of the read start sequence with D = -5. Identify segments using thresholds S = 5, 10, 15, 20, and 25.
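One standard formulation of the maximal D-segment scan can be sketched as below. This is an assumption-laden sketch, not necessarily identical to your assignment 3 implementation; `scores` is assumed to be a precomputed list of per-site log2 likelihood-ratio scores:

```python
def max_d_segments(scores, D, S):
    """Report maximal D-segments in a list of per-site scores.

    A candidate segment accumulates a running sum; it terminates
    when the sum drops to 0 or below, falls to D or below relative
    to its maximum, or the sequence ends. A terminated segment is
    reported if its maximum running sum is at least S.
    """
    segments = []       # (start, end, max_score) tuples, 0-based
    cumulative = 0.0    # running sum since the current segment start
    best = 0.0          # maximum running sum within the segment
    start = end = 0
    for i, s in enumerate(scores):
        cumulative += s
        if cumulative >= best:
            best, end = cumulative, i
        if cumulative <= 0 or cumulative <= best + D or i == len(scores) - 1:
            if best >= S:
                segments.append((start, end, best))
            cumulative, best = 0.0, 0.0
            start = end = i + 1
    return segments
```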
- Run your maximal D-segment algorithm on the 'real data' sequence of read starts used in assignment 3 with the same value of D. Identify segments using the same set of thresholds.
- Generate two tables, one for the simulations and one for
the real data, with one row for each value of S, with the
following information:
- S
- the average number of D-segments found, N_seg(S) (note that this is the number of segments with score *at least* S, *not exactly* S)
- the minimum score of all D-segments found
- the maximum score of all D-segments found
- the ratio of N_seg(S) / N_seg(previous S)
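The table rows above could be assembled with a helper like the following (a sketch; it assumes each run's segments have already been found at the lowest threshold, so that counting scores >= S gives N_seg(S)):

```python
def summarize_segments(score_lists, thresholds):
    """Build one table row per threshold S.

    score_lists: one list of segment maximum scores per run
    (e.g. 10 randomizations, or a single list for the real data).
    Returns (S, avg N_seg(S), min score, max score, ratio) tuples,
    where ratio is N_seg(S) / N_seg(previous S).
    """
    rows = []
    prev_avg = None
    for S in thresholds:
        # Pool the scores meeting this threshold across all runs.
        passing = [s for run in score_lists for s in run if s >= S]
        avg = len(passing) / len(score_lists)
        low = min(passing) if passing else None
        high = max(passing) if passing else None
        ratio = avg / prev_avg if prev_avg else None
        rows.append((S, avg, low, high, ratio))
        prev_avg = avg
    return rows
```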

- Karlin-Altschul theory predicts that the number of D-segments with scores at least S should be proportional to exp(-lambda S), where lambda is the scaling factor to convert scores to log likelihood ratios. Since we defined scores as basically LLRs to base 2, lambda should be the factor to convert from base 2 logs to natural logs---i.e. lambda should be the natural log of 2. So the ratio of the values for the rows corresponding to S1 and S2 in the table of counts should be approximately equal to exp(-lambda (S2 - S1)) for the above lambda. Does this appear to be true for the simulated data? Is it true for the real data? Would you expect it to be true for the real data? What score threshold is a reasonable one to use for the real data, to ensure a very low false positive rate? Answer these questions after the tables, as shown in the template file.
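As a concrete check of the prediction: with lambda = ln 2 and a step of 5 between successive values of S, the expected ratio is exp(-ln(2) * 5) = 2^-5 = 1/32 ≈ 0.031. A quick Python verification:

```python
import math

# Predicted ratio N_seg(S2) / N_seg(S1) under Karlin-Altschul,
# with lambda = ln 2 (scores are log2 likelihood ratios).
lam = math.log(2)
for dS in (5, 10):
    ratio = math.exp(-lam * dS)  # equals 2 ** -dS
    print(f"Delta S = {dS}: predicted ratio = {ratio:.6f}")
```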

- You must turn in your results *and* your computer program. Please put everything into ONE plain text file; do not send an archive of files, a tar file, or a word-processing document. Compress it (using either Unix compress or gzip; if you don't have access to either of these programs, let us know), and send it as an attachment to both Phil and Alex.