Java代写：CS106SequenceAlignment

发布日期: 2020-03-03

DNA序列处理相关作业，练习Java数组的基本用法。

Requirement

This assignment is to be done either individually or in pairs. Do not show
your code to any other group and do not look at any other group’s code. Do not
put your code in a public directory or otherwise make it public. However, you
may get help from the TAs or the instructor. You are encouraged to use the
RPILMS Discussions page to post problems so that other students can also
answer/see the answers.

Sequence Alignment

How to determine the similarity between two DNA sequences, for example,
ACATCTCTC and AGTTTC? We can align them together and accumulate the
differences between pairs of characters. For example, the following is an
alignment of the two sequences.
A C A T C T C T C
A G T T T C - - -
0 1 1 0 1 1 2 2 2
The gap character ‘-‘ is used to make the sequences equal in length. The
difference between the same characters is 0. The difference between different
characters is 1, and the difference between a character and a gap is 2. The
difference of the alignment is the sum of the character distances:
0+1+1+0+1+1+2+2+2 = 10. Two sequences can be aligned in many ways. The
following is another alignment of the sequences whose difference is
0+1+2+0+1+0+0+2+2 = 8.
A C A T C T C T C
A G - T T T C - -
0 1 2 0 1 0 0 2 2
Define the distance of two sequences as the minimum difference of an alignment
of them, which is a good indication of their similarity. We can compute the
distance using dynamic programming. Given two sequences s1 = a1,a2,a3,…,am and
s2 = b1,b2,b3,…,bn, define distance(i,j) (0≤i≤m, 0≤j≤n) as the distance
between sub-sequences a1,a2,…,ai and b1,b2,…,bj. Then distance(i,0) = 2i and
distance(0,j) = 2j, when one sub-sequence is empty. distance(m,n) is the
distance between s1 and s2. In an optimal (minimum difference) alignment of
sub-sequences a1,a2,…,ai and b1,b2,…,bj, there are three possible cases:

ai is aligned with a gap, and the sub-alignment of a1,a2,…,ai-1 and b1,b2,…,bj is optimal. distance(i,j) = distance(i-1,j) + 2.
bj is aligned with a gap, and the sub-alignment of a1,a2,…,ai and b1,b2,…,bj-1 is optimal. distance(i,j) = distance(i,j-1) + 2.
ai is aligned with bj, and the sub-alignment of a1,a2,…,ai-1 and b1,b2,…,bj-1 is optimal. distance(i,j) = distance(i-1,j-1) + (ai==bj ? 0 : 1).
Which indicate that distance(i,j) = min(distance(i-1,j) + 2, distance(i,j-1) +
2, distance(i-1,j-1) + (ai==bj ? 0 : 1)). We can store all the distances in an
(m+1)?n+1) matrix M, distance(i,j) = M[i][j], and compute M[m][n] using the
following pseudocode.
M[i][0] = 2 * i
M[0][j] = 2 * j
for i = 1 to m
for j = 1 to n
M[i][j] = min(M[i-1][j] + 2, M[i][j-1] + 2, M[i-1][j-1] + (ai==bj ? 0 : 1))
endfor
endfor
—|—

Part 1 Concurrent programming.

First, read the comprehensive example of SALSA.
You are given a file sequences.txt containing N (= 100) random DNA sequences,
each sequence on a line. Your task is to find the three most similar pairs of
sequences. You can assume that the length of a sequence does not exceed 100.
Write a concurrent sequence alignment program in SALSA. There should be M
worker actors and one coordinator actor. The worker actors compute the
distances of pairs of sequences and the coordinator actor gives the pairs to
the workers and merges the results. Each worker actor is responsible for
N(N-1)/2M pairs of sequences. All the actors run locally in a single theater
in Part 1. Your program is required to take two arguments: sequences.txt and
the number of worker actors. The command to execute the program should look
like the following:
$ java /* your program / sequences.txt 10
Where / your program */ is the name of your program and 10 is the number of
worker actors. Your coordinator actor should output the top pairs and the
total running time:
sequences: 100, workers: 10, theaters: 1
most similar pairs:
distance(GGGACGATAAAGAAGGAGATTGGATTGCCAAAGCAACATTGTGAGCAAATAAACAGGCATATGGTCATACG
GTCGCATCTCCATGGGATG, TGCCCCCTAGAAAAGGGATATACTGAAACGAAACTATCATGCTACCGAACTCTCTAGGG
ACCATGGTCTAGGCAGCCTGTGTCATATATTT) = 52
distance(AAGTGCCAGTATATCCTACAGAGCGACATGGACCAGAAGCTGACGGCCTCGCTATAAAACATTATCATGTC
CAGATCTCGCGAAGATGCGACTGC, GATGCGAACAACCGTTAGCCACGTTCTGGAACATATACCGAGCGCAGGCCAACC
ATTCGGCAATTCGCCCGGAGCTGCCCAAAAAGCGATGG) = 53
distance(GATGCGAACAACCGTTAGCCACGTTCTGGAACATATACCGAGCGCAGGCCAACCATTCGGCAATTCGCCCG
GAGCTGCCCAAAAAGCGATGG, GTGGCGTAAACCCGTTTGGATGCGAGGCCTGAAGGCCCTTTGTCGCACAACCCCTAG
ACACGTGTCTGCACGAGACGGCCGGAATAGCTAGT) = 53
total running time: ….ms

Part 2 Distributed programming.

Write a distributed version of the program in SALSA based on Part 1. That is,
you should use multiple theaters and distribute worker actors on each theater.
In addition to the arguments used in Part 1, your program must accept another
argument to specify theaters and a nameserver as follows:
$ java /* your program */ sequences.txt 10 theaters.txt
The theaters.txt is a text file, the first line of which is the location of
the nameserver and the rest are locations of theaters. An example of
theaters.txt is here.
Grading: The assignment will be graded mostly on correctness, but code clarity
/ readability will also be a factor (comment, comment, comment!). See the
professor or TAs, if you have ideas for other extensions for this assignment
and would like extra credit for implementing them.
Submission Requirements: Please submit a ZIP file with your code, including a
README file. In the README file, place the names of each group member (up to
two). Your README file should also have a list of specific features / bugs in
your solution. Your ZIP file should be named with your RPILMS user name(s) as
the filename, either userid1.zip or userid1_userid2.zip. Only submit one
assignment per pair via RPILMS.