代写C程序,处理DNA序列文件,裁剪序列。
Requirement
Write a program that takes a FASTA file of DNA sequences as input. It then
counts the frequencies of every 3-mer sequences.
Read FASTA file format from Wikipedia for a history of FASTA. Since there is
no official definition of FASTA format, here we use the following customized
definition:
- The FASTA file must be an ASCII text file with one or multiple lines.
- The FASTA file may contain one or multiple entries of DNA, RNA, or protein sequences.
- Each entry (except for the first entry) must start with a header line that starts with a greater sign, and followed by one or more sequence lines. The concatenation of the sequence lines provides the sequence of one entry.
- The first entry can either have or have not a header line.
- Whitespace characters can appear anywhere in the file. If the whitespaces are at the beginning of a header line, or anywhere in the sequence lines, they are to be discarded by the reading programs. For example, a line “ A CG T” will be the same as “ACGT”. And a line “ ]Sample sequence” will be the same as “]Sample sequence”.
- Empty lines are to be discarded.
- All lines that start with a semicolon (after removing the whitespaces) are comment lines and are to be discarded.
- For the purpose of this assignment, lower case letters a, c, g, and t are treated as the same as upper case letters A, C, G, and T.
- For the purpose of this assignment, any characters other than the upper and lower case A, C, G, and T in the sequences are not counted towards the 3-mers. For example, “ACCXGGTTT” will have only four 3-mers: ACC, GGT, and GTT, and TTT.
- Sample FASTA files can be downloaded. The yeast.nt.gz there is a good start. Explore the internet to find more data sources.
The input is a FASTA file. The output of your program is to the standard out
in the following format:
AAA 0.0510
AAC 0.0111
Each line is two columns separated by a single whitespace. The first column is
the three letter, and the second column is the frequency. Use rounding to keep
four digits after the decimal points. Lines must be ordered alphabetically.
Things you can and should assume (in developing program and test cases): - Input file size does not exceed 100M bytes. (This isn’t true in reality, but just assumed for sake of simplicity.)
- The file sizes in the test cases you submit should not exceed 10K bytes. However, the TA will also create test cases with larger files to test your program.
- The file has at least one non-empty sequence.
- If a 3-mer has a frequency of 0, it is still included in the output and the frequency is 0.0000.
What to submit?
You should login to your student.cs account and submit your assignment
electronically. You are encouraged to submit multiple draft versions earlier
than the deadline to ensure the final submission is smooth. Only your latest
submission before the deadline will be marked. Your submission should be in a
directory a1/ that contains the following subdirectories and files:
- A subdirectory src/.
All your source codes should be in this directory. Do not include the compiled
files with the source code. - A make file “makefile”.
We will use the command “make makefile” under the directory assn1/ to compile
your program. This should compile your program and generate either an
executable or a jar file in an out/ directory under assn1/. Note that out
directory should not exist in your submission, and is automated generated by
make. - A shell script “run.sh” to run the program.
After make, the “./run.sh inputfile” command should run your program and print
results to the standard output. - A test cases directory test/.
* a. 10 test cases. Each case includes two files x.in and x.out, where x is the number from 0 to 9.
* b. A readme file readme.txt explaining what problem each test case was designed to test against. When your test case is strangely behaved, such as failing everybody’s program, the TA might need to look into the readme file to determine if that is a valid test case.
How do we mark?
We will use a script to
- cd assn1/
- make
- ./run.sh test/0.in > /tmp/tmpfile
- Then we’ll check if tmpfile’s result is the same as test/0.out.
- Step 3 and 4 are repeated for each of the test cases, including yours, the TA’s, and other students’ test cases.
The marks will be based on the test cases and the program. Different test
cases may demonstrate different patterns in failing programs. The following
explains this by showing one example with 6 test cases and 3 programs.
Although there are 6 test cases, there are two pairs that have the same
patterns: 2=5 and 1=4. So there are only 4 different test classes, with
patterns PPP, PFF, FPP, and FFP, respectively.
Suppose your test cases covers different test classes and fails different
students’ programs.
Moreover, your program passes test classes. Additionally, the TA determines
that of your test cases are invalid cases.
So, to maximize your marks, try to develop different and valid test cases that
fail other students’ programs, meanwhile, try to develop a bug-free program
that passes all the test cases.
Additional things you can play with
- There is a tendency that species that are more evolutionarily apart from each other have larger differences in their k-mer frequencies. Try a few species (such as human, mouse, fruit fly, and yeast), and see if this is indeed true.
- Try to do a k-mer frequency for k>3 on a PC. For example, k=20. What problem will this create? How do you solve it?
These additional things will not be marked but just to make you feel good
while programming for the assignment.