Implement a molecular calculator to practise using third-party Java libraries.
Requirement
This is the last practical exercise and will continue over the remaining weeks
of the course.
In this practical you will implement a real molecular similarity method,
Ultrafast Shape Recognition, to search compound databases for similar molecular
shapes.
This problem involves reading one reference molecule from a file and
calculating a descriptor for it, then reading a series of molecules from a
second file, computing the descriptor for each molecule, and quantifying
the difference between it and the reference. At the end of the run the program
should report the closest molecule and the magnitude of its difference to the
reference. All files will be in SD format, and hydrogens should be completely
ignored in the procedure.
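As a rough illustration of the "ignore hydrogens" rule, here is a minimal sketch. It assumes the element symbols and coordinates have already been copied out of the molecule into plain arrays; the class name `HeavyAtomFilter` and its method are hypothetical and not part of CDK. It also assumes that only the symbol "H" marks a hydrogen.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: keeps only the coordinates of non-hydrogen atoms. */
final class HeavyAtomFilter {
    private HeavyAtomFilter() {}

    /**
     * @param symbols element symbol of each atom, e.g. "C", "H", "O"
     * @param coords  x, y, z coordinates of each atom, in the same order as symbols
     * @return coordinates of every atom whose symbol is not "H" (assumption: only "H" counts as hydrogen)
     */
    static double[][] heavyAtoms(String[] symbols, double[][] coords) {
        if (symbols.length != coords.length) {
            throw new IllegalArgumentException("symbols and coords must have the same length");
        }
        List<double[]> kept = new ArrayList<>();
        for (int i = 0; i < symbols.length; i++) {
            if (!"H".equals(symbols[i])) {
                kept.add(coords[i]);
            }
        }
        return kept.toArray(new double[0][]);
    }
}
```

All later calculations (centre of gravity, distances, statistics) would then work on the filtered array only.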
The descriptor we will calculate consists of 4 triples of numbers. Each triple
consists of 3 statistical measures of distances from a point.
The measures are
- The mean distance from the point (sum of all distances divided by the number of distances)
- The variance of this distance (sum of the squares of (distance - mean), all divided by the number of distances minus 1)
- The skew of this distance (sum of the cubes of ((distance - mean) / standard deviation), all divided by the number of distances). The standard deviation is the square root of the variance.
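The three measures above can be sketched as follows. This is an illustrative version written directly from the definitions in the list, not the example methods supplied with the course; the class name `DistanceStats` is hypothetical. It assumes at least two distances are supplied, and it treats a set of identical distances as having zero skew.

```java
/** Hypothetical helper: mean, variance and skew of a set of distances. */
final class DistanceStats {
    private DistanceStats() {}

    /** Mean: sum of all distances divided by the number of distances. */
    static double mean(double[] d) {
        double sum = 0.0;
        for (double x : d) {
            sum += x;
        }
        return sum / d.length;
    }

    /** Variance: sum of squared deviations from the mean, divided by (n - 1). */
    static double variance(double[] d) {
        double m = mean(d);
        double sumSquares = 0.0;
        for (double x : d) {
            sumSquares += (x - m) * (x - m);
        }
        return sumSquares / (d.length - 1);
    }

    /** Skew: sum of cubed ((distance - mean) / standard deviation), divided by n. */
    static double skew(double[] d) {
        double m = mean(d);
        double sd = Math.sqrt(variance(d));
        if (sd == 0.0) {
            return 0.0; // assumption: identical distances are treated as unskewed
        }
        double sumCubes = 0.0;
        for (double x : d) {
            double z = (x - m) / sd;
            sumCubes += z * z * z;
        }
        return sumCubes / d.length;
    }
}
```

For the distances 1, 2, 3, 4 this gives a mean of 2.5, a variance of 5/3 and a skew of zero, because the values are symmetric about the mean.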
The four points we use to calculate these from are
- The centre of gravity (COG)
- The closest atom position to the COG
- The furthest atom position from the COG
- The furthest atom position from point 3 above.
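One way to locate the four points is sketched below. To keep the sketch free of library dependencies it works on plain `double[3]` coordinates rather than `Point3d` objects. The class `ReferencePoints` and its method names are hypothetical. It also assumes the centre of gravity is the unweighted average of the atom positions; a mass-weighted centre would need atomic masses as well.

```java
/** Hypothetical helper: finds the four reference points of a set of atom positions. */
final class ReferencePoints {
    private ReferencePoints() {}

    /** Straight-line distance between two 3D points. */
    static double distance(double[] a, double[] b) {
        double dx = a[0] - b[0], dy = a[1] - b[1], dz = a[2] - b[2];
        return Math.sqrt(dx * dx + dy * dy + dz * dz);
    }

    /** Assumption: unweighted average of all atom positions. */
    static double[] centreOfGravity(double[][] atoms) {
        double[] c = new double[3];
        for (double[] p : atoms) {
            for (int k = 0; k < 3; k++) {
                c[k] += p[k];
            }
        }
        for (int k = 0; k < 3; k++) {
            c[k] /= atoms.length;
        }
        return c;
    }

    /** Atom position with the smallest distance to the given point. */
    static double[] closestTo(double[] point, double[][] atoms) {
        double[] best = atoms[0];
        for (double[] p : atoms) {
            if (distance(point, p) < distance(point, best)) {
                best = p;
            }
        }
        return best;
    }

    /** Atom position with the largest distance to the given point. */
    static double[] furthestFrom(double[] point, double[][] atoms) {
        double[] best = atoms[0];
        for (double[] p : atoms) {
            if (distance(point, p) > distance(point, best)) {
                best = p;
            }
        }
        return best;
    }

    /** Returns {COG, closest to COG, furthest from COG, furthest from point 3}. */
    static double[][] fourPoints(double[][] atoms) {
        double[] cog = centreOfGravity(atoms);
        double[] closest = closestTo(cog, atoms);
        double[] furthest = furthestFrom(cog, atoms);
        double[] furthestFromFurthest = furthestFrom(furthest, atoms);
        return new double[][] {cog, closest, furthest, furthestFromFurthest};
    }

    /** Distances from one reference point to every atom, ready for the statistics. */
    static double[] distancesFrom(double[] point, double[][] atoms) {
        double[] d = new double[atoms.length];
        for (int i = 0; i < atoms.length; i++) {
            d[i] = distance(point, atoms[i]);
        }
        return d;
    }
}
```

Calling `distancesFrom` once for each of the four points yields the four sets of distances from which the mean, variance and skew are taken, giving the 12 numbers of the descriptor.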
To calculate the difference between one set of 12 doubles and another, simply
do the equivalent of a distance calculation, but over all 12 numbers.
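A possible reading of this step is sketched below. It assumes "distance calculation" means the ordinary straight-line (Euclidean) distance, extended from 3 coordinates to 12. The class `DescriptorSearch` and its methods are hypothetical names used only for illustration.

```java
import java.util.List;

/** Hypothetical helper: compares 12-number descriptors and finds the closest one. */
final class DescriptorSearch {
    private DescriptorSearch() {}

    /** Assumption: Euclidean distance taken over every element of the descriptors. */
    static double difference(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("descriptors must have the same length");
        }
        double sumSquares = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sumSquares += diff * diff;
        }
        return Math.sqrt(sumSquares);
    }

    /** Index of the candidate with the smallest difference to the reference, or -1 if there are none. */
    static int indexOfClosest(double[] reference, List<double[]> candidates) {
        int best = -1;
        double bestDifference = Double.POSITIVE_INFINITY;
        for (int i = 0; i < candidates.size(); i++) {
            double d = difference(reference, candidates.get(i));
            if (d < bestDifference) {
                bestDifference = d;
                best = i;
            }
        }
        return best;
    }
}
```

In the full program the candidates would not need to be stored in a list: keeping a running "best name" and "best difference" while reading the second file gives the same result with less memory.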
Remember that we know how to read SD files from a previous practical; however,
here is a reminder.
In order to access the CDK library you will need some import statements
import org.openscience.cdk.CDKConstants;
import org.openscience.cdk.Molecule;
import org.openscience.cdk.DefaultChemObjectBuilder;
import org.openscience.cdk.io.iterator.IteratingMDLReader;
import org.openscience.cdk.io.MDLWriter;
import org.openscience.cdk.interfaces.*;
import java.io.FileInputStream;
To read a single molecule from an SD file you could use something like
IteratingMDLReader MDLReader = new IteratingMDLReader(new FileInputStream(RefFile), DefaultChemObjectBuilder.getInstance());
// take the first molecule in the file as the reference
if (MDLReader.hasNext()) {
    mymol = (Molecule) MDLReader.next();
}
To read a sequence of molecules from an SD file
MDLReader = new IteratingMDLReader(new FileInputStream(ScrFile), DefaultChemObjectBuilder.getInstance());
// visit every molecule in the file in turn
while (MDLReader.hasNext()) {
    mymol = (Molecule) MDLReader.next();
}
MDLReader.close();
To get the name of a Molecule object (here called m1)
String Name = String.valueOf(m1.getProperty(CDKConstants.TITLE));
To get its number of atoms
int natoms = m1.getAtomCount();
You can get each atom in a molecule by
IAtom myatom = m1.getAtom(i);
where i is the index of the atom (counting from 0).
You can get the chemical symbol from each atom
String s1 = myatom.getSymbol();
You can get the coordinates as a Point3d object by
Point3d mypoint = myatom.getPoint3d();
(To use the Point3d class you have to import javax.vecmath.Point3d.)
The Point3d class has a method called distance, which returns the distance
between the calling instance and its argument, so
Point3d a,b;
…
double d = a.distance(b);
In addition to the usual criteria of functionality, readability, comments and
a readme file, I request that you prepare a document called plan.txt in which
you write a simple logic plan for the program.
In order that you don’t get bogged down in the statistics I have given you a
set of example methods to calculate mean, variance and skew.