这个作业比较有趣,看上去是机器学习算法,但实际上只需要实现Graph的数据结构。
Background
Clustering involves grouping a collection of objects (documents, photographs,
customers, animal species) into coherent groups called clusters so that
objects in the same group are more similar, while objects in different groups
are less similar to one another. In order to do this, we need a measure of
similarity or “distance” between objects, which can be defined depending on
the type of objects. If the objects are points in the physical world (for
example, touristic sites in a map), the distance may be the physical distance
between the points. In other problems, “distance” is defined more abstractly.
“Distance between two species may be taken as the number of years since they
diverged in the course of evolution; we could define distance between two
images in a video stream as the number of corresponding pixels at which their
intensity values differ by at least some threshold.” (Kleinberg and Tardos,
Algorithm Design, 2005). In data mining, data can be described by some main
attributes, and distance can be measured by the ratio between the number of
attributes that do not match over the total number of attributes (simple
matching) or more complex definitions of distance can be used. The clusters
can help identify structure in data to learn something about it. There are
different approaches to clustering. We will explore clustering by maximizing
the space of objects in different clusters, which can be efficiently solved
using minimum spanning tree algorithms. In this assignment you will have
programming practice with graph ADT, Kruskal’s minimum spanning tree algorithm
and union-find data structure.
Clustering of maximum spacing and minimum spanning trees MST
W describe a clustering approach based on maximum spacing (Kleinberg&Tardos,
2005):
Given a set U of n objects, labeled p1, p2, ..., p_n
. For each pair pi
and pj, we have a numerical distance d(pi,pj)
. We only need that this
distance satisfies d(pi,pi)=0, d(pi,pj)>0
for distinct pi and pj, and that
distances are symmetric: d(pi,pj)=d(pj,pi)
.
Given a parameter k, we want to divide the objects in U into k groups. We say
that a k-clustering of U is a partition of U into k nonempty subsets C1, C2, ..., Ck
. We define spacing of a k-clustering to be the minimum distance
between any pair of points lying in different clusters. We want to find a
k-clustering of maximum possible spacing.
Finding a k-clustering of maximum spacing can be accomplished by a greedy
algorithm that works like running the steps of Kruskal’s algorithms until the
spanning forest generated has exactly k components. See the book by Kleiberg
and Tardos (Algorithm Design, 2005, Chapter 4.7) for a proof that this
algorithm solves the maximum spacing k-clustering problem. The following
example shows the result of applying 3-clustering on the example from the
previous page.
Imagine a graph where the vertices correspond to the points and there is an
edge between any two vertices where the weight of the edge is the Euclidean
distance between the points corresponding to the vertices that are endpoints
of the edge. In the example of Figure 2, there are 37 vertices and 37*36/2=666
edges. The first edge added to the spanning tree is marked in the picture as
it joins the vertices that are closest to each other. After adding the first
34 edges in Kruskal’s algorithm we will have 3 clusters. In general, for a
connected graph with n vertices, after adding the first n-k edges we have k
connected components, which will form the k clusters for the k-clustering
problem.
Implementation details and what you must do
You will implement the following classes, whose skeleton has been provided.
MyGraph : a graph using the Adjacency Matrix representation of a graph. It
implements a simplified version of the textbook interface Graph, where we
removed the graph update methods. Explanations are included in MyGraph.java
and you must implement all the operations indicated. Testing must be done
specifically for this class (see example MyGraphTest.java for a sample). Note
that we are using the Adjacency Matrix data structure since it is a simple
structure that is good enough for dense graphs. Indeed, the graphs for the
clustering application (Part 3) are complete graphs so every degree is n-1;
methods like incidentEdges(v) that run in O(degree(v)) for the adjacency lists
data structure would run in O(n) for this type of graphs, so there would be
little advantage in using adjacency lists.
Partition: a data structure that stores clusters of objects of type T and
implements the union-find ADT via linked lists. Initially each object is it is
own cluster. The operation union(x,y) merges the clusters of x and y and the
operation find(x) returns the cluster that object x belongs to. Each cluster
is represented by a Dnode in a doubly linked list of clusters with dummy head
and tail; each Dnode (cluster) links to a singly linked list of Node (each
node containing an element in the cluster). The Diagram below depicts this
data structure representing two clusters{X,A,C} and {Y,B}. This should follow
a similar idea as the sequence implementation of the Partition ADT in page 672
of the textbook. Operations makeCluster and find run in O(1) while operation
union runs in O(min) where min is the size of the smallest of the two clusters
being merged.
KruskalAlgorithms: the main method for you to implement is public LinkedList
kruskalClusters (Graph g, int k, Partition clusters)
It receives a graph g and the desired number of clusters k and computes a
partition of the vertex set storing the vertices in each cluster and returns
the edges on the spanning forest that connects vertices in each cluster.
This method runs some steps of Kruskal’s algorithm until the spanning forest
being created has exactly k components.
If g is the graph with 37 vertices and 666 edges corresponding to the points
in the plane in Figure 2, and if k=3, then clusters will represent the 3 sets
of points circled in Figure 2 and the method will return 34 edges that were
added to the spanning forest.
If g is a graph and k=1, then clusters will represent a set with all the
vertices and the method will return a minimum spanning tree (which by the way
for the example it would have 36 edges).
A pseudocode of this method, based on Kruskal’s pseudocode in the textbook, is
given next.
Algorithm KruskalClusters(G,k)
Input: a simple connected weighted graph G with n vertices and m edges
Output: a spanning forest T equivalent to deleting the last (k-1) edges from a MST of G and a partition P that is a solution to the k-cluster problem.
Consider P a partition of vertices (see the partition ADT in section 14.7.1).
for each vertex v in G do
P.makeCluster(v) // this defines an elementary cluster C(v)={v}.
Initialize a priority queue PQ to contain all edges in G, using the weights as keys.
T = emptyset of edges
while T has fewer than (n-k) edges do
(u,v)=edge returned by PQ.removeMin()
C(u)=P.find(u); C(v)=P.find(v); //find the cluster containing u and the one containing v
if C(u) ≠ C(v) then
Add edge (u,v) to T.
P.union(C(u),C(v)) // Merge C(u) and C(v) into one cluster
return forest edges T and partition P
—|—
Assignment Tasks and Mark Breakdown
- PART 1) Implement the missing methods for class MyGraph which implements the interface Graph where the Integer is a vertex index (0..n-1) and the Double is the edge weight (distance). A skeleton has been provided.
Thoroughly test each method of this class for various types of graphs.
Basic test provided: MyGraphTest1.java - PART 2) Implement the methods find and union of class Partition.
Thoroughly test each method of this class for various examples.
Basic test provided: PartitionTest2.java - PART 3) Implement the method kruskalClusters following the given skeleton and the explanations and pseudocode in section 2.
Thoroughly test each method of this class with different graphs.
Use tests for the k-clustering application.
Basic test provided: MyGraphTest123.java, Clusters2DTest123.java
The remaining 25 marks will be distributed for results of more complete
testing within the 3 parts above, or for any rebalancing of marks we may
decide afterwards.