代写Lloyd’s Algorithm, 也就是k-means算法。
Requirement
In finance, we often want to treat things as a group. The hope is that the
grouping will reveal patterns and reduce noise. For example, we know for a
fact that the returns of 7,000 stocks do not represent 7,000 orthogonal
factors. The number of factors driving stock returns is much, much smaller
than that, though the exact number of real factors is a point of some
controversy.
Techniques such as principal components analysis - which is frequently used to
attempt to extract the factors driving stock prices - are beyond the scope of
this class. You will spend some time working with PCA and other approaches
later in the math finance program. In this class, we want to focus on problems
that exercise your programming muscles.
Fortunately, there is a computationally interesting yet relatively simple
clustering technique that we can use to demonstrate the problem of grouping
things. It’s called Lloyd’s Algorithm or the K-means algorithm. The following
is a graphical demonstration of the standard Lloyd’s Algorithm at work.
While the above clustering problem is shown in two dimensions, any number of
dimensions can be used. We can compute distances in two dimensions, or any
number of dimensions.
If we were doing this with a time series of stock returns, each return would
represent a value in one dimension, and each time series would be a point in n
dimensions, where n is the number of returns in that series.
But we’re going to leave this bigger problem alone and focus on a much smaller
problem, a toy problem. As explained in class, though the solution of the toy
problem will be well known to us, it will not be easy to get a computer to
solve it.
The Toy Problem
Suppose a city 99 blocks wide by 99 blocks long. A person stands on every
corner, 10,000 people in all. In other words, there’s a person at {0,0}, a
person at {0,1}, and so on. There is a person at {99,99}, the top right corner
of the map, a person at {0,99}, the top left corner of the map, and a person
at {99,0}, the bottom right corner of the map. 10,000 people in all, evenly
distributed throughout our toy city. (By the way, this area is about twice the
size of Manhattan.)
The problem we want to solve is this: we want to group these people into
clusters of 20, or 500 clusters in all. The solution to this problem is a list
of the centers of these clusters, their geometric centroids.
This problem is called NP hard. It’s a highly combinatorial problem and brute
force solutions are going to fail miserably. For example, suppose we were
finding just two cluster centroids. There are 100 possible x coordinates and
100 possible y coordinates for each of two cluster centroids, or 100 x 100 x
100 x 100 = 100 million possible ways to place the centroids. But we are not
working with two centroids. We are working with 500. Clearly, a brute force
search just won’t work.
Intuitively, we know the solution: we would put one cluster centroid at the
center of every square with the following width and height in blocks.
However, to actually do this, we will use Lloyd’s algorithm. The solution will
not be exact. Lloyd’s Algorithm is a heuristic which will converge to
solutions that may not be globally optimal. However, it will get us pretty
close.
The Objectives
Any algorithm that can be graphically described in 4 slides (as above) is
going to be trivial to implement, so the solution - while it should be correct
- is not the main objective of this homework. The main objective of this
homework is to apply the themes and strategies that were explained to you in
the last lecture.- Your solution should be elegant in its object-oriented design and should lend itself to easy and thorough unit testing using Junit.
- This testing is part of the deliverables of your homework assignment.
- You should build your classes to be reusable. Though the problem is simple, there are definitely reusable components that are part of the solution.
- You should come up with a metric that allows you to measure the quality of the solution as well as the quality of the initial state. Your code should be readable and well documented.
The Twist
As you can see from the graphical representation of Lloyd’s Algorithm, it does
not group points in clusters of fixed size. But what if we want that? What if
we demand that each cluster has 20 people? The problem becomes much more
difficult. To see how much more difficult, you will present two solutions. One
solution will use the standard Lloyd’s approach - points seeking the nearest
cluster - while the second solution will use a modified approach - clusters
seeking the nearest 20 points.
In other words, in Lloyd’s Algorithm, you will iterate over every point - each
representing a person - and, for that point, will find the nearest cluster
centroid. In the modified approach, you will iterate over cluster centroids
and, for each, will find the nearest 20 points.
If you write your code correctly and use all of the principles of good object
oriented design, this modification should be easy and should not require
rewriting much of your code.