代写一个分类器框架,实现分类器模型,并对数据进行处理。
Before Dive in
- Programming assignments take more time, and it counts more in your final grade, so please start early.
- This is an individual assignment. You can discuss this assignment in Piazza, including the performance your implementation can achieve on the provided dataset, but please do not work together or share codes/parameters.
- Related libraries or programs can be found on-line, but you are prohibited to use these resources directly. The purpose of this programming assignment is to learn how to build a classification framework step by step.
- You will need to make decisions about parameter values when implementing these algorithms (e.g., depth of decision tree, number of trees in random forest, etc.). Please do not ask or share parameter settings on Piazza. You need to tune the parameters and find the best setting for each dataset yourself. You need to briefly describe why and how you find these parameter settings in your final report.
- You can use C/C++ or Java or Python2/3 as your programming language (No, you cannot use Matlab or C#. And no, Scala is not Java). Your programs should be able to compile correctly in EWS Linux environment. This assignment will be graded automatically using an auto-grader so please make sure you follow all the input and output definitions strictly. The auto-grader will try to compile your source code and check the output of your programs. The auto-grader also has plagiarism detection feature, so please do not copy other people’s programs or use modified version of existing resources. A more detailed project organization guidance can be found at the end of the assignment.
- A section titled Frequently Asked Questions can be found at the end of the assignment, and we will keep updating it when more questions emerge on Piazza.
The Big Picture
In this assignment, we will build a general purpose classification framework
from scratch. This framework contains two classification methods: a basic
method and an ensemble method. More specifically, decision tree and random
forest. Given certain training dataset following a specific data format, your
classification framework should be able to generate a classifier, and use this
classifier to assign labels to unseen test data instances.
We will then test our classification framework with several datasets using
both the basic method and the ensemble method, and evaluate the quality of the
classifiers with different metrics.
In order to build this classification framework, we need to finish four steps
as follows.
Step 1
Read in training dataset and test dataset, and store them in memory.
Step 2
Implement a basic classification method, which includes both training process
and test process. Given a training dataset, the classification method should
be able to construct a classifier correctly and efficiently. Given a test
dataset, the classification method should be able to predict the label of the
unseen test instances with the aforementioned classifier.
Step 3
Implement an ensemble method using the basic classification method in Step 2.
The ensemble classification method should also be able to train a classifier
and predict labels for unseen data instances.
Step 4
Test both the basic classification method and the ensemble method with
different datasets. Evaluate their performances and write a report.
We introduced many classification methods in the lecture. However in this
assignment, we pick the following basic classifier and ensemble classifier.
- Decision Tree (gini-index) as basic method, Random Forest as the ensemble version of the classification method.
Note 1
Other classifiers can also fit into this basic-ensemble framework. For
example, Naive Bayes as basic method, and AdaBoost as ensemble method. We do
not cover them in this assignment.
Note 2
Different Random Forest methods can be found online (we introduced two in the
lecture notes). You can choose either one but Forest-RI might be easier to
implement.
Note 3
Many off-the-shelf ML tools can be found online for our classification tasks,
e.g., sk-learn. Since the purpose of this assignment is to go through the
basic classifier and ensemble classifier step by step, it is not allowed to
use any third-party library in your code.
Note 4
The classification tasks we are handling here can be either binary
classification or multi-class classification. Please make your code robust
enough to handle both cases.
Step 1: Data I/O and Data Format
The classification framework should be able to parse training and test files,
load these datasets into memory. We use the popular LIBSVM format i n this
assignment.
The format of
Step 2: Implement Basic Classification Method
In this step, you will implement the basic classification method, i.e.,
decision tree with gini-index as the attribute selection measure (note: the
gini-index score is computed by considering every possible categorical value
of the feature as a branch). Your classification method should contain two
steps, training (build a classifier) and test (assign labels to unseen data
instances). Many existing classification or prediction packages have the
feature to save the learned classifier as a file, so that users can load the
classifier into the framework later and apply the same classifier on different
test datasets asynchronously. However, in this assignment, we are going to
skip this save-classifiers-as-files feature, and assume that the training and
test processes happen in the same run. Before you begin with your
implementation, please first read the Project Organization and Submission
section at the end of the assignment to get a general idea on how to organize
your project and how to name your programs so the auto-grading system can
compile and execute your programs correctly.
The classification method you implemented should be able to take both training
file and test file as input parameters from the command line, use training
dataset to build a classifier and test dataset with labeling information to
evaluate the quality of the classifier. For example, if you are implementing
Decision Tree using C/C++, your program should be able to finish the entire
classification process when you type this command:
./DecisionTree train_file test_file
If you use Java:
java classification.DecisionTree train_file test_file
If you use Python:
python DecisionTree.py train_file test_file
Your program should output (to stdout) k*k numbers (k number per line,
separated by space, k lines total) to represent the confusion matrix of the
classifier on testing data, where k is the count of class labels. These lines
are sorted by the label id. In i-th line, the j-th numbers are the number of
data points in test where the actual label is i and predicted label is j. And
you will be needing these values in Step 4.
An example output can be found as follows (when k=4):
50 30 16 23
12 43 21 9
0 21 71 12
1 23 11 67
Note that if k=2, the output is a 2x2 matrix, which is similar to the table
where the numbers represent true/false positive/negative. For more information
about confusion matrix and how to evaluate performance based on it, please
read https://en.wikipedia.org/wiki/Confusion_matrix
or watch the video.
Step 3: Implement Ensemble Classification Method
In this step, you will implement an ensemble classification method using the
basic classification method in Step 2, i.e., random forest. Very similarly,
your ensemble classification method should take train_file and test_file as
input, and output k*k values in k lines as the confusion matrix so that you
can use these numbers to evaluate classifier quality in Step 4. For example,
you are implementing random forest using C/C++, and the command line of
running your program should look like this:
./RandomForest train_file test_file
If you use Java:
java classification.RandomForest train_file test_file
If you use Python:
python RandomForest.py train_file test_file
Step 4: Model Evaluation and Report
In this step, you will apply both methods you implemented in Step 2 and Step 3
on three real-world datasets and a synthetic dataset. Note that we
preprocessed these datasets and partitioned them into training and test for
you (click on the numbers in the table to download), so please do not use the
original datasets directly.
If you are curious about the the semantic meaning of each attribute or labels,
please read the source links associated with the first three datasets. The
fourth dataset is generated via distributed representation learning on a
synthetic social network (a.k.a, network embedding), where binning is further
performed to generate categorical attributes. You do not need to know these
details in order to finish this assignment.
The main difference between the fourth dataset and the other three is that the
former has a significantly larger number of attributes.
You need to apply both your basic classification method and your ensemble
classification method on each dataset, and calculate the following model
evaluation metrics using the output of your classification methods for both
training and test datasets. Please do this manually or write a separate
program (you do not need to turn this in), by taking the k*k numbers as input.
Do not output these metrics from your classifiers directly since the auto-
grader won’t be able to recognize them.
For overall performance, output:
Accuracy (overall accuracy = sum of the diagonal of your output matrix / sum
of all entries in your output matrix)
For each class, output:
Accuracy, Specificity, Precision, Recall, F-1 Score, F \beta score (\beta = 0.5 and 2)
Note 1
To evaluate the performance on each class, we regard the target class as
positive and all others as negative.
Note 2
The detailed equations of the above measures based on Confusion Matrix can be
found in https://en.wikipedia.org/wiki/Confusion_matrix
Now you are ready to write your report. You should include the following
sections in your report:
- brief introduction of the classification methods in your classification framework;
- all model evaluation measures you calculated above ((1 overall metric + # class * 7 per-class metrics) * (1 on training set + 1 on test set) * 2 methods * 4 datasets);
- parameters you chose during implementation and why you chose these parameters; and
- your conclusion on whether the ensemble method improves the performance of the basic classification method you chose, why or why not.
Project Organization and Submission
The auto-grading process will be performed on a Linux server, so make sure
your source code can be compiled in the EWS Linux environment (or in major
distro like Ubuntu, Mint, Fedora or Arch with a relatively new version of
gcc/g++ or JDK).
If you use C/C++
You must attach a Makefile so that the grader can compile your programs
correctly. If you are a Windows user and use Visual Studio as your IDE, please
add a Makefile after you finish implementation and make sure your code can be
compiled properly using gcc/g++. Your project structure should look like this:
yournetid_assign4/
|—————-report.pdf
|—————-code/
|—————-Makefile
|—————-*.c/c++
After compilation, your Makefile should be able to generate binary target
files. Your binary files should be DecisionTree and RandomForest. Also, as we
discussed in Step 2 and 3, the input arguments for your binary should look
like this:
./DecisionTree train_file test_file
If you use Java
Very similarly, please make sure your code can be compiled in Linux
environment. Your entrance class should be placed under classification
package, and your project structure should look like this:
yournetid_assign4/
|—————-report.pdf
|—————-code/
|—————-other_packages/
|—————-.java
|—————-classification/
|—————-.java
The grader will simply execute javac classification/*.java to compile your
programs, and your program should be able to generate binary files with
corresponding classification method names. The input arguments are defined as
follows.
java classification.DecisionTree train_file test_file
If you use Python
Very similarly, please make sure your code can be executed in Linux
environment. Your project structure should look like this:
yournetid_assign4/
|—————-report.pdf
|—————-code/
|—————-*.py
The grader will run the following command to test your programs:
python DecisionTree.py train_file test_file
After your finish your implementation and report, please compress your project
into zip format (please make sure after unzip the submission, everything will
be in a folder with “yournetid_assign4” as the name rather than layers of
nested folders), and rename your zip file to yournetid_assign4.zip.
Parameter settings for each dataset
Please hardcode your choice of parameters (depth of decision tree, number of
trees in random forest, etc.) for each dataset into your program. In other
words, your program should choose the corresponding parameters according to
the input datasets. Specifically, if train_file (the path to the training set)
contains the string “balance.scale” the parameters tuned for the provided
balance.scale dataset are used in this round of execution. The same goes for
nursery, led, and synthetic.social.
Besides the four provided pairs of training-test files, we also hold some
unseen datasets for the auto-grader to evaluate your implementation. The
unseen datasets differ from the provided datasets in that the training-test
files are resampled, while the raw data used to generate the unseen datasets
are the same as those used to generate the provided datasets. The filename of
any unseen datasets would also contain balance.scale, n ursery, led, or
synthetic.social. As a result, the parameters you selected for the provided
datasets will also work fine with the unseen datasets.
Double check before you submit
Your programs should be able to handle both absolute paths and relative paths
to different training and test files. Do not assume that the files are located
in the same folder of your programs.
The unseen datasets are roughly the same scale as the 4 datasets that we
provided. Make sure that your programs can finish within 3 mins for each
dataset. The auto-grader will shut down your program after 3 mins if no
results are detected (and you will receive 0 points for the corresponding
step).
In the past, when the same auto-grader was used for programming assignment,
around 20% of the submissions would not pass the auto-grader, mostly because
these submissions did not strictly follow the project organization rules we
stated above. In this case, TAs will manually grade these assignments but 6 to
14 points will be taken off from your assignment. Some common mistakes include
using Java without package structure, naming the program by decisiontree
instead of DecisionTree, generating incorrect zip file name or folder name
after unzip, etc.
Now you can submit your assignment through Compass 2g!!
Congratulations, you just finished the 2nd (and last) programming assignment
of CS412!!
Frequently Asked Questions
How to deal with unseen values or unseen features in test dataset?
In Decision Tree, this is hard to deal with during testing. You can ignore a
feature if you never saw it in training however you need to bypass a unseen
value in test data for a seen feature. (Dec. 6 update: in this homework, you
can assume no feature/index unseen in the training set will appear in the test
set; when we say ignoring a feature here, we intend to provide a solution to
the more general application scenario.)
You can do 1) randomly choose a label, 2) randomly choose a path to finish
traversing the tree, or 3) choose a path which most data flow if you kept this
information in your decision tree.
There are more elegant ways to handle such cases but since we did not cover
them in the lecture notes, you can go with one of the naive ways we mentioned
above.
Should random forest use overlapped attributes or overlapped subsets?
Yes.
How to handle duplicate tuples in training?
You should count each tuple as one when you are building classifiers no matter
whether they are duplicate tuples or not.
How many branches should a feature have during tree construction?
Since assignment 4 only deals with categorical features, we don’t do value
grouping nor select splitting point for the feature. We simply regard every
possible value as one branch. Thus the # of branches should be the same as #
of possible values of the feature.
Should parameters be tuned on a validation set?
In principle, model parameters are supposed to be tuned according to the model
performance on the validation set, which is distinct from the training test
and the test set. However, in this homework, we do not hold a validation set
for simplicity, and you would tune the parameters on the test set.