Data analysis and machine learning with R.
Instructions
This take-home is cumulative, covering all material presented in class and on
homework. All work must be submitted electronically. Use of computational
tools (e.g., R) is encouraged; when you use them, code inputs and outputs must
be shown in-line (not as an appendix) and be accompanied by plain English that
briefly explains what the code is doing. Extra credit, augmenting your score
by at most 10%, is available for (neatly formatted) solutions authored in
R Markdown and submitted as a working .Rmd file.
- Students must limit themselves to methods and libraries discussed in class. For full credit all steps must be shown.
- Most importantly, all work must be your own. In this exam setting, communicating with others about the problems or solutions is not allowed, and doing so will be considered a breach of the honor code. All questions of clarification, etc., must be directed to Prof. Gramacy.
- The problems below are deliberately open-ended. In some cases there may be several “correct” answers. The questions are similar in scope to homework questions, but there is no prompting about what to do or in what order. Explaining why is as important as what you’re doing; you are being tested on your instincts, on your ability to be thorough (without being pedantic), and on your execution (ability to get the job done). Include the necessary plots, tests, diagnostics, and model probabilities to illustrate and support your conclusions. Presentation matters, and longer is not better.
- Be careful: any “data mining” you do may be computationally intensive. Don’t leave these problems to the last minute: allow your computer some time to work on your behalf while you are doing something else.
Problem 1: Electricity Demand
The file elec.csv contains data on the rate, measured in megawatts (MW), of
electricity delivered to Gulf Energy customers in Alabama. Also provided are
the average daily temperature readings (temp in degrees Fahrenheit) in that
market for each of the 364 days of the study, which started on January 1, a
Sunday. To operate effectively, power companies must be able to predict daily
peak demand for electricity.
Your task is to provide a fitted model for forecasting daily electricity
demand. Beyond describing your modeling enterprise, you might consider the
following, along with anything else you deem relevant.
- Comment on the accuracy of your forecaster with particular focus on peak demand.
- What does your fitted model forecast for the last day of the year, i.e., for the day following the last day in the data? Be sure to include uncertainty estimates.
- Consider extending your method to provide forecasts (and uncertainties) for each day of the first full week of the following year.
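As a rough illustration of what a forecast with uncertainty estimates can look like, the sketch below fits a linear model and asks for a prediction interval for day 365. Everything here is an assumption made for illustration: the data are simulated rather than read from elec.csv, the column names `demand`, `dow`, and `day` are hypothetical, the quadratic temperature term and day-of-week effect are just one plausible model, and the temperature plugged in for the forecast day is a placeholder.

```r
# Synthetic stand-in for elec.csv. The column names `demand`, `dow`, `day`
# and every number below are illustrative assumptions, not the real data.
set.seed(1)
n <- 364
day <- 1:n
temp <- 65 + 20 * sin(2 * pi * (day - 105) / 365) + rnorm(n, sd = 4)

# Day 1 is a Sunday, so weekday labels cycle starting from "Sun"
dow_levels <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
dow <- factor(dow_levels[(day - 1) %% 7 + 1], levels = dow_levels)
weekend <- dow %in% c("Sat", "Sun")
demand <- 2500 + 1.5 * (temp - 65)^2 - 150 * weekend + rnorm(n, sd = 80)
elec <- data.frame(day, dow, temp, demand)

# Quadratic in temperature (heating and cooling can both raise demand)
# plus a day-of-week effect
fit <- lm(demand ~ temp + I(temp^2) + dow, data = elec)

# 364 days = 52 full weeks, so day 365 is again a Sunday.
# Its temperature is unknown; 50 is a placeholder value (assumption).
new_day <- data.frame(temp = 50, dow = factor("Sun", levels = dow_levels))

# 95% prediction interval: columns are fit, lwr, upr
pred <- predict(fit, newdata = new_day, interval = "prediction", level = 0.95)
print(pred)
```

Note that this interval treats the plugged-in temperature as known and the errors as independent. A fuller analysis would likely need to account for uncertainty in the future temperature and check the residuals for serial correlation.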
Problem 2: Racial profiling?
This question investigates whether or not there is a systematic racial bias in
who is stopped by Washington State Police (WSP) officers. The data (in
wspts.csv) consists of the number of both officer-initiated traffic stops
(e.g., without a radar trap or a crash) and radar-initiated traffic stops
recorded for each of six racial groups between November 1, 2005 and September
30, 2006 for 34 autonomous patrol areas (APAs). So, for example, the first
observation in the data file tells us that in APA 2, 11445 white people were
stopped at the discretion of a WSP officer and 2531 white people were stopped
due to indications from radar. You may find it helpful to note that APAs are
roughly ordered by distance from Seattle.
Your task is to build and fit an appropriate model (or models) in order to
provide evidence for or against racial bias by relating officer-initiated
traffic stops to radar-initiated traffic stops. Use radar-initiated traffic
stops as a benchmark for the population that is at risk of being stopped by
the WSP. Drivers stopped by radar are selected from passing motorists based
upon driving characteristics, and there is very little chance of racial bias.
If members of
a particular race are actively stopped (at the discretion of a WSP officer) at
a different rate than predicted by this benchmark, we have evidence of racial
bias.
Note: Many researchers suggest that a difference between the racial
distribution of persons stopped by police and the racial distribution of the
population at risk of being stopped would constitute evidence of racial
profiling. This implicit definition reveals the key empirical problem in
testing for racial profiling: measuring the risk set, or the benchmark racial
distribution, against which to compare the racial distribution of traffic
stops by officers.
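One way to use radar-initiated stops as the benchmark is a Poisson regression for officer-initiated counts with the log of the radar count as an offset, so that the race coefficients measure departures from the benchmark rate. The sketch below is only a starting point under stated assumptions: the data are simulated under "no bias" rather than read from wspts.csv, the column names `apa`, `race`, `officer`, and `radar` are hypothetical, and the racial groups are given generic labels.

```r
# Synthetic stand-in for wspts.csv. Column names and all counts are
# illustrative assumptions, not the real data.
set.seed(2)
wsp <- expand.grid(apa = factor(1:34), race = factor(paste0("group", 1:6)))
wsp$radar <- rpois(nrow(wsp), lambda = 200)
# Simulated under "no bias": officer stops are proportional to radar stops
wsp$officer <- rpois(nrow(wsp), lambda = 4 * wsp$radar)

# Null model: officer stops scale with the radar benchmark within each APA
fit0 <- glm(officer ~ apa + offset(log(radar)),
            family = poisson, data = wsp)
# Alternative: the officer-to-radar ratio also differs by race
fit1 <- glm(officer ~ apa + race + offset(log(radar)),
            family = poisson, data = wsp)

# Likelihood-ratio test for the race terms
lrt <- anova(fit0, fit1, test = "Chisq")
print(lrt)

# Multiplicative race effects on the stop ratio, relative to the baseline group
race_ratio <- exp(coef(fit1)[grep("^race", names(coef(fit1)))])
print(race_ratio)
```

With real counts, a zero radar count would make `log(radar)` undefined and would need separate handling, and count data of this kind are often overdispersed, so the residual deviance is worth checking before trusting the test.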
Problem 3: Spam
The data for this question describe attributes of e-mails collected at HP
Labs in Palo Alto, CA, in the late 1990s. The file spam.csv comprises 57
attributes of 4601 e-mails, plus a human-assigned label (spam: 1 if the e-mail
is spam, or 0 if not). The attributes are:
- 48 of them are word frequencies (named w_word, where word is a particular word). Some of these “words” are numbers, including common telephone prefixes.
- 6 of them are character frequencies (named c_char, where char is one of “;”, “(”, “[”, “!”, “$”, or “#”).
- The final three (using the prefix caps_) give the length of the average, longest, and total strings of capital letters in the e-mail.
Your task is to build a spam filter, i.e., to build a predictor for detecting
which new e-mails are spam and which are not. In addition to describing your
modeling enterprise, and commenting on out-of-sample accuracies and anything
else you deem relevant, you might consider the following.
- What is the accuracy of your spam filter out-of-sample?
- Try (linear) models both with and without interaction terms. Remember the response is binary.
- Try nonlinear methods. Again, remember the response is binary.
- For more information see the UCI page. In particular, note that “false positives (marking good mail as spam)” may be undesirable.
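To make "out-of-sample accuracy" and "false positives" concrete, the sketch below holds out part of the data, fits a logistic regression on the rest, and reports a confusion matrix for the held-out e-mails. It is a minimal illustration under stated assumptions: the data are simulated rather than read from spam.csv, the predictor names `w_free`, `c_bang`, and `caps_total` are hypothetical names that merely follow the naming pattern described above, and the 0.5 threshold is an arbitrary default.

```r
# Synthetic stand-in for spam.csv. Predictor names and all values are
# illustrative assumptions, not the real data.
set.seed(3)
n <- 1000
spam <- rbinom(n, 1, 0.4)
emails <- data.frame(
  spam = spam,
  w_free = rexp(n, rate = ifelse(spam == 1, 1, 5)),
  c_bang = rexp(n, rate = ifelse(spam == 1, 2, 6)),
  caps_total = round(rexp(n, rate = ifelse(spam == 1, 1 / 300, 1 / 150)))
)

# Hold out a random quarter of the e-mails for out-of-sample evaluation
test_idx <- sample(n, size = n / 4)
train <- emails[-test_idx, ]
test <- emails[test_idx, ]

# Logistic regression: linear in the predictors on the log-odds scale
fit <- glm(spam ~ w_free + c_bang + log(caps_total + 1),
           family = binomial, data = train)

# Classify as spam when the predicted probability exceeds the threshold.
# Raising the threshold trades missed spam for fewer false positives.
p_hat <- predict(fit, newdata = test, type = "response")
threshold <- 0.5
pred <- as.integer(p_hat > threshold)

confusion <- table(truth = test$spam, predicted = pred)
accuracy <- mean(pred == test$spam)
# Share of good mail wrongly marked as spam
false_pos_rate <- mean(pred[test$spam == 0] == 1)
print(confusion)
print(c(accuracy = accuracy, false_pos_rate = false_pos_rate))
```

A single random split gives a noisy estimate; repeating the split, or comparing several thresholds on the held-out set, would give a better picture of the trade-off between overall accuracy and false positives.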