R programming assignment: perform linear algebra operations, statistical analysis, and related tasks on the specified data.
Part 1
Introduce yourself by posting to the Class Introductions forum on D2L.
Include a bit of information about yourself, including some of the following.
Note, this
- a. The college you are in and the degree you are pursuing
- b. Your work background, especially as it relates to data analysis. What brings you to this class, and what is your interest in multivariate analysis?
- c. What kind of data you are interested in pursuing (this is useful for forming groups in the first two weeks).
The following three problems are due by the second lecture. Be prepared to ask
questions about the math, if you have them, at the beginning of the second
lecture.
Part 2
Perform, by hand, the following linear algebra calculations for the following
matrices and vectors. Submit a scanned copy of your answers (no cell phone
photos).
Part 3
In R, write a script to compute each of the parts in problem 1 to check your
answers. Submit both the .r file and the output. Then, create a dataset with
x = <5, -3, 2, 4> and y = <2, 1, -1, 3> and run a regression analysis on the
data. Compare your value from part j of the last problem with the coefficients
calculated by R’s lm function.
A few commands in R will help:
- a. as.matrix(vector or data.frame) to convert data to matrices
- b. M = matrix(c(entries by column), nrow = [#rows], ncol = [#col])
- c. v = c(entries)
- d. t(M) for transpose, det(M) for determinant
- e. ginv(M) for the (generalized) inverse; note that you will need the MASS package loaded for this
- f. %*% for matrix multiplication
- g. fit = lm(y ~ x, data = dataset)
- h. summary(fit)
- i. for the dot product, see the lecture on how you can do it with matrix multiplication
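As a rough illustration of how the commands above fit together, the sketch below builds the x and y vectors from this part, solves the normal equations with matrix operations, and compares the result with lm. It uses base R's solve() for the inverse instead of ginv() so that no extra package is needed. The names X, beta_manual, dataset, and dot_xy are illustrative assumptions, not part of the assignment.

```r
# Data from Part 3
x <- c(5, -3, 2, 4)
y <- c(2, 1, -1, 3)

# Design matrix: a column of ones (intercept) next to x
X <- cbind(intercept = 1, x = x)

# Normal equations: beta = (X'X)^{-1} X'y
# solve() is base R; MASS::ginv(t(X) %*% X) could be used instead
beta_manual <- solve(t(X) %*% X) %*% t(X) %*% y

# Same fit with lm for comparison
dataset <- data.frame(x = x, y = y)
fit <- lm(y ~ x, data = dataset)

print(beta_manual)
print(coef(fit))

# Dot product via matrix multiplication: t(x) %*% y is a 1x1 matrix
dot_xy <- as.numeric(t(x) %*% y)
print(dot_xy)
```

The two sets of coefficients should agree up to floating-point rounding; summary(fit) adds standard errors and significance tests on top of the point estimates.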
Part 4
Use the dataset “mtcars”, which is built into R. You can see the structure
of the data with the command “head(mtcars)”. Perform the following operations:
- a. Create a copy of the dataset called A with only the columns {cyl, disp, hp, wt, carb}. Use the column selection mechanism we covered in class to select these columns from the dataset.
- b. Add a column of ones to A called “count”.
- c. Use the “as.matrix” function to convert it to a matrix and assign it back to the variable A (so you are overwriting the data.frame here and converting it to a matrix)
- d. Compute the following multiple regression by manually computing the matrix operations.
- e. Compute the regression with R’s “lm” function and compare with your results from d). Note any differences.
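Steps a through e could be sketched roughly as follows. The regression formula for part d is not reproduced in this text, so the sketch assumes, purely for illustration, that the response is mpg and that the "count" column of ones plays the role of the intercept. The names y, beta_manual, and fit are likewise illustrative.

```r
# a. Keep only the five requested columns (selection by column name)
A <- mtcars[, c("cyl", "disp", "hp", "wt", "carb")]

# b. Add a column of ones called "count"
A$count <- 1

# c. Overwrite the data.frame with its matrix form
A <- as.matrix(A)

# ASSUMPTION: the response is mpg. The assignment's formula for part d
# is not shown in this text, so this choice is only for illustration.
y <- mtcars$mpg

# d. Least-squares coefficients from the normal equations: (A'A)^{-1} A'y
beta_manual <- solve(t(A) %*% A) %*% t(A) %*% y

# e. Same model with lm; its intercept corresponds to the "count" column
fit <- lm(mpg ~ cyl + disp + hp + wt + carb, data = mtcars)

print(round(beta_manual, 4))
print(round(coef(fit), 4))
```

Note that lm lists the intercept first, while the manual result lists the "count" coefficient last, so the two outputs need to be matched by name, not by position.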
The following problems may be completed with any statistical software you
wish. Make sure that the output is clearly indicated and explained in your
answer to the problem.
Part 5
Every four years, many of the world’s greatest athletes gather to participate
in the Summer Olympics. In addition to individual (or team) prowess, the
Olympics is also a highly-watched pageant of national pride and competition.
The data set (Olympics.xls under the course documents for homework 1) for this
problem concerns the performance of various countries in the 2012 London
Summer Olympics. For each included country, the data contains medal counts,
number of athletes (by gender), national population figures, and national GDP
(gross domestic product).
It is your job to distill an interesting story or insight from this data,
suitable for presentation to the general public. You must choose the message
you would like to communicate. It will take some investigation for you to find
that message. Is there an important trend or lesson that you would like the
public to understand? For example, are there ways to evaluate a country’s
“performance” beyond raw medal counts, and if so, do any surprises emerge? Is
there any relationship between success in the Olympic Games and the wealth of
the people in a country? How well or poorly does a country perform compared to
its peers?
You may try different multiple regressions and plots, and you can compare these
results to automatic variable selection methods. Be very thorough. In your
write-up, be sure to include the graph(s) and analyses you are using to see
the relationships and clearly indicate the intended message of your graphs and
analyses.
Part 6
In a study of genetic variation in sugar maple, seeds were collected from
native trees in the eastern United States and Canada and planted in a nursery
in Wooster, Ohio. The time of leafing out of these seedlings can be related to
the latitude and mean July temperature of the place of origin of the seed. The
variables are X1 = latitude, X2 = July mean temperature, and Y = weighted mean
index of leafing out time. (Y is a measure of the degree to which the leafing
out process has occurred. A high value is indicative that the leafing out
process is well advanced.) The data is in the file maple.txt on the course web
page under the documents for week 2.
- a. Find the regression of LeafIndex on Latitude. Is latitude a useful predictor of leaf index?
- b. Repeat part (a) for the regression of LeafIndex on JulyTemp.
- c. Find the regression of LeafIndex on Latitude and JulyTemp. Compare the results of this analysis with your results from (a) and (b). How different are the slope coefficients in each case? What best explains the differences in their values?
- d. What statistical measure(s) can you use to detect and quantify this issue? What are the value(s) of these measures for this regression analysis?
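The comparison in parts (a) through (c) amounts to fitting two simple regressions and one multiple regression and lining up the slopes. A minimal sketch of that workflow is shown below. The helper name compare_slopes is hypothetical, and because maple.txt is not available here, the runnable example uses the built-in mtcars data as a stand-in. The commented lines assume the course file has a header row with the column names used in the problem.

```r
# Hypothetical helper: slopes of one response on two predictors,
# fitted separately (simple) and jointly (multiple)
compare_slopes <- function(data, response, p1, p2) {
  fit1  <- lm(reformulate(p1, response), data = data)
  fit2  <- lm(reformulate(p2, response), data = data)
  fit12 <- lm(reformulate(c(p1, p2), response), data = data)
  rbind(
    simple   = c(coef(fit1)[p1], coef(fit2)[p2]),
    multiple = coef(fit12)[c(p1, p2)]
  )
}

# Stand-in data: built-in mtcars, NOT the maple data
slopes <- compare_slopes(mtcars, "mpg", "disp", "wt")
print(slopes)

# With the course file (assuming it has a header row):
# maple <- read.table("maple.txt", header = TRUE)
# compare_slopes(maple, "LeafIndex", "Latitude", "JulyTemp")
```

Each column of the result shows how one predictor's slope changes once the other predictor is included in the model.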
Part 7
The data in the file chicinsur.txt are collected from 47 zip-code areas in the
Illinois area. There are 8 columns in the data file but not all are relevant
here. The response variable of interest is the number of new home insurance
policies (NEWPOL) (minus canceled policies) per 100 housing units. The
predictor variables are the percent minority population living in the area
(PCTMINOR), the number of fires per 1000 housing units (FIRES), the number of
thefts per 1000 in population (THEFTS), the percent of housing units built
before 1940 (PCTOLD), and the median income (INCOME). We are interested in
which predictors are significant predictors of insurance policies issued.
- a. Before running any regressions, predict the expected sign of the coefficient of each predictor. Obtain the correlation matrix for the variables PCTMINOR, FIRES, THEFTS, PCTOLD, INCOME, and NEWPOL. Do the simple correlations support your predictions about the signs?
- b. Run a multiple regression of NEWPOL on the variables listed above.
- i. Comment on the overall significance of the regression fit.
- ii. Which predictors have coefficients that are significantly different from zero at the .05 level?
- iii. Do any of the predictors have signs that are different than suggested by their simple correlations? If so, explain what may be happening. If not, explain how such a thing can happen.
- iv. Examine a plot of residuals versus predicted values. Do you see any problems?
Part 8
The Housing dataset (under the course documents for week 3) contains housing
values in the suburbs of Boston. The detailed explanation concerning the input
and output variables can be fetched from the UCI machine learning repository.
(Note that in R, you can load this file with simply
“read.table("housing.dat")”. If you try to specify a separator, R will get
confused by the multiple spaces between fields.)
- CRIM: per capita crime rate by town
- ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS: proportion of non-retail business acres per town
- CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX: nitric oxides concentration (parts per 10 million)
- RM: average number of rooms per dwelling
- AGE: proportion of owner-occupied units built prior to 1940
- DIS: weighted distances to five Boston employment centers
- RAD: index of accessibility to radial highways
- TAX: full-value property-tax rate per $10,000
- PTRATIO: pupil-teacher ratio by town
- B: 1000(Bk - 0.63)^2 where Bk is the proportion of African Americans by town
- LSTAT: % lower status of the population
- MEDV: Median value of owner-occupied homes in $1000’s (output variable)
- a. Fit a linear regression model of CRIM based on the other variables and report goodness of fit, the utility of the model, the estimated coefficients, their standard errors, and statistical significance. Interpret your results.
- b. Perform a feature selection on this data by using the forward selection method of the regression analysis. Analyze the output in terms of the order in which the variables are included in the regression model.
- c. Compare the model selected by forward selection to backward selection.
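For those working in R, one possible way to run forward and backward selection is the step() function from the stats package. The sketch below is only an illustration under stated assumptions: it uses the built-in mtcars data with mpg as a stand-in response, because housing.dat is not available here, and step() selects by AIC, which may differ from the criterion used in lecture. The names null_fit, full_fit, forward_fit, and backward_fit are illustrative.

```r
# Stand-in data: built-in mtcars with mpg as response, NOT the housing file
null_fit <- lm(mpg ~ 1, data = mtcars)  # intercept only
full_fit <- lm(mpg ~ ., data = mtcars)  # all other columns as predictors

# Forward selection: start from the null model, add one term at a time
forward_fit <- step(null_fit,
                    scope = list(lower = ~ 1, upper = formula(full_fit)),
                    direction = "forward", trace = 0)

# Backward selection: start from the full model, drop one term at a time
backward_fit <- step(full_fit, direction = "backward", trace = 0)

# The anova component records the order in which terms were added or dropped
print(forward_fit$anova)
print(formula(forward_fit))
print(formula(backward_fit))
```

Setting trace = 1 instead prints every candidate model at each step, which can help when describing the order in which variables enter the model.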
Part 9
Post a comment to the “Lecture 1 & 2” discussion forum regarding a topic from
lectures 1 & 2.
In your post, you may address topics that you found most interesting, topics
that you would like to hear more about, or topics that you found confusing and
would like clarified. Please also take the time to respond to
your classmates’ questions and comments (respectfully, of course).