A data analysis assignment, with questions to be answered using R.
Requirement
This question is based on the diabetes dataset (diabetes.arff). This dataset
consists of 768 observations and 9 attributes. Brief descriptions of the
attributes are as follows:
- preg: Number of times the patient has been pregnant
- plas: Plasma glucose concentration
- pres: Diastolic blood pressure (mm Hg)
- skin: Triceps skin fold thickness (mm)
- insu: 2-hour serum insulin (mu U/ml)
- mass: Body mass index (weight in kg / (height in m)^2)
- pedi: Diabetes pedigree function
- age: Age (years)
- class: Class variable (either tested_negative or tested_positive)
- a) Provide the R code for loading the data into a variable Diabetes.
- b) Provide the R code for generating the CSV equivalent of the diabetes dataset (diabetes.csv).
- c) Compare and contrast the ARFF format and the CSV format.
- d) Provide the R code for generating a logistic regression model (model) using class as the response and the other attributes as predictors.
- e) Using the logistic regression results of the model, write down the log-odds equation of the model. Round all coefficient estimates to 4 decimal places.
- f) We learned that logistic regression uses a logistic function: Pr(Y = REFERENCE_CLASS | data), i.e. the probability that class = REFERENCE_CLASS given a data point. It turns out that R uses the first level of a factor-type attribute as the reference class.
- g) Provide the R code for verifying the probability value of f) using the predict() function in R.
- h) To change the reference class in R to tested_positive, you can use the relevel() function. Read its help page and provide the R command that changes the reference class to tested_positive so that predict() will be based on tested_positive.
- i) If you were to generate a new model (model2) using tested_positive as the reference class, how would the regression model of model2 differ from that of model?
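
The workflow behind parts a), b), d), e), g), h) and i) can be sketched in R roughly as follows. This is a minimal illustration under stated assumptions, not a model answer. It assumes the foreign package (bundled with standard R installations) for read.arff(). Because diabetes.arff is not available here, it writes a small simulated stand-in file (toy.arff, with only plas and mass as predictors) to a temporary directory. The simulated values, the file names and the names toy, Diabetes2, p_hand and p_predict are all hypothetical. With the real file, the load step would simply be read.arff("diabetes.arff").

```r
library(foreign)  # provides read.arff()

# Simulated stand-in data (NOT the real diabetes dataset), so the script
# runs without diabetes.arff. Only two predictors are used, for brevity.
set.seed(1)
n <- 200
toy <- data.frame(plas = rnorm(n, 120, 30), mass = rnorm(n, 32, 7))
p <- plogis(-8 + 0.04 * toy$plas + 0.08 * toy$mass)
toy$class <- ifelse(runif(n) < p, "tested_positive", "tested_negative")

# Write the stand-in as a minimal ARFF file: header lines, then CSV-like rows.
arff_path <- file.path(tempdir(), "toy.arff")
csv_path  <- file.path(tempdir(), "toy.csv")
header <- c("@relation toy",
            "@attribute plas numeric",
            "@attribute mass numeric",
            "@attribute class {tested_negative,tested_positive}",
            "@data")
writeLines(c(header, paste(toy$plas, toy$mass, toy$class, sep = ",")), arff_path)

# a) Load the ARFF file; nominal attributes are read as factors.
Diabetes <- read.arff(arff_path)

# b) Write the CSV equivalent (without row names).
write.csv(Diabetes, csv_path, row.names = FALSE)

# d) Logistic regression: class as response, all other columns as predictors.
model <- glm(class ~ ., data = Diabetes, family = binomial)

# e) Coefficients rounded to 4 decimal places, for writing the log-odds equation.
print(round(coef(model), 4))

# g) Probability for the first observation, by hand and via predict().
#    For a factor response, glm() treats the first level as "failure",
#    so this is the fitted probability of the other level.
log_odds  <- sum(coef(model) * model.matrix(model)[1, ])
p_hand    <- plogis(log_odds)  # same as 1 / (1 + exp(-log_odds))
p_predict <- predict(model, newdata = Diabetes[1, ], type = "response")
print(c(by_hand = p_hand, predict = unname(p_predict)))

# h) Make tested_positive the first (reference) level.
Diabetes2 <- Diabetes
Diabetes2$class <- relevel(Diabetes2$class, ref = "tested_positive")

# i) Refit and compare the two sets of coefficients side by side.
model2 <- glm(class ~ ., data = Diabetes2, family = binomial)
print(round(cbind(model = coef(model), model2 = coef(model2)), 4))
```

On the stand-in data, comparing the two coefficient columns and comparing fitted(model) with fitted(model2) is a quick way to see what swapping the reference level changes. The note under "Details" in ?family about how a factor response is interpreted is worth reading before interpreting the predict() output in g) and h).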