Data prediction analysis using Logistic Regression.
![Logistic Regression](https://upload.wikimedia.org/wikipedia/commons/thumb/6/6d/Exam_pass_logistic_curve.jpeg/400px-Exam_pass_logistic_curve.jpeg)
Overview
In this project you will draw on your R skills to predict loan defaults. You
are given two datasets: one training dataset and one prediction dataset. You’ll
follow our process: Stage, Structure, Cleanse, Transform, Explore, and Model.
Your task is to build and document two models for loan approvals. Finally,
you’ll have to score the prediction dataset with your model.
Challenge
You now work for a major financial institution as an R data scientist. You
have been asked to build a model that can be used to hedge the institution’s
default risk. To do this you will predict whether each loan will end up paid
in full (0 - Paid in Full) or in default (1 - Default).
Deliverables for Grading
For this project, you will complete and submit the following.
- Modeling report - The report will detail the data munging/shaping, data understanding, data preparation and modeling phases of this project. In the report template (final-Project_template.docx) provided to you in Module 14, you will basically perform the steps outlined below, document your results and answer any questions that correspond to the Required Tasks (instructions) below. Rename this file to: Final_Project_Student_Name.docx for submission.
- R code and functions - I’ll want the R code and functions you authored to produce the report. Your code should be appropriately commented so I can tell which report question(s) it relates to. Review it for use of authored functions (unless you enjoy doing things the hard way).
- Predictions file - This is the test set containing your loan status predictions for out-of-sample data. The predictions file will contain two columns: 1) the ID and 2) the Predicted Probability. Name this file your_name_predictions_final.csv.
Required Tasks
Executive Summary
Here is a little more truth: sometimes executives will read your analysis, and
you’ll have to explain it in language that they can understand! Your challenge
is to present your findings and results concisely, without writing a book. A
good executive summary leaves the reader with a couple of key takeaways that
they can remember and repeat at the next meeting. Your executive summary
should be just that: a summary. What problem were you challenged with? What
were 3 or 4 key findings (things you found interesting that influenced the
model)? What was the result of your model, and what recommendations (maybe 2
or 3) would you make?
- State the problem
- Key findings 3-4 bullets
- Performance
- Recommendations
Helpful hint: do not attempt to draft an executive summary until after the
report has been written, i.e. this should be the last thing you do.
Stage & Structure
- Read the data sets into R
* a. Make sure the datatypes are correct
* b. Specify your factors
- Document what you did and the resulting structure of the data and data types in your modeling report file (produce a table)
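A minimal sketch of the staging step, assuming a CSV input. The file contents and the column names (`loan_amount`, `term`, `grade`, `default`) are hypothetical stand-ins, not the real project data; the sketch writes a tiny temporary file so it can run on its own.

```r
# Hypothetical example data written to a temporary file so the sketch is
# self-contained; replace csv_path with the path to your real training file.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "ID,loan_amount,term,grade,default",
  "1,5000,36,A,0",
  "2,12000,60,C,1",
  "3,,36,B,0",
  "4,8000,60,A,0"
), csv_path)

# Read strings as character so that factors are declared explicitly below;
# treat empty fields as missing.
loans <- read.csv(csv_path, stringsAsFactors = FALSE, na.strings = c("", "NA"))

# Specify factors explicitly (which columns are factors is an assumption).
loans$grade   <- factor(loans$grade)
loans$term    <- factor(loans$term)
loans$default <- factor(loans$default, levels = c(0, 1),
                        labels = c("Paid in Full", "Default"))

# Table of the resulting structure, suitable for pasting into the report.
structure_tbl <- data.frame(
  variable  = names(loans),
  type      = vapply(loans, function(x) class(x)[1], character(1)),
  row.names = NULL
)
print(structure_tbl)
```

Declaring factors by hand, rather than relying on automatic conversion, makes the choices easy to document in the report.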
Exploratory Data Analysis
- Data profiling: Provide a table of summary statistics for numeric variables (typical stuff like mean, median, nulls, not null, unique etc.) and nominal variables (count, count distinct, null, not null etc.).
- Predictors: Make a table with the top 5-10 variables that are likely to be useful predictors. Support your findings with graphics (for example, histograms and scatter plots), statistics, and correlations.
- Initial screening: Which variables can you ignore and why? Document your decisions in your modeling report file.
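The profiling tables described above could be produced along these lines. This is a sketch in base R on made-up data; the helper names `profile_numeric` and `profile_nominal` and all column names are hypothetical.

```r
# Hypothetical data; column names are illustrative only.
loans <- data.frame(
  loan_amount = c(5000, 12000, NA, 8000, 8000),
  income      = c(40000, 52000, 61000, NA, 45000),
  grade       = factor(c("A", "C", "B", "A", NA)),
  default     = c(0, 1, 0, 0, 1)
)

# Summary statistics for numeric variables.
profile_numeric <- function(df) {
  num <- df[vapply(df, is.numeric, logical(1))]
  data.frame(
    variable  = names(num),
    mean      = vapply(num, mean, numeric(1), na.rm = TRUE),
    median    = vapply(num, median, numeric(1), na.rm = TRUE),
    nulls     = vapply(num, function(x) sum(is.na(x)), integer(1)),
    not_null  = vapply(num, function(x) sum(!is.na(x)), integer(1)),
    unique    = vapply(num, function(x) length(unique(x[!is.na(x)])), integer(1)),
    row.names = NULL
  )
}

# Summary statistics for nominal (factor or character) variables.
profile_nominal <- function(df) {
  nom <- df[vapply(df, function(x) is.factor(x) || is.character(x), logical(1))]
  data.frame(
    variable       = names(nom),
    count          = vapply(nom, length, integer(1)),
    count_distinct = vapply(nom, function(x) length(unique(x[!is.na(x)])), integer(1)),
    nulls          = vapply(nom, function(x) sum(is.na(x)), integer(1)),
    not_null       = vapply(nom, function(x) sum(!is.na(x)), integer(1)),
    row.names      = NULL
  )
}

num_tbl <- profile_numeric(loans)
nom_tbl <- profile_nominal(loans)
print(num_tbl)
print(nom_tbl)

# Correlation of each numeric candidate predictor with the 0/1 target,
# using only rows where both values are present.
cors <- vapply(loans[c("loan_amount", "income")],
               function(x) cor(x, loans$default, use = "complete.obs"),
               numeric(1))
print(cors)
```

Base graphics such as `hist(loans$loan_amount)` or `boxplot(loan_amount ~ default, data = loans)` can supply the supporting charts.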
Data Preparation
- Derive variables: derive useful new variables from the data set, then document and support your reasoning with graphics and statistics.
- Deal with missing values and extremes. There are a number of missing values in the data. What strategies are you going to use to deal with them? Can you replace them with 0, the mean, or the median, or could you predict what the missing values are likely to be? Document and support the reasoning behind any exclusion.
- Transform categorical variables, known in R land as factors. Identify a handful of useful categorical variables and apply one-hot encoding. Document what you did and describe the impact vs. the target.
- Cluster the data using k-means. Follow our recipe. What variables did you use, what value of K did you choose, and why? Provide supporting evidence for your choice of K with charts and tables. Finally, you should be able to name and describe each cluster in plain English.
- Partition training data into 70/30 (training, validation) split - if you were clever you appended the “prediction” dataset to the training file at the beginning, so you don’t have to repeat all those steps again. Document the number of records per partition.
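The preparation steps above can be sketched in base R as follows. Everything here runs on simulated data: the column names, the derived ratio, median imputation, and K = 3 are assumptions chosen for illustration, not the course recipe.

```r
set.seed(42)  # reproducible simulated data, clustering, and partition

# Simulated stand-in for the loan data; all column names are hypothetical.
n <- 200
loans <- data.frame(
  ID          = seq_len(n),
  loan_amount = round(runif(n, 1000, 30000)),
  income      = round(runif(n, 20000, 120000)),
  grade       = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)
loans$income[sample(n, 10)] <- NA  # inject some missing values

# 1. Missing values: keep a flag, then impute with the median.
loans$income_missing <- as.integer(is.na(loans$income))
loans$income[is.na(loans$income)] <- median(loans$income, na.rm = TRUE)

# 2. Derived variable: loan size relative to income.
loans$loan_to_income <- loans$loan_amount / loans$income

# 3. One-hot encoding: "- 1" drops the intercept so every level gets a column.
grade_dummies <- model.matrix(~ grade - 1, data = loans)
loans <- cbind(loans, grade_dummies)

# 4. k-means on scaled numeric inputs; compare total within-cluster sum of
#    squares across candidate values of K to look for an "elbow".
km_input <- scale(loans[c("loan_amount", "income", "loan_to_income")])
wss <- vapply(1:6, function(k) {
  kmeans(km_input, centers = k, nstart = 10)$tot.withinss
}, numeric(1))
print(data.frame(K = 1:6, total_within_ss = round(wss, 1)))

km <- kmeans(km_input, centers = 3, nstart = 10)  # K = 3 is a placeholder
loans$cluster <- factor(km$cluster)
# Cluster means help with naming each cluster in plain English.
print(aggregate(cbind(loan_amount, income) ~ cluster, data = loans, FUN = mean))

# 5. 70/30 training/validation partition.
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- loans[train_idx, ]
valid <- loans[-train_idx, ]
print(c(training = nrow(train), validation = nrow(valid)))
```

Calling `plot(1:6, wss, type = "b")` turns the within-cluster table into the usual elbow chart for the report.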
Model Building
Data Munging, Exploratory Analysis and Feature Engineering are fun and
rewarding (and where we spent the majority of our time). Why focus so much on
that? It’s those activities that separate good data scientists from great ones.
However, now is the time to move on to the modeling stage by building
classification models. We’ll use AUC (area under the curve) to assess our
models. What curve is that? The ROC curve! The ROC curve is a graphical plot
that illustrates the diagnostic ability of a binary classifier by plotting the
True Positive rate against the False Positive rate as the classification
threshold varies. In simple terms, an AUC of 0.5 means your model does no
better than random guessing, and an AUC below 0.5 means it does worse, so
you’ll want an AUC above 0.5. The higher the AUC, the better.
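To make the ROC and AUC definitions concrete, here is a small base R sketch. The helper names `roc_points` and `auc` are hypothetical, and the AUC uses the rank-based (Mann-Whitney) formulation rather than any particular package.

```r
# ROC points: True Positive rate and False Positive rate at each threshold.
roc_points <- function(actual, score) {
  thresholds <- sort(unique(score), decreasing = TRUE)
  tpr <- vapply(thresholds, function(t) mean(score[actual == 1] >= t), numeric(1))
  fpr <- vapply(thresholds, function(t) mean(score[actual == 0] >= t), numeric(1))
  data.frame(threshold = thresholds, fpr = fpr, tpr = tpr)
}

# AUC as the probability that a randomly chosen positive is scored higher
# than a randomly chosen negative (ties count as one half).
auc <- function(actual, score) {
  pos   <- actual == 1
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  r     <- rank(score)  # ties receive average ranks
  (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Tiny worked example: 2 paid loans (0) and 2 defaults (1).
actual <- c(0, 0, 1, 1)
score  <- c(0.1, 0.4, 0.35, 0.8)
print(roc_points(actual, score))
print(auc(actual, score))  # 0.75
```

In the example, three of the four (default, paid) pairs are ranked correctly, which is where the 0.75 comes from.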
Modeling Tasks
When the primary purpose of your model is prediction accuracy, you’ll want to
present all of the predictors you’ve created to a variable selection scheme
(backward, forward, or stepwise) and let it build the model. Here are the steps:
- Model 1 - Logistic Regression
* Select 5 input variables and train your model
* Document the result of your logistic model on the train and test partitions
* At a minimum, report variable importance, a confusion matrix, and accuracy
- Model 2 - Logistic Regression with Variable Selection
* Document the result of your logistic model on the train and test partitions
* At a minimum, report variable importance, a confusion matrix, and accuracy
- Generate the predictions file on the “prediction” partition. Create a comma-separated file with two variables, ID and TARGET_DEFFAULT.
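The two models and the predictions file could be sketched as below, using only base R (`glm`, `step`). The data are simulated, the predictor names `x1` to `x6` are hypothetical, the 0.5 cutoff and the use of absolute z-statistics as a rough importance measure are assumptions, and the test partition stands in for the real “prediction” partition.

```r
set.seed(1)

# Simulated stand-in data; variable names and effect sizes are hypothetical.
n <- 500
loans <- data.frame(
  ID = seq_len(n),
  x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n),
  x4 = rnorm(n), x5 = rnorm(n), x6 = rnorm(n)
)
loans$default <- rbinom(n, 1, plogis(-1 + 1.2 * loans$x1 - 0.8 * loans$x2))

idx   <- sample(seq_len(n), size = round(0.7 * n))
train <- loans[idx, ]
test  <- loans[-idx, ]

# Model 1: logistic regression on five hand-picked inputs.
m1 <- glm(default ~ x1 + x2 + x3 + x4 + x5, data = train, family = binomial)

# Model 2: backward stepwise selection (by AIC) starting from all predictors.
full <- glm(default ~ . - ID, data = train, family = binomial)
m2   <- step(full, direction = "backward", trace = 0)

# Confusion matrix and accuracy at a given cutoff (0.5 is an assumption).
evaluate <- function(model, data, cutoff = 0.5) {
  prob <- predict(model, newdata = data, type = "response")
  pred <- factor(as.integer(prob >= cutoff), levels = c(0, 1))
  cm   <- table(actual = factor(data$default, levels = c(0, 1)), predicted = pred)
  list(confusion = cm, accuracy = sum(diag(cm)) / sum(cm))
}

for (m in list(m1, m2)) {
  print(formula(m))
  print(evaluate(m, train)$accuracy)
  print(evaluate(m, test))
}

# Rough variable importance: absolute z-statistics, intercept excluded.
importance <- sort(abs(coef(summary(m2))[-1, "z value"]), decreasing = TRUE)
print(importance)

# Predictions file: ID plus predicted probability of default. The column
# name is spelled as in the brief; `test` stands in for the prediction set.
out <- data.frame(
  ID              = test$ID,
  TARGET_DEFFAULT = predict(m2, newdata = test, type = "response")
)
pred_path <- file.path(tempdir(), "your_name_predictions_final.csv")
write.csv(out, pred_path, row.names = FALSE)
```

Wrapping the evaluation in one function keeps the train and test reporting identical for both models, which makes the comparison in the report easier to defend.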