使用Yelp的数据集,完成不同类型的协作过滤推荐系统,并且熟悉 Locality Sensitive Hashing (LSH)
算法。
![Locality Sensitive Hashing (LSH)](https://ars.els-
cdn.com/content/image/1-s2.0-B9780128001615000037-f03-06-9780128001615.jpg)
Overview of the Assignment
In Assignment 3, you will complete two tasks. The goal is to familiarize you
with Locality Sensitive Hashing (LSH), and different types of collaborative-
filtering recommendation systems. The dataset you are going to use is a subset
from the Yelp dataset used in the previous assignments.
Assignment Requirements
Programming Language and Library Requirements
- a. You must use Python to implement all tasks. You can only use standard python libraries (i.e., external libraries like numpy or pandas are not allowed). There will be a 10% bonus for each task (or case) if you also submit a Scala implementation and both your Python and Scala implementations are correct.
- b. You are required to only use the Spark RDD to understand Spark operations. You will not receive any points if you use Spark DataFrame or DataSet.
Programming Environment
Python 3.6, JDK 1.8, Scala 2.12, and Spark 3.1.2
We will use these library versions to compile and test your code. There will
be no point if we cannot run your code on Vocareum. On Vocareum, you can callspark-submit
located at /opt/spark/spark-3.1.2-binhadoop3.2/bin/spark-
submit . (*Do not use the one at
/home/local/spark/latest/bin/spark-submit
(2.4.4))
Write your own code
Do not share your code with other students!!
We will combine all the code we can find from the Web (e.g., GitHub) as well
as other students’ code from this and other (previous) sections for plagiarism
detection. We will report all the detected plagiarism.
Yelp Data
In this assignment, the datasets you are going to use.
We generated the following two datasets from the original Yelp review dataset
with some filters. We randomly took 60% of the data as the training dataset,
20% of the data as the validation dataset, and 20% of the data as the testing
dataset.
- a. yelp_train.csv: the training data, which only include the columns: user_id, business_id, and stars.
- b. yelp_val.csv: the validation data, which are in the same format as training data.
- c. We are not sharing the test dataset.
- d. other datasets: providing additional information (like the average star or location of a business)
Tasks
Note: This Assignment has been divided into 2 parts on Vocareum. This has been
done to provide more computational resources.
Task 1: Jaccard based LSH
In this task, you will implement the Locality Sensitive Hashing algorithm with
Jaccard similarity using yelp_train.csv.
In this task, we focus on the “0 or 1” ratings rather than the actual
ratings/stars from the users. Specifically, if a user has rated a business,
the user’s contribution in the characteristic matrix is 1. If the user hasn’t
rated the business, the contribution is 0. You need to identify similar
businesses whose similarity >= 0.5
.
You can define any collection of hash functions that you think would result in
a consistent permutation of the row entries of the characteristic matrix. Some
potential hash functions are:
f(x)= (ax + b) % m or f(x) = ((ax + b) % p) % m
where p is any prime number and m is the number of bins. Please carefully
design your hash functions.
After you have defined all the hashing functions, you will build the signature
matrix. Then you will divide the matrix into b bands with r rows each, where b x r = n
(n is the number of hash functions). You should carefully select a
good combination of b and r in your implementation ( b > 1
and r > 1
). Remember that two items are a candidate pair if their signatures are
identical in at least one band.
Your final results will be the candidate pairs whose original Jaccard
similarity is >= 0.5
. You need to write the final results into a CSV file
according to the output format below.
Input format: (we will use the following command to execute your code)
Python: spark-submit task1.py
Scala: spark-submit –class task1 hw3.jar
Param: input_file_name: the name of the input file (yelp_train.csv), including
the file path.
Param: output_file_name: the name of the output CSV file, including the file
path.
Output format:
IMPORTANT: Please strictly follow the output format since your code will be
graded automatically. We will not regrade because of formatting issues.
- a. The output file is a CSV file, containing all business pairs you have found. The header is “business_id_1, business_id_2, similarity”. Each pair itself must be in the alphabetical order. The entire file also needs to be in the alphabetical order. There is no requirement for the number of decimals for the similarity value.
Grading
We will compare your output file against the ground truth file using precision
and recall metrics.
Precision = true positives / (true positives + false positives)
Recall = true positives / (true positives + false negatives)
The ground truth file has been provided in the Google drive, named as
“pure_jaccard_similarity.csv”. You can use this file to compare your results
to the ground truth as well.
The ground truth dataset only contains the business pairs (from the
yelp_train.csv) whose Jaccard similarity >=0.5
. The business pair itself
is sorted in the alphabetical order, so each pair only appears once in the
file (i.e., if pair (a, b) is in the dataset, (b, a) will not be there).
In order to get full credit for this task you should have precision >= 0.99
and recall >= 0.97
. If not, then you will get only partial credit based
on the formula:
(Precision / 0.99) * 0.4 + (Recall / 0.97) * 0.4
Your runtime should be less than 100 seconds. If your runtime is more than or
equal to 100 seconds, you will not receive any point for this task.
Task2: Recommendation System
In task 2, you are going to build different types of recommendation systems
using the yelp_train.csv to predict the ratings/stars for given user ids and
business ids. You can make any improvement to your recommendation system in
terms of speed and accuracy. You can use the validation dataset (yelp_val.csv)
to evaluate the accuracy of your recommendation systems, but please don’t
include it as your training data.
There are two options to evaluate your recommendation systems. You can compare
your results to the corresponding ground truth and compute the absolute
differences. You can divide the absolute differences into 5 levels and count
the number for each level as following:
>=0 and <1: 12345
>=1 and <2: 123
>=2 and <3: 1234
>=3 and <4: 1234
>=4: 12
This means that there are 12345 predictions with < 1
difference from the
ground truth. This way you will be able to know the error distribution of your
predictions and to improve the performance of your recommendation systems.
Additionally, you can compute the RMSE (Root Mean Squared Error) by using
following formula.
Where Predi is the prediction for business i and Ratei is the true rating for
business i. n is the total number of the business you are predicting.
In this task, you are required to implement:
- Case 1: Item-based CF recommendation system with Pearson similarity
- Case 2: Model-based recommendation system
- Case 3: Hybrid recommendation system
Item-based CF recommendation system
Please strictly follow the slides to implement an item-based recommendation
system with Pearson similarity.
Model-based recommendation system
You need to use XGBregressor(a regressor based on the decision tree) to train
a model. You need to use this API
https://xgboost.readthedocs.io/en/latest/python/python_api.html
, the
XGBRegressor inside the package xgboost.
Please choose your own features from the provided extra datasets and you can
think about it with customer thinking. For example, the average stars rated by
a user and the number of reviews most likely influence the prediction result.
You need to select other features and train a model based on that. Use the
validation dataset to validate your result and remember don’t include it into
your training data.
Hybrid recommendation system.
Now that you have the results from previous models, you will need to choose a
way from the slides to combine them together and design a better hybrid
recommendation system.
Here are two examples of hybrid systems:
Example 1
You can combine them together as a weighted average, which means.
The key idea is: the CF focuses on the neighbors of the item and the model-
based RS focuses on the user and items themselves. Specifically, if the item
has a smaller number of neighbors, then the weight of the CF should be
smaller. Meanwhile, if two restaurants both are 4 stars and while the first
one has 10 reviews, the second one has 1000 reviews, the average star of the
second one is more trustworthy, so the modelbased RS score should weigh more.
You may need to find other features to generate your own weight function to
combine them together.
Example 2:
You can combine them together as a classification problem:
Again, the key idea is: the CF focuses on the neighbors of the item and the
model-based RS focuses on the user and items themselves. As a result, in our
dataset, some item-user pairs are more suitable for the CF while the others
are not. You need to choose some features to classify which model you should
choose for each item-user pair.
If you train a classifier, you are allowed to upload the pre-trained
classifier model named “model.md” to save running time on Vocareum. You can
use pickle library, joblib library or others if you want. Here is an example:
https://scikit-learn.org/stable/modules/model_persistence.html
.
You also need to upload the training script named “train.py” to let us verify
your model.
Some possible features (other features may also work):
- Average stars of a user, average stars of business, the variance of history review of a user or a business.
- Number of reviews of a user or a business.
- Yelp account starting date, number of fans.
- The number of people think a users’ review is useful/funny/cool. Number of compliments (Be careful with these features. For example, sometimes when I visit a horrible restaurant, I will give full stars because
Submission
You need to submit following files on Vocareum with exactly the same name:
- a. Four Python scripts:
- task1.py
- task2_1.py
- task2_2.py
- task2_3.py
- b. [OPTIONAL] hw3.jar and Four Scala scripts:
- task1.scala
- task2_1.scala
- task2_2.scala
- task2_3.scala