完善一个叫Scalica的项目,实现要求的新功能,最终需要部署到Google Cloud Engine上。
Goal
The final result we want to achieve is to extend the function of Scalica so
that it can automatically match two users for a dinner date.
In general the project could be divided into three subsections:
- Requesting and storing user input
- Analyzing user input and generating a matching rank
- Presenting matching information to user
Firstly, we will want to augment the functionality in Scalica. We would like
to ask the users some questions. There are 20 rating questions such as “Do you
like spicy food?” “Do you like French food?” each question with a scale from 1
to 5 (1: dislike – 5: Like a lot). The service randomly chooses 3 questions,
and ask the user to indicate their preference. These data will be stored in a
table with userID as the primary key, and each column representing a question
storing the rating value. All these data are considered to be private data. So
we will also have to restrict the viewing or changing of them to unauthorized
users. For example, some pervert who is trying to setup a date for himself and
another person he fancies will not be able to do that.
A is responsible for this part (augmenting Scalica (and its data models) to
store additional info), and he would like to use either Java or Python and
Django for this part. More on database design: to make the database scalable,
we will use horizontal partitioning to partition the user info data. Even
though we only get one physical node for database, we will create some logical
shards so that it will be easy when the amount of data grow and we need to
redistribute them across different physical nodes. Each time a new entry is
created, its ID will contain an identifier marking the logical shard which
it’s in.
Secondly, we will want to use these data to match these people. We will be
using the latent factor model to achieve this. Normally, the latent factor
model will have a data table that store each user and their preference level
to multiple labels. In our ca se, this matrix A will be a matrix of users vs.
all the different labels created by users. These labels are the favourite
books, users’ jobs, their favourite movies and etc. In the actual matrix, each
value of the matrix element is the rating we got from first step corresponding
to (user, label). Since it is unrealistic for users to rate their preference
level for each of the label in the matrix, there will be many blanks in the
matrix. We will then compute the latent factor for each user to the other
labels to substitute these blanks. To compute the latent factors. We would
need to create two matrix out of the original matrix A. One would be the
matrix that represents users vs. flavors. The other one would be flavors vs.
labels. By doing a gradient descent on the product of these two matrices, we
would be able to get the latent factor matrix. We could then use the latent
factor matrix to get labels that were not rated by the user but now it is. For
the completed matrix, each row would be the interpreted preference array for
the user against the labels. We then find out for each user, whose preference
array is most alike. We would do this by computing the dot product of the
difference array between two users by itself. The users with lowest three such
values will be recommended to each other. Depending on the number of users we
have and the frequency they choose to update their preference information, we
will need to update the preference matrix in different windows of time. B will
be responsible for the second part. For the language used here. He is
considering either Python or C++. Former for the reason that it is easy to do
matrix arithmetic with Python. The later been that it might be faster to use
C++. All the above service will be run on a cloud server from Google’s Cloud
Engine. We will be using Redis to store the list of suggestions for it is
better supported than memcached.
Thirdly, we will have two different database, one is for the user information
and the other is for the result of survey. Both of them will be horizontal
partitioned and logical sharded for the expansion in the future. The host of
matching algorithm will analyze the data from the survey database. At the end
of every finished calculation, it will update all changes to the suggestion
list in Redis. The algorithm will run every time when there is a change in
survey database. C is responsible to construct survey database.
Optional: We will need to be presenting these informations to the users. We
will need to guarantee that both user’s personal information is not leaked in
anyway to the other prior to their meeting. First, we will present the
suggested restaurant, the price range, and the dining time to both users in
the match. If both users agree to meet, the system will provide them more
detailed and personal information, such as interests, favorite food and etc.
If one of the user in the match decides to not meet, the system will hide all
the information presented before from both users and suggest a new match to
both users according to the next higher rank in the system. The third part
require techniques in visualizing the data. After achieving basic function
mentioned above, another main focus is beautifying the user interface and the
data visualization. C is responsible for this part, and he would like to use
python and Django for this part. The packages I might use are Bokeh for
visualization with Python.