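A MongoDB assignment whose focus is closer to data analysis than to MongoDB usage as such. The data set is large; implementing the required queries is sufficient.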
Introduction
In this assignment, you will show that you can work with different NoSQL
systems for data persistence and understand the strengths and weaknesses of
each system with respect to certain workload features. You are asked to work
with a music data set and a list of target queries. You will design a data
schema for MongoDB, Neo4j and HBase respectively, based on the features of the
data set and the given query workload. You will show that your design can
support the target queries by loading the data into each system following your
schema and running queries against the data.
Data set
The data that you will use is the Last.fm (http://www.last.fm) data set
released in HetRec 2011 (http://ir.ii.uam.es/hetrec2011/). You can view the
details of the data set and download it from
http://grouplens.org/datasets/hetrec-2011/.
The data set contains information about “social networking, tagging, and music
artist listening information from a set of 2k users from Last.fm online music
system”. The data is organized as relational tables and is stored in several
text files, each corresponding to a table. All files are of tab separated
format. Unique IDs are assigned to artists, users and tags for easy
referencing across files.
Basic information about artists is stored in the file artists.dat. Each line
contains four columns: id, name of the artist, a url pointing to the
description of the artist and a pictureURL pointing to a picture of the
artist. The tag data is stored in another file, tags.dat, with only two
columns: the tagID and the actual tagValue. The data set does not contain any
personal information about users, hence there is no separate file for users.
All other files in the data set contain information about relationships.
The file user_artists.dat stores the listening count per user per artist. The
file user_friends.dat stores the friend relations between users. The tag
assignments of artists by users are stored in two files. The only difference
between the files is the timestamp format. The file user_taggedartists.dat
stores the day, month and year in separate columns, while
user_taggedartists-timestamps.dat stores the unix timestamp in a single
column. You only need to use one of the files for tag information in this
assignment.
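The tab-separated layout described above can be read with a few lines of code. The following is a minimal sketch, assuming each file starts with a header line naming its columns; the column names userID, artistID and weight and the sample values are illustrative assumptions, not taken from this description.

```python
import csv
import io


def read_tsv(handle):
    """Yield one dict per data row, keyed by the column names in the header line."""
    # QUOTE_NONE keeps quote characters inside artist names as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield row


# In-memory stand-in for user_artists.dat; the values are made up.
sample = io.StringIO("userID\tartistID\tweight\n2\t51\t13883\n2\t52\t11690\n")
rows = list(read_tsv(sample))
print(rows[0]["artistID"], int(rows[0]["weight"]))
```

For a real file, the same function can be given a handle from open(path, newline="", encoding="utf-8"). All values arrive as strings, so ids and counts need an explicit int() conversion before loading.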
Target Queries
Simple query
- given a user id, find all artists the user's friends listen to.
- given an artist name, find the 10 most recent tags that have been assigned to it.
- given an artist name, find the top 10 users based on their respective listening counts of this artist. Display both the user id and the listening count.
- given a user id, find the 10 most recent artists the user has assigned tags to.
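Before writing these queries for a particular store, it can help to have a plain in-memory reference whose answers the database results can be checked against. The sketch below covers the first and third simple queries; all sample rows, names and helper functions are hypothetical.

```python
from collections import defaultdict

# Made-up sample rows: (userID, friendID) and (userID, artistID, listening count).
friends = [(1, 2), (1, 3), (2, 1), (3, 1)]
listens = [(1, 10, 5), (2, 10, 40), (2, 11, 7), (3, 12, 90), (3, 10, 15)]
artist_names = {10: "A", 11: "B", 12: "C"}

friends_of = defaultdict(set)      # userID -> set of friend ids
for user, friend in friends:
    friends_of[user].add(friend)

artists_of = defaultdict(dict)     # userID -> {artistID: listening count}
for user, artist, count in listens:
    artists_of[user][artist] = count


def artists_of_friends(user_id):
    """Simple query 1: names of all artists listened to by at least one friend."""
    found = set()
    for friend in friends_of[user_id]:
        found.update(artists_of[friend])
    return sorted(artist_names[a] for a in found)


def top_listeners(artist_name, limit=10):
    """Simple query 3: (userID, listening count) pairs, highest count first."""
    ids = [a for a, name in artist_names.items() if name == artist_name]
    pairs = [(u, c[a]) for a in ids for u, c in artists_of.items() if a in c]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:limit]


print(artists_of_friends(1))
print(top_listeners("A"))
```

A reference like this is only practical on a small subset of the data, but it gives a quick correctness check for each system's query.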
Complex queries
- find the top 5 artists ranked by the number of users listening to them.
- given an artist name, find the top 20 tags assigned to it. The tags are ranked by the number of times each has been assigned to this artist.
- given a user id, find the top 5 artists listened to by the user's friends but not by the user. We rank artists by the sum of the friends' listening counts of the artist.
- given an artist name, find the top 5 similar artists. Here similarity between a pair of artists is defined by the number of unique users that have listened to both. The higher the number, the more similar the two artists are.
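The last two complex queries define their ranking precisely, so they can be sketched independently of any database. The following is an illustrative in-memory version, using artist ids instead of names for brevity; the sample rows and function names are assumptions.

```python
from collections import Counter, defaultdict

# Made-up sample rows: (userID, artistID, listening count).
listens = [(1, 10, 5), (2, 10, 40), (2, 11, 7), (3, 12, 90),
           (3, 11, 15), (4, 10, 1), (4, 11, 2)]
friends_of = {1: {2, 3}}           # userID -> set of friend ids

listeners = defaultdict(set)       # artistID -> users who listened to it
counts = defaultdict(dict)         # userID -> {artistID: listening count}
for user, artist, count in listens:
    listeners[artist].add(user)
    counts[user][artist] = count


def similar_artists(artist_id, limit=5):
    """Rank other artists by the number of unique users shared with artist_id."""
    target = listeners[artist_id]
    overlap = Counter()
    for other, users in listeners.items():
        shared = len(target & users)
        if other != artist_id and shared:
            overlap[other] = shared
    return overlap.most_common(limit)


def friend_recommendations(user_id, limit=5):
    """Rank artists the user has not listened to by the friends' summed counts."""
    own = set(counts.get(user_id, {}))
    totals = Counter()
    for friend in friends_of.get(user_id, ()):
        for artist, count in counts.get(friend, {}).items():
            if artist not in own:
                totals[artist] += count
    return totals.most_common(limit)


print(similar_artists(10))
print(friend_recommendations(1))
```

The similarity query is a set intersection per candidate artist, which is why a schema that keeps the listener set close to each artist (or a graph traversal of length two) tends to suit it.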
Tasks
Your tasks include the following.
Schema Design for each system
You should provide three schema design versions. For MongoDB and Neo4j, your
schema should support both the simple queries and the complex queries. For
HBase, your schema only needs to support the simple queries. For each schema
version, make sure you utilize features provided by the storage system such as
indexing, aggregation, ordering, filtering and so on. Please note that your
schema may deviate a lot from the relational structure of the original data
set. You can discard IDs if you find they are not useful. You can duplicate
data if you find that helps with certain queries. You will not get any mark if
you present a schema that is an exact copy of the relational structure in the
original data set.
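As one illustration of deviating from the relational structure, the sketch below folds listening counts and tag assignments into a single document per artist, roughly the shape a MongoDB collection could hold. This is only one possible design under assumed field names (listeners, tags, ts); the sample rows are made up, and very large embedded arrays would need to be weighed against MongoDB's document size limit.

```python
from collections import defaultdict

# Made-up relational rows.
artists = [(10, "A", "http://example.org/a")]          # (id, name, url)
tags = {1: "rock", 2: "pop"}                           # tagID -> tagValue
user_artists = [(3, 10, 15), (2, 10, 40)]              # (userID, artistID, count)
tag_assignments = [(2, 10, 1, 1238536800),             # (userID, artistID,
                   (3, 10, 2, 1241128800)]             #  tagID, unix timestamp)


def build_artist_documents():
    """Return one denormalized document per artist."""
    listeners = defaultdict(list)
    for user, artist, count in user_artists:
        listeners[artist].append({"user": user, "count": count})
    tagged = defaultdict(list)
    for user, artist, tag, ts in tag_assignments:
        # The tag value is duplicated here so no lookup in tags is needed later.
        tagged[artist].append({"user": user, "tag": tags[tag], "ts": ts})
    docs = []
    for artist_id, name, url in artists:
        docs.append({
            "_id": artist_id,
            "name": name,
            "url": url,
            # Pre-sorted so "top N" and "most recent N" read from the front.
            "listeners": sorted(listeners[artist_id], key=lambda l: -l["count"]),
            "tags": sorted(tagged[artist_id], key=lambda t: -t["ts"]),
        })
    return docs


docs = build_artist_documents()
print(docs[0]["name"], docs[0]["tags"][0]["tag"])
```

Here the tagID is discarded and the tag value duplicated, which is the kind of decision the report is expected to highlight and justify.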
Query Design and Implementation
For MongoDB, load the complete data into MongoDB and set up proper indexes
that will be used by the target queries. Design and implement all target
queries. You may implement a query using shell commands, a combination of
JavaScript and shell commands, or as a Python/Java program. For each query (or
sub query), report execution statistics such as which index is used and how
many documents are examined to answer the query.
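The statistics asked for here can be pulled out of the document that MongoDB's explain returns. The sketch below is a hedged example, assuming the classic layout of a find() explain run with executionStats verbosity; the exact layout differs between server versions and for aggregation pipelines, and the sample document and index name name_1 are made up.

```python
def find_index_names(plan):
    """Collect indexName from every IXSCAN stage in a possibly nested plan."""
    names = []
    if plan.get("stage") == "IXSCAN" and "indexName" in plan:
        names.append(plan["indexName"])
    children = list(plan.get("inputStages", []))
    if "inputStage" in plan:
        children.append(plan["inputStage"])
    for child in children:
        names.extend(find_index_names(child))
    return names


def summarize_explain(explain):
    """Reduce an explain document to the figures the report needs."""
    stats = explain.get("executionStats", {})
    return {
        "indexes": find_index_names(explain["queryPlanner"]["winningPlan"]),
        "docs_examined": stats.get("totalDocsExamined"),
        "keys_examined": stats.get("totalKeysExamined"),
        "returned": stats.get("nReturned"),
    }


# Hand-written stand-in for real explain output.
sample_explain = {
    "queryPlanner": {
        "winningPlan": {
            "stage": "FETCH",
            "inputStage": {"stage": "IXSCAN", "indexName": "name_1"},
        }
    },
    "executionStats": {"nReturned": 1, "totalKeysExamined": 1,
                       "totalDocsExamined": 1},
}
summary = summarize_explain(sample_explain)
print(summary)
```

An empty indexes list together with a COLLSCAN stage in the winning plan would indicate that the query scanned the whole collection.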
For Neo4j, load the complete data into Neo4j and set up proper indexes that
will be used by the target queries. Design and implement all target queries.
You may implement a query using Cypher commands or as a Python/Java program.
For each query, report execution statistics such as which index is used, how
many records are examined, and whether or not a full scan is involved.
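In Cypher, prefixing a statement with PROFILE runs it and reports the operators used and their database hits, which is one way to gather these statistics. The sketch below only prepares such a statement for the first simple query; the labels User and Artist, the relationship types FRIEND and LISTENS, and the property names are hypothetical schema choices, and sending the statement to a server is left out.

```python
# Hypothetical graph model: (:User)-[:FRIEND]-(:User)-[:LISTENS]->(:Artist)
FRIENDS_ARTISTS = (
    "MATCH (u:User {id: $user_id})-[:FRIEND]-(f:User)-[:LISTENS]->(a:Artist) "
    "RETURN DISTINCT a.name AS artist"
)


def profiled(query):
    """Prefix a Cypher statement with PROFILE unless it already has one."""
    stripped = query.lstrip()
    if stripped.upper().startswith("PROFILE"):
        return stripped
    return "PROFILE " + stripped


statement = profiled(FRIENDS_ARTISTS)
params = {"user_id": 2}            # values are passed separately, not inlined
print(statement)
```

In the resulting plan, an index seek on the User lookup suggests the index is used, while a label scan or all-nodes scan at that step would be the full scan to report.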
For HBase, load a small subset of the data that can demonstrate the simple
queries. Design and implement only the simple queries. You may implement a
query using shell commands, a combination of Ruby script and shell commands,
or as a Python/Java program. You can use filters as well. For each query,
describe the number of rows, or the subset of columns, that are examined. In
particular, highlight whether a full table scan is involved in answering the
query.
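Because HBase keeps rows sorted by row key bytes, the row key design largely decides whether a query is a short prefix scan or a full table scan. The sketch below shows one common pattern for the "most recent 10" queries, a composite key with a reversed timestamp; the key layout, field widths and sample values are assumptions for illustration, not a prescribed schema.

```python
import struct

MAX_TS = 2**63 - 1                 # largest signed 64-bit value
KEY_FORMAT = ">IQII"               # big-endian: artist, reversed ts, user, tag


def tag_row_key(artist_id, unix_ts, user_id, tag_id):
    """Build a 20-byte key that sorts newest-first within each artist."""
    return struct.pack(KEY_FORMAT, artist_id, MAX_TS - unix_ts, user_id, tag_id)


def artist_prefix(artist_id):
    """Prefix shared by every tag-assignment row of one artist."""
    return struct.pack(">I", artist_id)


# Sorting the byte strings mimics the order in which HBase stores the rows.
keys = sorted([
    tag_row_key(10, 1238536800, 2, 1),
    tag_row_key(10, 1241128800, 3, 2),
    tag_row_key(11, 1240000000, 2, 1),
])
recent_first = [k for k in keys if k.startswith(artist_prefix(10))]
newest_ts = MAX_TS - struct.unpack(KEY_FORMAT, recent_first[0])[1]
print(len(recent_first), newest_ts)
```

With such a key, a scan that starts at the artist prefix and stops after 10 rows touches only the rows it returns, which is the kind of observation the query description should report.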
Deliverable and Submission Guideline
This is a group project; each group can have up to 3 students. Each group
needs to produce the following.
A Written Report
The report should contain five sections. The first section is a brief
introduction of the project. Sections two to four should cover one system
each. Section five should provide a comparison and summary.
There are no marks for section one. It is included to make the report
complete, so please keep it really short.
Sections two to four should each contain the following two subsections.
Schema Design
In this section, describe the schema with respect to the particular system.
Your description should include information at the "table" and "column" level
as well as possible primary keys/row keys and secondary indexes. You should
show sample data based on the schema. For instance, you may show sample
documents of each MongoDB collection, a sample property graph involving all
node types and relationship types for Neo4j, and a sample row in each HBase
table. If certain data are duplicated in different collections/tables,
highlight the duplication and briefly justify your decision.
Query Design and Execution
In this section, describe the implementation of each target query. You may
include the entire shell query command, or show pseudo code. You should also
run sample queries and describe the execution statistics for each sample query
as described in the Tasks section.
In section five, compare the three systems with respect to ease of use, query
performance and schema differences. You can also describe problems encountered
in schema design or query design.
Submit a hard copy of your report together with a signed group assignment
sheet in the Week 10 lab.
System Demo
Each group will demo their implementation in the Week 10 lab. The required
data need to be loaded into the respective systems for the demo. You can run
the demo on your own machine, on a lab machine or on a cloud server. The tutor
will ask you to run a few randomly selected queries to test the
implementation.