This set of four big data assignments covers the use of Hadoop, Spark, and Pig, and ends with setting up environments on AWS and Azure to run the computations.
Task 1: Analyzing a Graph with Hadoop/Java
Writing your first simple Hadoop program
Imagine that your boss gives you a large dataset which contains an entire
email communication network from a popular social network site. The network is
organized as a directed graph where each node represents an email address and
the edge between two nodes (e.g., Address A and Address B) has a weight
stating how many times A wrote to B. The boss is very interested in finding
out the people most frequently contacted by others. Your task is to write a
MapReduce program in Java to report the largest weight among all the weighted
inbound edges for each node in the graph.
First, go over the Hadoop word count tutorial to get familiar with Hadoop and
some Java basics. You will be able to complete this task with only some
knowledge about Java. You should have already loaded two graph files into the
HDFS file system in your VM. Each file stores a list of
edges as tab-separated-values. Each line represents a single edge consisting
of three columns: (source node ID, target node ID, edge weight), each of which
is separated by a tab (\t). Node IDs are positive integers, and weights are
also positive integers. Edges are ordered randomly.
src tgt weight
117 51 1
194 51 1
299 51 3
230 151 51
194 151 79
51 130 10
Your code should accept two arguments upon running. The first argument
(args[0]) will be a path for the input graph file on HDFS (e.g.,
/user/cse6242/graph1.tsv), and the second argument (args[1]) will be a path
for output directory on HDFS (e.g., /user/cse6242/task1output1). The default
output mechanism of Hadoop will create multiple files on the output directory
such as part-00000, part-00001, which will be merged and downloaded to a local
directory by the supplied run script. Please use the run scripts for your
convenience.
The format of the output should be such that each line represents a node ID
and the largest weight among all its inbound edges. The ID and the largest
weight must be separated by a tab (\t). Lines do not need to be sorted. Please
exclude nodes that do not have incoming edges (e.g., those email addresses
that never get contacted by anybody).
For the toy graph above, the output is as follows.
51 3
151 79
130 10
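The map and reduce steps for this task can be sketched in plain Java without any Hadoop dependency. This is only an illustration under stated assumptions, not the skeleton code: the class name `MaxInboundWeightSketch` and the methods `map`, `reduce`, and `run` are hypothetical, and the in-memory grouping in `run` stands in for the shuffle that Hadoop performs between the two phases.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of the map and reduce steps (no Hadoop classes involved).
class MaxInboundWeightSketch {

    // Map step: one "src\ttgt\tweight" line becomes a (target, weight) pair.
    static int[] map(String line) {
        String[] cols = line.split("\t");
        return new int[] { Integer.parseInt(cols[1]), Integer.parseInt(cols[2]) };
    }

    // Reduce step: keep the largest weight among all values seen for one target.
    static int reduce(List<Integer> weights) {
        int max = Integer.MIN_VALUE;
        for (int w : weights) {
            max = Math.max(max, w);
        }
        return max;
    }

    // Driver: groups the mapper output by key, standing in for Hadoop's shuffle.
    static Map<Integer, Integer> run(List<String> lines) {
        Map<Integer, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            int[] kv = map(line);
            grouped.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        Map<Integer, Integer> result = new TreeMap<>();
        for (Map.Entry<Integer, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // The toy graph from the text, one tab-separated edge per line.
        List<String> toy = List.of("117\t51\t1", "194\t51\t1", "299\t51\t3",
                                   "230\t151\t51", "194\t151\t79", "51\t130\t10");
        run(toy).forEach((node, max) -> System.out.println(node + "\t" + max));
    }
}
```

Because only target IDs are ever used as keys, nodes with no incoming edges never reach the reduce step, which is how the exclusion rule is satisfied in this sketch.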
Test your program on graph1.tsv and graph2.tsv. To demonstrate how your
MapReduce procedure works, use the inline example above to trace the input and
output of your map and reduce functions. That is, given the above graph as the
input, describe the input and output of your map and reduce function(s) and
how the functions transform/process the data (provide examples whenever
appropriate). Write down your answers in description.pdf. You are welcome to
explain your answers using a combination of text and images.
Designing a MapReduce algorithm (and thinking in MapReduce)
Design a MapReduce algorithm that accomplishes the following task: for each
node i in a directed graph G, find that node’s in-neighbors’ in-neighbors.
Node u is considered to be an in-neighbor of node v if there is a directed
edge pointing from node u to node v. In other words, your task is to find every
“2-hop” neighbor of every node i in the graph G, where such a neighbor is
connected by at least one directed path of length 2 that reaches node i.
NOTE: You only need to submit pseudocode, a brief explanation of your
algorithm, and a trace of the input and output of your map and reduce functions
for the graph given below. No coding is required.
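One possible way to think about the 2-hop task is to key every edge by both of its endpoints, so that each "middle" node sees its in-neighbors and its out-neighbors together. The plain-Java sketch below simulates that idea in memory; it is one design among several, not a required answer. The class name `TwoHopInNeighborsSketch`, the method `run`, and the small graph used in the usage note are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// In-memory simulation of a two-stage MapReduce design (no Hadoop classes involved).
class TwoHopInNeighborsSketch {

    // edges[i] = {u, v} means there is a directed edge u -> v.
    static Map<Integer, Set<Integer>> run(int[][] edges) {
        // Stage 1 map + shuffle: key each edge by both endpoints, so a middle
        // node m receives its in-neighbors (ins) and its out-neighbors (outs).
        Map<Integer, List<Integer>> ins = new TreeMap<>();
        Map<Integer, List<Integer>> outs = new TreeMap<>();
        for (int[] e : edges) {
            ins.computeIfAbsent(e[1], k -> new ArrayList<>()).add(e[0]);
            outs.computeIfAbsent(e[0], k -> new ArrayList<>()).add(e[1]);
        }

        // Stage 1 reduce at middle node m: every path u -> m -> w makes u a
        // 2-hop in-neighbor of w. Stage 2 groups those pairs by w; the set
        // removes duplicates that arise from different middle nodes.
        Map<Integer, Set<Integer>> result = new TreeMap<>();
        for (Map.Entry<Integer, List<Integer>> middle : ins.entrySet()) {
            List<Integer> targets = outs.get(middle.getKey());
            if (targets == null) {
                continue; // m has no outgoing edges, so no path passes through it
            }
            for (int w : targets) {
                for (int u : middle.getValue()) {
                    result.computeIfAbsent(w, k -> new TreeSet<>()).add(u);
                }
            }
        }
        return result;
    }
}
```

For a hypothetical graph with edges 1 -> 2, 2 -> 3, 4 -> 2, and 3 -> 1, calling `run` reports nodes 1 and 4 as the 2-hop in-neighbors of node 3. Note that this sketch keeps a node as its own 2-hop neighbor when a cycle of length 2 exists; whether that should be filtered out is left open here.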
Task 2: Analyzing a Large Graph with Spark/Scala
Please go over this Spark word count tutorial to get more background about
Spark/Scala.
Goal
Your task is to calculate the gross accumulated node weights for each node in
graph1.tsv and graph2.tsv from edge weights using Spark and Scala. Assume the
graph to be a representation of a network flow where each edge represents the
number of items flowing from source to target. The gross accumulated node
weight for a node is now defined as the number of items produced/consumed by
the node.
When loading the edges, parse the edge weights using the toInt method and
filter out (ignore) all edges whose edge weights equal 1 i.e., only consider
edges whose edge weights do not equal 1.
Your Scala program should handle the same two arguments as in Task 1 for input
and output from the console, and should generate output in the same format
in the supplied output directory (a tab-separated file). Please note that
the default Spark saveAsTextFile method uses a saving format that is different
from Hadoop’s, so you need to format the result before saving to file (Tip:
use map and mkString). The result doesn’t need to be sorted.
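The parsing, filtering, and formatting steps can be illustrated independently of Spark. The sketch below is written in plain Java rather than Scala, so it only mirrors the logic: `Integer.parseInt` plays the role of `toInt`, and the string concatenation in `formatLine` plays the role of `mkString("\t")`. The class and method names are hypothetical, and the sketch deliberately stops short of computing the node weights themselves.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration of the load, filter, and format steps (no Spark involved).
class EdgeFilterSketch {

    // Parses "src\ttgt\tweight" lines and drops every edge whose weight equals 1.
    static List<int[]> parseAndFilter(List<String> lines) {
        List<int[]> kept = new ArrayList<>();
        for (String line : lines) {
            String[] cols = line.split("\t");
            int weight = Integer.parseInt(cols[2]);
            if (weight != 1) {
                kept.add(new int[] {
                    Integer.parseInt(cols[0]), Integer.parseInt(cols[1]), weight });
            }
        }
        return kept;
    }

    // Formats one (node, value) result as a tab-separated output line,
    // instead of relying on a default tuple-style string representation.
    static String formatLine(int node, int value) {
        return node + "\t" + value;
    }
}
```

Applied to the toy graph from Task 1, `parseAndFilter` would drop the two edges with weight 1 and keep the remaining four.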
Task 3: Analyzing a Large Amount of Data with Pig on AWS
You will try out PIG ( http://pig.apache.org )
for processing n-gram data on Amazon Web Services (AWS). This is a fairly
simple task, and in practice you may be able to tackle this using commodity
computers (e.g., consumer-grade laptops or desktops). However, we would like
you to use this exercise to learn and solve it using distributed computing on
Amazon EC2, and gain experience (very helpful for your future career in
research or industry), so you are prepared to tackle more complex problems.
The services you will primarily be using are Amazon S3 storage, Amazon Elastic
Compute Cloud (EC2) virtual servers in the cloud, and Amazon Elastic
MapReduce (EMR) managed Hadoop framework.
This task will ideally use up only a very small fraction of your $100
credit. AWS allows you to use up to 20 instances in total (that means 1
master instance and up to 19 core instances) without filling out a “limit
request form”. For this assignment, you should not exceed this quota of 20
instances. You can learn about these instance types, their specs, and pricing
at Instance Types.
Please read the AWS Setup Guidelines provided to set up your AWS account. In
this task, you will use subsets of the Google books n-grams dataset (full
dataset for reference), on which you will perform some analysis. An ‘n-gram’
is a phrase with n words; the full n-gram dataset lists n-grams present in the
books on books.google.com along with some statistics.
You will perform your analysis on two custom datasets, extracted from the
Google books bigrams (2-grams), that we have prepared for you: a small one and
a large one. To help you evaluate the correctness of your output, we have
uploaded the output for the small dataset on T-Square.
VERY IMPORTANT: Both these datasets are in the US-Standard (US-East)
region. Using machines in other regions for computation would incur data
transfer charges. Hence, set your region to US East (N. Virginia) at the
beginning (not Oregon, which is the default). This is extremely important;
otherwise your code may not work and you may be charged extra.
Goal
For each unique bigram, compute its average number of appearances per book,
with at least 50 occurrences for each recorded year. For the above example,
the results will be:
I am (342 + 211) / (90 + 10) = 5.53
very cool (500 + 3210 + 9994) / (10 + 1000 + 3020) = 3.40049628
Output the 10 bigrams having the highest average number of appearances per
book along with their corresponding averages, in tab-separated format, sorted
in descending order, with at least 50 occurrences for each recorded year. If
multiple bigrams have the same average, order them alphabetically. For the
example above, the output will be:
I am 5.53
very cool 3.40049628
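To sanity-check a Pig script's output on a small sample, the same computation can be sketched in plain Java. Several things here are assumptions rather than facts from the assignment: the row layout (bigram, year, occurrences, books), the reading of the "at least 50 occurrences" rule as a per-row filter, and the names `BigramAverageSketch` and `topAverages`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java check of the average-appearances-per-book computation (no Pig involved).
class BigramAverageSketch {

    // Each row is {bigram, year, occurrences, books}; this column layout is an assumption.
    static List<Map.Entry<String, Double>> topAverages(
            List<String[]> rows, int minOccurrences, int limit) {
        Map<String, long[]> totals = new HashMap<>(); // bigram -> {occurrences, books}
        for (String[] row : rows) {
            long occurrences = Long.parseLong(row[2]);
            long books = Long.parseLong(row[3]);
            if (occurrences < minOccurrences) {
                continue; // assumed reading of the "at least 50 occurrences" rule
            }
            long[] t = totals.computeIfAbsent(row[0], k -> new long[2]);
            t[0] += occurrences;
            t[1] += books;
        }

        List<Map.Entry<String, Double>> averages = new ArrayList<>();
        for (Map.Entry<String, long[]> e : totals.entrySet()) {
            averages.add(Map.entry(e.getKey(), (double) e.getValue()[0] / e.getValue()[1]));
        }

        // Highest average first; ties are broken alphabetically by bigram.
        averages.sort((a, b) -> {
            int byAverage = Double.compare(b.getValue(), a.getValue());
            return byAverage != 0 ? byAverage : a.getKey().compareTo(b.getKey());
        });
        return averages.subList(0, Math.min(limit, averages.size()));
    }
}
```

Feeding in rows that carry the occurrence and book counts from the worked example (with any placeholder years) and calling `topAverages(rows, 50, 10)` should rank "I am" ahead of "very cool", matching the listing above.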
You will solve this problem by writing a PIG script on Amazon EC2 and saving the
output.
You can use the interactive PIG shell provided by EMR to perform this task
from the command line (grunt). In this case, copy the commands you used for
this task into a single file to serve as the PIG script, and save the output
from the command line into a separate file. Please see this for how to use the PIG
shell. Alternatively, you can upload the script and create a task on your cluster.
Task 4: Analyzing a Large Graph using the Hadoop service on Microsoft Azure
Goal
Your task is to write a MapReduce program to calculate the degree distribution
of a graph. Note that this task shares some similarities with Task 1 (e.g.,
both are analyzing graphs). Task 1 can be completed using your own computer.
This task is to be completed using Azure. We recommend that you first complete
Task 1.
You will use the data files small.tsv (75MB) and large.tsv (3GB) for this
question. Each file stores a list of edges as tab-separated-values. Each line
represents a single edge consisting of two columns: (Node A, Node B), each of
which is separated by a tab. Node IDs are positive integers and the rows are
already sorted by Node A.
src tgt
51 130
51 194
51 299
130 200
151 230
151 194
Your code should accept two arguments upon running. The first argument
(args[0]) will be a path for the input graph file, and the second argument
(args[1]) will be a path for output directory. The default output mechanism of
Hadoop will create multiple files on the output directory such as part-00000,
part-00001, which will have to be merged and downloaded to a local directory.
The format of the output should be as follows. Each line represents the degree
and its frequency. The degree and the frequency of the degree must be
separated by a tab (\t), and lines don’t have to be sorted. The following
example result is computed based on the toy graph above.
Hint: One way of doing it is to run a MapReduce procedure twice: first to
find the degree of each node, and second to calculate the frequency of
each degree. You will have to make appropriate changes in the skeleton code
for this.
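The two-pass idea from the hint can be sketched in plain Java, with each method standing in for one MapReduce job. This is a rough illustration, not the skeleton code: the names `DegreeDistributionSketch`, `degrees`, and `distribution` are hypothetical, and the sketch assumes that a node's degree counts both its incoming and outgoing edges, which the text does not state explicitly.

```java
import java.util.Map;
import java.util.TreeMap;

// In-memory simulation of two chained MapReduce jobs (no Hadoop classes involved).
class DegreeDistributionSketch {

    // Pass 1: count how many edges touch each node.
    // Assumption: degree = in-degree + out-degree.
    static Map<Integer, Integer> degrees(int[][] edges) {
        Map<Integer, Integer> degree = new TreeMap<>();
        for (int[] e : edges) {
            degree.merge(e[0], 1, Integer::sum); // edge leaves the source node
            degree.merge(e[1], 1, Integer::sum); // edge enters the target node
        }
        return degree;
    }

    // Pass 2: count how many nodes share each degree value.
    static Map<Integer, Integer> distribution(Map<Integer, Integer> degree) {
        Map<Integer, Integer> frequency = new TreeMap<>();
        for (int d : degree.values()) {
            frequency.merge(d, 1, Integer::sum);
        }
        return frequency;
    }
}
```

In a real job chain, the output directory of the first job would serve as the input path of the second, and each entry of the final map would be written as one tab-separated line.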