Hadoop assignment help: CSE3BDC Big Data Tools Task 1


Introduction

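A big data assignment that requires processing large data sets with MapReduce, Hive and Spark.
Most of the workload lies in setting up the environment, and most of the time goes into importing the data; debugging the code is the other time-consuming part.
Task 1 covers MapReduce and Hive programming.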

Objectives

  1. Gain in-depth experience playing around with big data tools (MapReduce, Hive and Spark).
  2. Solve challenging big data processing tasks by finding highly efficient solutions.
  3. Experience processing three different types of real data:
    a. Standard multi-attribute data (Bank data)
    b. Time series data (Twitter feed data)
    c. Bag of words data.
  4. Practice using programming APIs to find the best API calls to solve your problem. Here are the API descriptions for MapReduce, Hive and Spark (for Spark, look especially under RDD, which has a lot of really useful API calls):
    * [MapReduce] https://hadoop.apache.org/docs/stable/api/
    * [Hive] https://cwiki.apache.org/confluence/display/Hive/LanguageManual
    * [Spark] http://spark.apache.org/docs/latest/api/scala/index.html#package
    * If you are not sure what a Spark API call does, write a small example and try it in the Spark shell.

Expected quality of solutions

a) In general, writing more efficient code (less reading/writing from/into
HDFS and fewer data shuffles) will be rewarded with more marks.
b) All MapReduce code you submit must compile using the command
javac -classpath `hadoop classpath`
on the Cloudera VM you received from us without requiring the installation of
additional components.
c) All MapReduce code you submit should be runnable using
hadoop jar
For task 2C you need to allow the user to specify another two parameters being
the x and y months respectively.
d) Using multiple MapReduce phases may be appropriate for some of the subtasks.
However, if you utilize multiple phases to solve a task, maintain a meaningful
and logically consistent naming scheme for your files. (e.g.: Phase1.java,
Phase2.java, …)
e) For Hive and Spark code submissions, ensure that all commands needed to
accomplish the subtask (i.e. ‘create table’ (Hive), loading data AND
queries!) are in the same file.
f) Scalability of the code is very important, especially in terms of the
memory requirements of the mappers and reducers. For example, a mapper that
outputs the same key for every input (say it takes any string as input and
always outputs the key abc) will send all the data to a single reducer, no
matter how many reducers you set. This effectively means you end up writing
a sequential program, which is completely unacceptable and will result in
zero marks for that subtask.
g) This entire assignment can be done using the Cloudera virtual machines
supplied in the labs and the supplied data sets without running out of memory.
Note that task 3 is especially hard to do without running out of memory, but
it is possible, since we have done it. So it is time to show your skills!
h) Using combiners or local aggregation (inside the mapper) for MapReduce
tasks where appropriate will be rewarded with marks. We will look at the
total amount of data shuffled and award higher marks to solutions that
shuffle less data.
i) Wherever appropriate, use the fact that the data is sorted according to
the intermediate key to reduce the work of the mapper and/or reducer.
j) I am not too fussed about the layout of the output. As long as it looks
similar to the example outputs for each task, that will be good enough. The
idea is not to spend too much time massaging the output into the right format
but instead to spend the time solving problems.
k) For Hive queries, we prefer answers that use fewer tables.
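
The effect of item h) can be illustrated with a small simulation. The sketch below is plain Python, not Hadoop Java code, and the two input splits are made-up data. It only counts how many key-value pairs would cross the shuffle with and without local aggregation inside the mapper.

```python
from collections import Counter, defaultdict


def map_naive(records):
    """Emit one (key, 1) pair per input record."""
    for job in records:
        yield job, 1


def map_with_local_aggregation(records):
    """Aggregate inside the mapper and emit one pair per distinct key."""
    counts = Counter()
    for job in records:
        counts[job] += 1
    yield from counts.items()


def shuffle_and_reduce(mapper_outputs):
    """Sum values per key and count how many pairs were shuffled."""
    totals = defaultdict(int)
    shuffled = 0
    for pairs in mapper_outputs:
        for key, value in pairs:
            shuffled += 1
            totals[key] += value
    return dict(totals), shuffled


# Two hypothetical input splits, one per mapper (illustrative data only).
splits = [
    ["technician", "technician", "management", "technician"],
    ["management", "services", "management", "management"],
]

naive_totals, naive_shuffled = shuffle_and_reduce(
    [list(map_naive(s)) for s in splits]
)
agg_totals, agg_shuffled = shuffle_and_reduce(
    [list(map_with_local_aggregation(s)) for s in splits]
)

print(naive_shuffled, agg_shuffled)  # 8 pairs vs 4 pairs
```

Both variants produce the same totals, but the aggregating mapper shuffles one pair per distinct key per split. The trade-off is that the mapper keeps a dictionary in memory, so this only scales when the number of distinct keys per mapper stays small.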

Do the entire assignment using the Cloudera VM. Do not use AWS.
Tips:

  1. Look at the data files before you begin each task. Try to understand what you are dealing with! You may find the shell commands “cat” and “head” helpful.
  2. For each subtask we give very small example input and the corresponding output in the assignment specifications below. You should create input files that contain the same data as the example input and then see if your solution generates the same output.
  3. In addition to testing the correctness of your code using the very small example input, you should also use the large input files that we provide to test the scalability of your solutions.

Task 1: Analysing Bank Data

We will be doing some analytics on real data from a Portuguese banking
institution. The data is related to their marketing campaign.
The data set used for this task can be found inside the bank directory of the
assignment_datafiles.zip on LMS.
The data has the following attributes:

| Attribute number | Attribute name | Description |
| --- | --- | --- |
| 1 | age | numeric |
| 2 | job | type of job (categorical: “admin.”, “unknown”, “unemployed”, “management”, “housemaid”, “entrepreneur”, “student”, “blue-collar”, “self-employed”, “retired”, “technician”, “services”) |
| 3 | marital | marital status (categorical: “married”, “divorced”, “single”; note: “divorced” means divorced or widowed) |
| 4 | education | (categorical: “unknown”, “secondary”, “primary”, “tertiary”) |
| 5 | default | has credit in default? (binary: “yes”, “no”) |
| 6 | balance | average yearly balance, in euros (numeric) |
| 7 | housing | has housing loan? (binary: “yes”, “no”) |
| 8 | loan | has personal loan? (binary: “yes”, “no”) |
| 9 | contact | contact communication type (categorical: “unknown”, “telephone”, “cellular”) |
| 10 | day | last contact day of the month (numeric) |
| 11 | month | last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”) |
| 12 | duration | last contact duration, in seconds (numeric) |
| 13 | campaign | number of contacts performed during this campaign and for this client (numeric, includes last contact) |
| 14 | pdays | number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted) |
| 15 | previous | number of contacts performed before this campaign and for this client (numeric) |
| 16 | poutcome | outcome of the previous marketing campaign (categorical: “unknown”, “other”, “failure”, “success”) |
| 17 | Term deposit | has the client subscribed a term deposit? (binary: “yes”, “no”) |

Here is a small example of the bank data that we will use to illustrate the
subtasks below (we only list a subset of the attributes in this example, see
the above table for the description of the attributes):

| job | marital | education |
| --- | --- | --- |
| management | Married | tertiary |
| technician | Divorced | secondary |
| entrepreneur | Single | secondary |
| blue-collar | Married | unknown |
| services | Divorced | secondary |
| technician | Married | tertiary |
| Management | Divorced | tertiary |
| technician | Married | primary |

Using the entire bank data set downloaded from LMS please perform the
following tasks. Please note we specify whether you should use [MapReduce] or
[Hive] for each subtask at the beginning of each subtask.
a) [MapReduce] Report the number of clients of each job category. For the
above small example data set you would report the following (output order is
not important for this question):
management 2
technician 3
blue-collar 1
services 1
entrepreneur 1
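
The shape of subtask a) is a standard count-per-key job. The following is a minimal sketch of the map and reduce logic in plain Python, not the Java MapReduce code the assignment asks for. The `;` separator, the `NA` placeholder for the age column, and the lower-casing of the job name are assumptions made for illustration; check the real data file before relying on any of them.

```python
from collections import defaultdict

JOB_INDEX = 1  # attribute 2 (job) in the attribute table, zero-based


def mapper(line, sep=";"):
    """Emit (job, 1) for one input line. The separator is an assumption."""
    fields = line.strip().split(sep)
    # Lower-case so that "Management" and "management" count together,
    # as the expected output for the small example suggests.
    yield fields[JOB_INDEX].strip('"').lower(), 1


def reducer(job, counts):
    """Sum the counts for one job category."""
    return job, sum(counts)


def run(lines):
    groups = defaultdict(list)  # stands in for the shuffle phase
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(k, v) for k, v in groups.items())


# Small example from the text. "NA" is a placeholder for the age column,
# and the remaining attributes are left out for brevity.
lines = [
    "NA;management;married;tertiary",
    "NA;technician;divorced;secondary",
    "NA;entrepreneur;single;secondary",
    "NA;blue-collar;married;unknown",
    "NA;services;divorced;secondary",
    "NA;technician;married;tertiary",
    "NA;Management;divorced;tertiary",
    "NA;technician;married;primary",
]

for job, count in run(lines).items():
    print(job, count)
```

Because the reduce step is a plain sum, the same function could also serve as a combiner, which fits the request to keep the amount of shuffled data low.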

b) [Hive] Report the rounded average yearly balance for all people in each
education category. For the small example data set you would report the
following (output order is not important for this question):
tertiary 1031
secondary 287
primary 10
unknown 1506
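
The query shape for subtask b) is a single GROUP BY with an aggregate. The sketch below runs it with Python's built-in `sqlite3` module as a stand-in for Hive, so the table creation and loading steps are not HiveQL and the table name `bank` is hypothetical. The balances are taken from the example output of subtask e).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bank (education TEXT, balance INTEGER)")
conn.executemany(
    "INSERT INTO bank VALUES (?, ?)",
    [
        ("tertiary", 2143),
        ("secondary", 29),
        ("secondary", 2),
        ("unknown", 1506),
        ("secondary", 829),
        ("tertiary", 929),
        ("tertiary", 22),
        ("primary", 10),
    ],
)

# One pass over one table: group, average, round.
query = """
SELECT education,
       CAST(ROUND(AVG(balance)) AS INTEGER) AS avg_balance
FROM bank
GROUP BY education
"""
result = dict(conn.execute(query).fetchall())

for education, avg_balance in result.items():
    print(education, avg_balance)
```

The SELECT statement uses only `AVG`, `ROUND`, `CAST` and `GROUP BY`, which Hive also provides, but type names and rounding behaviour should be checked against the Hive language manual before reuse.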
c) [Hive] For each marital status report the percentage of people who have a
personal loan. Hint: you may need to use multiple queries or subqueries. For
the small example data set you would report the following (output order is not
important for this question):
Married 50%
Divorced 67%
Single 0%
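
One way to get the percentages of subtask c) without extra tables is conditional aggregation: count the rows with a loan and divide by the group size in the same query. The sketch below is again run through Python's `sqlite3` as a stand-in for Hive, with a hypothetical table name `bank` and the loan values taken from the example output of subtask e). It is one possible approach, not necessarily the one the assignment expects.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bank (marital TEXT, loan TEXT)")
conn.executemany(
    "INSERT INTO bank VALUES (?, ?)",
    [
        ("married", "yes"),
        ("divorced", "yes"),
        ("single", "no"),
        ("married", "no"),
        ("divorced", "yes"),
        ("married", "yes"),
        ("divorced", "no"),
        ("married", "no"),
    ],
)

# 100.0 forces floating-point division before rounding.
query = """
SELECT marital,
       CAST(ROUND(100.0 * SUM(CASE WHEN loan = 'yes' THEN 1 ELSE 0 END)
                  / COUNT(*)) AS INTEGER) AS pct_loan
FROM bank
GROUP BY marital
"""
result = dict(conn.execute(query).fetchall())

for marital, pct in result.items():
    print(f"{marital} {pct}%")
```

A two-step variant (one subquery for the totals, one for the loan holders, then a join) gives the same numbers but reads the table twice.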
d) [MapReduce] Group balance into the following three categories:
a. Low: -infinity to 500
b. Medium: 501 to 1500
c. High: 1501 to +infinity
Report the number of people in each of the above categories. For the small
example data set you would report the following (output order is not important
in this question):
Low 4
Medium 2
High 2
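
The bucketing rule of subtask d) can be sketched as a small function that the mapper would apply before emitting `(category, 1)`. The code is plain Python for illustration, assumes integer balances as in the example, and uses the balances from the example output of subtask e).

```python
from collections import Counter


def balance_category(balance):
    """Map an integer balance to Low, Medium or High."""
    if balance <= 500:
        return "Low"
    if balance <= 1500:
        return "Medium"
    return "High"


def mapper(balance):
    """Emit (category, 1) for one record."""
    yield balance_category(balance), 1


balances = [2143, 29, 2, 1506, 829, 929, 22, 10]

# Counter plays the role of the combiner and reducer: it sums per key.
counts = Counter()
for balance in balances:
    for category, one in mapper(balance):
        counts[category] += one

for category in ("Low", "Medium", "High"):
    print(category, counts[category])
```

With only three possible keys, local aggregation in the mapper reduces the shuffle to at most three pairs per mapper.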
e) [MapReduce] For each education category report a list of people in
descending order of balance. For each person report the following attribute
values: education category, balance, job, marital, loan. Note this subtask can
be done using a single or multiple MapReduce tasks. For the small example data
set you would report the following (output order for education does not matter
but order does matter for the attribute balance):
primary, 10, technician, married, no
secondary, 829, services, divorced, yes
secondary, 29, technician, divorced, yes
secondary, 2, entrepreneur, single, no
tertiary, 2143, management, married, yes
tertiary, 929, technician, married, yes
tertiary, 22, management, divorced, no
unknown, 1506, blue-collar, married, no
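
Subtask e) is a natural fit for the secondary-sort pattern: put both education and balance into the intermediate key, partition on education only, and let the framework's sort deliver records to the reducer already ordered by balance. The sketch below simulates that idea in plain Python. The function names and the CRC32-based partitioner are illustrative choices, not Hadoop API calls; in Hadoop the same roles are played by a custom partitioner and comparators configured on the job.

```python
import zlib
from collections import defaultdict

# (job, marital, education, balance, loan), taken from the example output.
ROWS = [
    ("management", "married", "tertiary", 2143, "yes"),
    ("technician", "divorced", "secondary", 29, "yes"),
    ("entrepreneur", "single", "secondary", 2, "no"),
    ("blue-collar", "married", "unknown", 1506, "no"),
    ("services", "divorced", "secondary", 829, "yes"),
    ("technician", "married", "tertiary", 929, "yes"),
    ("management", "divorced", "tertiary", 22, "no"),
    ("technician", "married", "primary", 10, "no"),
]


def mapper(row):
    """Emit a composite key (education, balance) with the other fields."""
    job, marital, education, balance, loan = row
    yield (education, balance), (job, marital, loan)


def partition(key, num_reducers):
    """Partition on education only, so a category never spans reducers."""
    education, _balance = key
    return zlib.crc32(education.encode()) % num_reducers


def sort_key(key):
    """Sort by education ascending, then balance descending."""
    education, balance = key
    return (education, -balance)


def run(rows, num_reducers=2):
    partitions = defaultdict(list)
    for row in rows:
        for key, value in mapper(row):
            partitions[partition(key, num_reducers)].append((key, value))
    output = []
    for p in sorted(partitions):
        ordered = sorted(partitions[p], key=lambda kv: sort_key(kv[0]))
        # The "reducer" only streams records out; no buffering or sorting.
        for (education, balance), (job, marital, loan) in ordered:
            output.append(f"{education}, {balance}, {job}, {marital}, {loan}")
    return output


for line in run(ROWS, num_reducers=1):
    print(line)
```

The point of the design is that the reducer never has to hold a whole education category in memory to sort it, which matters for the scalability requirement in item f).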


Author: SafePoker
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit SafePoker when reposting.